When I first built my Simple NoAI & NoImageAI plugin for WordPress, a meta directive was all we had. Now that crawlers use robots.txt to determine whether a site allows crawling, more well-behaved crawlers are checking it. Therefore, it’s important to have a list of crawlers, what they’re reported to be used for, and their user agents in case you want to block them.
This information is up to date as of Oct 1st, 2025, but may change. I will try to keep this list as current as I can. The list is also supplemented by Cloudflare’s list of blockable AI crawlers.
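As a rough illustration of what blocking by user agent looks like in practice, here is a minimal Python sketch that builds a robots.txt group disallowing a handful of the crawlers listed below. The function name `build_robots_txt` and the `AI_CRAWLERS` sample are hypothetical and only for illustration; they are not part of the plugin, and you would swap in whichever user agents you actually want to block.

```python
# Hypothetical helper (not part of any plugin): build robots.txt rules
# that disallow the given crawlers from the whole site.
def build_robots_txt(user_agents, disallow_path="/"):
    """Return robots.txt text with one group blocking every listed user agent."""
    # Several User-agent lines may share a single set of rules.
    lines = [f"User-agent: {agent}" for agent in user_agents]
    lines.append(f"Disallow: {disallow_path}")
    return "\n".join(lines) + "\n"


# A small sample of user agents taken from the list below; extend as needed.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]

if __name__ == "__main__":
    print(build_robots_txt(AI_CRAWLERS))
```

Keep in mind that robots.txt is advisory: it only works for crawlers that choose to honor it.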
Anthropic-AI
User agent: anthropic-ai
User agent: Claude-Web
User agent: ClaudeBot
Anthropic is, as their website states, “an AI safety and research company based in San Francisco.” TechTarget states of this company “Anthropic — founded by former members of Microsoft-backed AI research lab and vendor OpenAI — introduced Claude 2 on July 11. Claude 2 is the second iteration of Claude, the updated version of its AI assistant based on Anthropic’s research.”
Amazon
User agent: Amazonbot
Apple
User agent: Applebot-Extended
User agent: Applebot
Apple recently announced they’re going all in on AI. Per Apple’s website, “The data crawled by Applebot is used to power various features, such as the search technology that is integrated into many user experiences in Appleʼs ecosystem including Spotlight, Siri, and Safari. Enabling Applebot in robots.txt allows website content to appear in search results for Apple users around the world in these products.”
ByteDance/TikTok
User agent: Bytespider
User agent: TikTokSpider
From Darkvisitors.com, “Bytespider is a web crawler operated by ByteDance, the Chinese owner of TikTok. It’s allegedly used to download training data for its LLMs (Large Language Model) including those powering ChatGPT competitor Doubao.”
CCBot
User agent: CCBot
As stated on their website, CCBot is from Common Crawl, which is a “non-profit foundation founded with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable by anyone.” Data crawled by CCBot from your site may be used by another party for AI data training purposes, so if that is your concern, add CCBot to your disallow list.
ChatGPT User
User agent: ChatGPT-User
A user agent for requests made on behalf of ChatGPT users, distinct from GPTBot. It may now be deprecated, with such requests using the core GPTBot user agent instead.
Cohere AI
User agent: cohere-ai
From their website, “Cohere provides industry-leading large language models (LLMs) and RAG capabilities tailored to meet the needs of enterprise use cases that solve real-world problems.”
Diffbot
User agent: Diffbot
“Transform the web into data. Diffbot automates web data extraction from any website using AI, computer vision, and machine learning.”
DuckDuckGo
User agent: DuckAssistBot
Facebook/Meta
User agent: FacebookBot
User agent: Meta-ExternalAgent
GoogleOther
User agent: GoogleOther
User agent: Google-CloudVertexBot
Used by Google to crawl for internal research and development. It’s unknown exactly what this entails, but GoogleOther is a generic user agent used when no other appropriate user agent is available. Documentation is available from Google.
Google-Extended
User agent: Google-Extended
A newer user agent which feeds data to Gemini (formerly Bard, Google’s AI chatbot product) and the Vertex AI generative APIs. This also includes future generations of those models. Documentation is available from Google.
GPTBot
User agent: GPTBot
The web crawler from OpenAI, the company behind ChatGPT. Documentation is available from OpenAI.
Huawei
User agent: PetalBot
ImagesiftBot
User agent: ImagesiftBot
From Neil Clark, “ImagesiftBot is billed as a reverse image search tool, but it’s associated with The Hive, a company that produces models for image generation.”
PerplexityBot
User agent: PerplexityBot
“Perplexity is a free AI search engine that provides trusted answers to any question.”
Webz.io
User agent: omgilibot
User agent: omgili
Webz.io’s crawler, whose crawled data is sometimes sold to LLM companies.
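Once you have added some of these user agents to your robots.txt, it can help to confirm the rules do what you expect. The sketch below uses Python’s standard `urllib.robotparser` module to report which agents a given robots.txt blocks. The `ROBOTS_TXT` sample, the `blocked_agents` helper, and the `https://example.com/` URL are assumptions for illustration only; in practice you would load your own site’s file.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (an assumption for this sketch);
# in practice this would be your site's own file.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return the agents that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print(blocked_agents(ROBOTS_TXT, ["GPTBot", "CCBot", "PerplexityBot"]))
```

This only checks what the file says. Whether a particular crawler actually respects those rules is up to the crawler.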
