User agents list of all known AI web crawlers

While I was initially building out my Simple NoAI & NoImageAI plugin for WordPress, a meta directive was all we had. Now that crawlers use robots.txt to determine whether a site has opted in to crawling, more well-behaved crawlers look for rules there. Therefore, it’s important to have a list of crawlers, what they’re reported to be used for, and their user agents in case you want to block them.

This information is up to date as of Oct 1st, 2025, but may change. I will try to keep this list as current as I can. The list is also supplemented by Cloudflare’s list of blockable AI crawlers.
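If you do want to block some of these crawlers, the usual route is a robots.txt group that names each user agent and disallows the whole site. Below is a minimal sketch that builds such a file from a list of tokens. The function name `build_robots_txt` and the short token list are illustrative assumptions, not part of any plugin; swap in whichever user agents from this post you care about.

```python
# A handful of user agent tokens from this post; extend or trim as needed.
AI_USER_AGENTS = [
    "GPTBot",
    "ClaudeBot",
    "CCBot",
    "Google-Extended",
    "PerplexityBot",
    "Bytespider",
]


def build_robots_txt(user_agents, disallow_path="/"):
    """Return robots.txt text with one group covering every listed agent.

    Several User-agent lines may share a single rule set, so one
    Disallow line at the end applies to all of them.
    """
    lines = [f"User-agent: {ua}" for ua in user_agents]
    lines.append(f"Disallow: {disallow_path}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(AI_USER_AGENTS))
```

Paste the printed text into your site’s robots.txt, above or below any existing groups. Keep in mind that this only affects crawlers that choose to honor robots.txt.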

Anthropic-AI

User agent: anthropic-ai
User agent: Claude-Web
User agent: ClaudeBot

Anthropic is, as their website states, “an AI safety and research company based in San Francisco.” TechTarget states of this company “Anthropic — founded by former members of Microsoft-backed AI research lab and vendor OpenAI — introduced Claude 2 on July 11. Claude 2 is the second iteration of Claude, the updated version of its AI assistant based on Anthropic’s research.”

Amazon

User agent: Amazonbot

Apple

User agent: Applebot-Extended
User agent: Applebot

Apple recently announced they’re going all in on AI. Per Apple’s website, “The data crawled by Applebot is used to power various features, such as the search technology that is integrated into many user experiences in Appleʼs ecosystem including Spotlight, Siri, and Safari. Enabling Applebot in robots.txt allows website content to appear in search results for Apple users around the world in these products.”

ByteDance/TikTok

User agent: Bytespider
User agent: TikTokSpider

From Darkvisitors.com, “Bytespider is a web crawler operated by ByteDance, the Chinese owner of TikTok. It’s allegedly used to download training data for its LLMs (Large Language Model) including those powering ChatGPT competitor Doubao.”

CCBot

User agent: CCBot

As stated on their website, CCBot is from Common Crawl, which is a “non-profit foundation founded with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable by anyone.” Data crawled by CCBot from your site may be used by another party for AI data training purposes, so if that is your concern, add CCBot to your disallow list.
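Once CCBot is on your disallow list, it can be worth checking that the rules parse the way you expect. The sketch below uses Python’s standard `urllib.robotparser` for that. The robots.txt text, the helper name `is_blocked`, and the example.com URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block CCBot everywhere, leave everyone else alone.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow:
"""


def is_blocked(robots_txt, user_agent, url="https://example.com/"):
    """Return True if the given robots.txt text disallows user_agent for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("CCBot", "Googlebot"):
        print(agent, "blocked:", is_blocked(ROBOTS_TXT, agent))
```

This only tells you how a standards-following parser reads your file; it says nothing about whether a given crawler actually obeys it.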

ChatGPT User

User agent: ChatGPT-User

A user-initiated variant of GPTBot: it fetches pages on behalf of a ChatGPT user rather than crawling on its own. It may be deprecated by now, with requests using the core GPTBot user agent instead.

Cohere AI

User agent: cohere-ai

From their website, “Cohere provides industry-leading large language models (LLMs) and RAG capabilities tailored to meet the needs of enterprise use cases that solve real-world problems.”

Diffbot

User agent: Diffbot

“Transform the web into data. Diffbot automates web data extraction from any website using AI, computer vision, and machine learning.”

DuckDuckGo

User agent: DuckAssistBot

Facebook/Meta

User agent: FacebookBot
User agent: Meta-ExternalAgent

“FacebookBot crawls public web pages to improve language models for our speech recognition technology.”

GoogleOther

User agent: GoogleOther
User agent: Google-CloudVertexBot

Used by Google to crawl for internal research and development. It’s unknown exactly what this entails, but it is a generic user agent used when no other appropriate user agent is available. Documentation is available from Google.

Google-Extended

User agent: Google-Extended

A newer user agent that feeds data to Bard (their AI chatbot product, since renamed Gemini) and the Vertex AI generative APIs, including future generations of those models. Documentation is available from Google.

GPTBot

User agent: GPTBot

The web crawler of OpenAI, the company behind ChatGPT. Documentation is available from OpenAI.

Huawei

User agent: PetalBot

ImagesiftBot

User agent: ImagesiftBot

From Neil Clark, “ImagesiftBot is billed as a reverse image search tool, but it’s associated with The Hive, a company that produces models for image generation.”

PerplexityBot

User agent: PerplexityBot

“Perplexity is a free AI search engine that provides trusted answers to any question.”

Webz.io

User agent: omgilibot
User agent: omgili

Webz.io’s bot. The data it crawls is sometimes sold to LLM companies.
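robots.txt only works for crawlers that choose to respect it. If you also want to spot these user agents in your own logs or request handling, a case-insensitive substring check against the User-Agent header is one simple approach. This is a rough sketch under that assumption: the `matches_ai_crawler` helper and the shortened token tuple are hypothetical, and a crawler that sends a fake user agent will slip past it.

```python
# A few tokens from the list above; add the rest as needed.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "PerplexityBot")


def matches_ai_crawler(user_agent_header, tokens=AI_CRAWLER_TOKENS):
    """Return the first listed token found in the header, or None.

    Matching is case-insensitive because crawlers embed their token
    inside a longer string such as "Mozilla/5.0 (compatible; GPTBot/1.0)".
    """
    header = (user_agent_header or "").lower()
    for token in tokens:
        if token.lower() in header:
            return token
    return None


if __name__ == "__main__":
    sample = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
    print(matches_ai_crawler(sample))
```

What you do with a match (log it, return a 403, rate limit) depends on your server setup and is left out here.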
