The Invisible Traffic Eating Your Bandwidth
Your server logs tell a story that most site owners never read. Somewhere between Googlebot and the occasional rogue scraper, a new category of traffic has quietly become one of the largest consumers of your bandwidth: AI crawlers.
These bots are not indexing your pages for a search engine results page. They read your content so that AI models can learn from it, cite it, or paraphrase it when answering user queries. Some do this for training, ingesting your text to improve a large language model. Others do it for live retrieval, fetching your pages in real time when a user asks a question your content can answer. The difference between these two purposes has real implications for your business, your traffic, and your competitive position.
The AI crawler landscape is fragmented, poorly documented, and changing fast. GPTBot launched in August 2023 and has already gone through multiple behavioral changes. ClaudeBot appeared shortly after. Bytespider is one of the most aggressive crawlers on the web. And those are just the well-known ones.
This article is a living reference. We document every known AI crawler active in 2026: its user-agent string, parent company, what it does with your content, whether it respects robots.txt, and how aggressively it crawls.
See also: How to Configure robots.txt for AI Crawlers (Without Blocking Google)
Training Bots vs. Retrieval Bots: Why the Distinction Matters
Before listing every crawler, you need to understand the two primary categories. This distinction shapes every decision you make about allowing or blocking AI bots.
Training crawlers collect web content to build or improve AI models. Your text goes into a training dataset, gets processed, and becomes part of the model. Once trained, the model does not need to visit your site again to reference that information. You get no traffic, no attribution, and no link back. Examples: GPTBot (training mode), Google-Extended, CCBot, Bytespider.
Retrieval crawlers fetch your content in real time when a user asks a question. The AI platform sends a bot to your page, reads the relevant section, and includes it (often with a citation) in the generated answer. This is closer to how search engines work, except the user sees a synthesized answer rather than a list of links. Examples: PerplexityBot, OAI-SearchBot, ChatGPT-User.
Some bots serve both purposes. GPTBot crawls for training data and also supports live retrieval for ChatGPT. This dual role makes blocking decisions complicated. Blocking GPTBot protects your content from training use, but it may also reduce your visibility in ChatGPT live answers.
Training bots take your content to build their models. Retrieval bots fetch your content to answer specific questions, sometimes with attribution. The first costs you bandwidth with no return. The second can drive brand awareness. Your robots.txt strategy should reflect this difference.
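As a minimal sketch of that difference in practice, a robots.txt that blocks one training-only crawler while explicitly allowing one retrieval crawler might look like this (a fuller template appears later in this article):

```
# Training-only crawler: blocked
User-agent: CCBot
Disallow: /

# Retrieval crawler with attribution: allowed
User-agent: PerplexityBot
Allow: /
```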
The Complete AI Crawler Reference Table
Here is every major AI crawler active in 2026. For each bot, we list the user-agent string, parent company, primary purpose, robots.txt compliance, and crawl behavior notes.
| Bot Name | User-Agent | Company | Purpose | Respects robots.txt | Notes |
|---|---|---|---|---|---|
| GPTBot | GPTBot | OpenAI | Training + retrieval | Yes | Primary OpenAI crawler. Dual-purpose. |
| OAI-SearchBot | OAI-SearchBot | OpenAI | Live search retrieval | Yes | Powers ChatGPT search feature. |
| ChatGPT-User | ChatGPT-User | OpenAI | User-initiated browsing | Yes | Activates when users ask ChatGPT to visit a URL. |
| ClaudeBot | ClaudeBot | Anthropic | Training | Yes | Primary Anthropic crawler. |
| anthropic-ai | anthropic-ai | Anthropic | Training (legacy) | Yes | Older identifier, still appears in some logs. |
| Google-Extended | Google-Extended | Google | AI training (Gemini) | Yes | Separate from Googlebot. Does not affect search. |
| PerplexityBot | PerplexityBot | Perplexity | Live retrieval | Yes | Fetches pages for real-time answers with citations. |
| Bytespider | Bytespider | ByteDance | Training | Claimed | One of the most aggressive crawlers by volume. |
| CCBot | CCBot/2.0 | Common Crawl | Training dataset | Yes | Open dataset used by many AI companies. |
| Applebot-Extended | Applebot-Extended | Apple | Apple Intelligence | Yes | Separate from regular Applebot. |
| cohere-ai | cohere-ai | Cohere | Training | Yes | Powers Cohere language models. |
| Diffbot | Diffbot | Diffbot | Structured extraction | Partial | Extracts structured data for AI products. |
| FacebookExternalHit | FacebookExternalHit | Meta | Meta AI features | Partial | Also used for link preview generation. |
| ImagesiftBot | ImagesiftBot | Hive | Image analysis | Partial | Processes images for AI classification. |
| Timpibot | Timpibot | Timpi | Decentralized search | Yes | Smaller player, growing presence. |
| Amazonbot | Amazonbot | Amazon | Alexa AI / shopping | Yes | Product and knowledge crawling. |
| YouBot | YouBot | You.com | Search + AI answers | Yes | Powers You.com AI search. |
| PetalBot | PetalBot | Huawei | Search + AI | Yes | Powers Huawei Petal Search. |
This table covers the bots you will most commonly find in server logs. Dozens of smaller, less-documented crawlers also exist from AI startups and research institutions. We focus on the ones with enough traffic volume and identifiable user-agents to act on.
GPTBot: The Bot Everyone Talks About
GPTBot is OpenAI's primary web crawler and the most discussed AI bot since its public disclosure in August 2023.
User-agent string: GPTBot/1.0
What it does: GPTBot serves two functions. First, it crawls the web to collect training data for OpenAI models (GPT-4, GPT-5, and successors). Second, it supports real-time content retrieval for ChatGPT when the model needs fresh information. This dual purpose makes it the hardest bot to make simple allow/block decisions about.
Crawl behavior: GPTBot sends requests from documented IP ranges (published at openai.com). Its crawl rate varies significantly by site. High-authority domains with fresh content see multiple visits per day. Smaller sites may see weekly or less frequent crawls.
robots.txt compliance: GPTBot respects robots.txt Disallow directives. However, blocking GPTBot only prevents future crawling. Content already collected before the block remains in OpenAI datasets.
What to watch for: Since 2024, OpenAI introduced OAI-SearchBot and ChatGPT-User as separate crawlers. If you block GPTBot but not these two, ChatGPT can still access your content through its search and browsing features. For full OpenAI blocking, address all three user-agents.
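A robots.txt fragment covering all three OpenAI user-agents would look like this:

```
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
```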
GPTBot is both a training crawler and a retrieval crawler. Blocking it protects your content from training use, but may also reduce your appearance in ChatGPT live answers. There is no way to allow one function while blocking the other through robots.txt.
ClaudeBot: Anthropic's Training Crawler
ClaudeBot is Anthropic's web crawler, used to collect training data for Claude models.
User-agent string: ClaudeBot/1.0
What it does: ClaudeBot crawls web pages to build training datasets for Claude. Unlike GPTBot, ClaudeBot does not currently have a documented live-retrieval mode. Its primary function is data collection for model training.
Crawl behavior: ClaudeBot is less aggressive than GPTBot or Bytespider. It crawls at moderate rates and primarily targets text-heavy, high-authority pages. It respects crawl-delay directives when present.
robots.txt compliance: ClaudeBot respects robots.txt. Anthropic also honors the anthropic-ai user-agent as a legacy identifier, so existing rules using that string still work.
What to watch for: As Anthropic expands Claude's web-connected features, additional crawlers may appear. Monitor your logs for any new user-agents containing "anthropic" or "claude" strings.
PerplexityBot: The Retrieval Specialist
PerplexityBot is different from most AI crawlers on this list. It is primarily a retrieval bot, not a training bot.
User-agent string: PerplexityBot
What it does: When a user asks Perplexity a question, PerplexityBot fetches relevant web pages in real time, extracts the answer, and presents it with source citations. Your content appears in Perplexity answers with a link back to your site. This is the closest any AI crawler comes to traditional search engine behavior.
Crawl behavior: PerplexityBot crawls on-demand, triggered by user queries rather than scheduled sweeps. It does not maintain a large index. High-visibility pages may get frequent requests; niche pages are only fetched when someone asks a matching question.
robots.txt compliance: PerplexityBot respects robots.txt. Blocking it removes your content from Perplexity answers, which means losing both the citation and the referral traffic.
PerplexityBot is the one AI crawler where blocking has an immediate, visible cost. Unlike training bots, PerplexityBot provides real-time attribution and referral links. Blocking it is blocking a traffic source.
Google-Extended: Separating Search from AI Training
Google-Extended is one of the most important distinctions in the AI crawler world and one of the most frequently misunderstood.
User-agent string: Google-Extended
What it does: Google-Extended crawls your content specifically for AI model training (Gemini). It is completely separate from Googlebot, which handles traditional search indexing and Google AI Overviews.
The critical distinction: Blocking Google-Extended does NOT affect your Google search rankings. It does NOT remove your content from Google AI Overviews. It only prevents your content from being used in Gemini model training. Blocking Googlebot, on the other hand, removes you from Google search entirely. This confusion has caused real damage. Site owners who intended to block AI training have accidentally blocked Googlebot, killing their search visibility overnight.
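To make the safe version of this concrete, the following fragment opts out of Gemini training while leaving Googlebot, and therefore your search visibility, untouched:

```
# Stays in Google Search
User-agent: Googlebot
Allow: /

# Opts out of Gemini model training only
User-agent: Google-Extended
Disallow: /
```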
Crawl behavior: Google-Extended crawls at rates determined by Google infrastructure. You cannot control its frequency through robots.txt beyond allowing or blocking it entirely.
Bytespider: The High-Volume Training Crawler
Bytespider is ByteDance's web crawler and one of the most aggressive bots on the internet by request volume.
User-agent string: Bytespider
What it does: Bytespider collects training data for ByteDance AI products. It crawls at high volumes across millions of sites.
Crawl behavior: Multiple reports from site operators document Bytespider making tens of thousands of requests per day to individual sites. It has been flagged for ignoring crawl-delay directives and consuming disproportionate server resources. Some hosting providers have added Bytespider to default block lists because of bandwidth concerns.
robots.txt compliance: ByteDance states that Bytespider respects robots.txt. In practice, compliance reports are mixed. Some site owners report continued crawling after adding Disallow rules, though this may reflect caching delays rather than intentional non-compliance.
Bytespider is the one crawler where blocking is almost universally recommended. It provides no direct visibility benefit for English-language queries, and its aggressive crawl rate consumes server resources. Block it unless you have a specific reason not to.
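Because compliance reports are mixed, some operators enforce the block at the server level rather than relying on robots.txt. A sketch for nginx (place inside a `server` block; `~*` makes the match case-insensitive):

```
# nginx: refuse Bytespider regardless of robots.txt behavior
if ($http_user_agent ~* "bytespider") {
    return 403;
}
```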
CCBot: The Open Dataset Crawler
CCBot powers Common Crawl, a nonprofit that maintains one of the largest open web archives in the world.
User-agent string: CCBot/2.0
What it does: CCBot crawls the web to build the Common Crawl dataset, a massive open archive that many AI companies use as training data. When reports say AI models were "trained on the internet," Common Crawl is often the primary data source.
Why it matters for AI: Blocking CCBot does not just affect Common Crawl. It reduces the chance of your content appearing in any AI model that uses Common Crawl as a training source, which includes a large number of open-source and commercial models.
robots.txt compliance: CCBot respects robots.txt.
Applebot-Extended: Apple Intelligence
Applebot-Extended is Apple's AI-specific crawler, separate from the standard Applebot used for Siri and Safari suggestions.
User-agent string: Applebot-Extended
What it does: Applebot-Extended collects data for Apple Intelligence features, including on-device AI capabilities in recent iOS and macOS versions.
Crawl behavior: Less aggressive than most other AI crawlers. Apple has historically been conservative with crawl rates.
robots.txt compliance: Respects robots.txt. Apple has clear documentation on allowing or blocking Applebot-Extended independently from standard Applebot.
How to Monitor AI Crawler Activity on Your Site
Knowing which bots exist is step one. Knowing which ones actually visit your site is step two. Here is how to monitor effectively.
Server Log Analysis
Your web server access logs contain a user-agent field for every request. Filter for known AI crawler user-agents:
grep -oE "GPTBot|ClaudeBot|PerplexityBot|Bytespider|CCBot|Google-Extended|OAI-SearchBot" /var/log/access.log | sort | uniq -c | sort -rn
This gives you a count of requests per bot, sorted by frequency. Run it weekly to spot trends and catch new arrivals.
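Request counts are one lens; bandwidth is another. The sketch below sums bytes served to each AI crawler, assuming the common combined log format where the tenth whitespace-separated field is the response size in bytes. Sample log lines are inlined so it runs standalone; in production, pipe your access log into `sum_bot_bytes` instead.

```shell
#!/bin/sh
# Sum bytes served to each AI crawler found in access-log lines on stdin.
# Assumes combined log format (field 10 = response size in bytes).
sum_bot_bytes() {
  awk '
    BEGIN { nb = split("GPTBot ClaudeBot PerplexityBot Bytespider CCBot", bots, " ") }
    {
      for (i = 1; i <= nb; i++)
        if (index($0, bots[i]) > 0) bytes[bots[i]] += $10
    }
    END { for (b in bytes) printf "%s %d\n", b, bytes[b] }
  ' | sort -k2 -rn
}

sum_bot_bytes <<'EOF'
203.0.113.7 - - [01/Feb/2026:10:00:00 +0000] "GET /a HTTP/1.1" 200 5000 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"
203.0.113.8 - - [01/Feb/2026:10:01:00 +0000] "GET /b HTTP/1.1" 200 3000 "-" "Bytespider"
203.0.113.7 - - [01/Feb/2026:10:02:00 +0000] "GET /c HTTP/1.1" 200 2500 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"
EOF
# prints:
# GPTBot 7500
# Bytespider 3000
```

If a single bot accounts for a large share of your total bytes served while delivering no referral traffic, that is a strong candidate for blocking.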
What to Look For
Unexpected volume spikes. If a bot suddenly starts making 10x more requests than usual, investigate. It could mean a crawl configuration change on their side, or it could be a new bot spoofing a known user-agent.
New user-agent strings. AI companies launch new crawlers without always announcing them. Any user-agent you do not recognize that makes repeated requests to content pages (not just robots.txt) is worth investigating.
Blocked bots still crawling. If you added a Disallow rule for a specific bot but still see it in your logs, check whether your CDN is caching the old robots.txt. Also verify the bot is matching the correct user-agent string in your rules.
Crawl-to-visibility ratio. Some bots crawl heavily but produce no visible output. Your content never appears in their platform. This is a sign of pure training crawling with no retrieval benefit.
Monitor your logs monthly at minimum. The AI crawler landscape changes fast enough that a rule set from three months ago may have gaps. New bots appear, existing ones change behavior, and previously well-behaved crawlers occasionally go rogue.
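The "blocked bots still crawling" case above is worth sanity-checking offline: before blaming the bot, confirm your rules actually match its user-agent token. The `is_blocked` helper below is an illustrative sketch, not a spec-complete parser; real robots.txt parsing also handles grouped User-agent lines, wildcards, and path-specific rules.

```shell
#!/bin/sh
# Succeed if the robots.txt on stdin contains "Disallow: /" in a group
# for the given user-agent token. Illustrative sketch only.
is_blocked() {
  awk -v ua="$1" '
    tolower($1) == "user-agent:" { current = $2 }
    tolower($1) == "disallow:" && $2 == "/" && current == ua { found = 1 }
    END { exit !found }
  '
}

if is_blocked "Bytespider" <<'EOF'
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
EOF
then
  echo "Bytespider is blocked"
fi
# prints: Bytespider is blocked
```

Run it against the robots.txt your CDN actually serves (fetch it with curl first), not the copy on your origin server; the two can differ when caching is involved.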
Crawlers You Might Not Know About
Beyond the major players, several lesser-known AI crawlers are worth tracking.
YouBot (You.com): Powers the You.com AI search engine. Moderate crawl rates. Provides citations in search results. Blocking removes you from You.com answers.
PetalBot (Huawei): Crawls for Huawei Petal Search, which has significant market share in regions where Google is unavailable. Relevant if your audience includes users in China or certain parts of Asia.
Amazonbot (Amazon): Crawls for Alexa AI features and Amazon product knowledge. Relevant for e-commerce brands that want to appear in voice assistant answers.
cohere-ai (Cohere): Crawls training data for Cohere's enterprise AI models. Many B2B applications are built on Cohere, so your content may surface in enterprise tools even if you do not interact with Cohere directly.
Diffbot (Diffbot): Extracts structured data from web pages for use in knowledge graphs and AI products. Does not crawl for raw text training but rather for entity extraction and relationship mapping.
A Recommended robots.txt Template
Based on the bots documented above, here is a starting template that maximizes AI visibility while blocking aggressive training-only crawlers:
# Search engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# AI crawlers: allowed (provide visibility or attribution)
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /
# AI crawlers: blocked (aggressive, no direct visibility benefit)
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
# Default
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /private/
Sitemap: https://yoursite.com/sitemap.xml
Customize based on your priorities. If content protection matters more than visibility, move GPTBot and ClaudeBot to the blocked section. If maximum reach is the goal, leave everything open and accept the bandwidth cost.
For detailed configuration guidance, testing steps, and common mistake prevention, see our robots.txt for AI crawlers guide.
What Comes Next
The AI crawler ecosystem is still young. New bots will appear every quarter. Existing ones will change names, merge capabilities, or split into more specialized variants. The companies behind them will announce some changes publicly and make others silently.
Your job is not to memorize every bot. Your job is to have a system: a robots.txt template that reflects your strategy, a monitoring process that catches new arrivals, and a quarterly review cycle that keeps your rules current.
The brands that get this right will control how their content flows into AI systems. The ones that ignore it will have that decision made for them, by bots they never knew existed.
See also: E-E-A-T and AI Visibility: Why Google's Quality Framework Matters for GEO