How to Configure robots.txt for AI Crawlers (Without Blocking Google)

Pleqo Team
9 min read
Technical SEO

The robots.txt File Just Became a Strategic Decision

For most of its history, robots.txt was a housekeeping file. You blocked crawler access to admin pages, staging environments, and duplicate content paths. If you got something wrong, you lost a few pages from Google's index. Annoying, but fixable within a crawl cycle.

That dynamic shifted when AI companies started sending crawlers across the web. GPTBot, ClaudeBot, PerplexityBot, Bytespider, Google-Extended. Each one checks your robots.txt before deciding whether to read your pages. Your robots.txt is no longer just about search engine indexing. It is the front door to AI visibility. Get it wrong, and you silently vanish from AI-generated answers. Get it carelessly wrong, and you block Googlebot in the process.

Your robots.txt is now a business decision, not a technical chore. Every Disallow rule you write determines whether your brand appears in AI answers across ChatGPT, Perplexity, Gemini, Claude, DeepSeek, Grok, and Google AI Overviews, or disappears from them.

The tricky part: AI crawlers and traditional search crawlers use the same access mechanism but serve different purposes. Googlebot indexes pages for search results. GPTBot reads content for model training and real-time retrieval. Google-Extended handles AI training data separately from regular search indexing. Blocking the wrong user-agent has consequences you did not plan for.

This guide covers every major AI crawler user-agent string, shows exact robots.txt configurations for common scenarios, and flags the mistakes that cost sites their visibility.

See also: AI Crawler List 2026: Every Bot That Scrapes Your Site (and What They Do)


What robots.txt Actually Controls (and What It Does Not)

Before writing rules for AI bots, understand the boundaries of what this file can do.

robots.txt is a voluntary protocol. It tells crawlers which paths they should not access. The word "should" matters here. Compliant crawlers read the file and follow the rules. Non-compliant ones ignore it. There is no enforcement mechanism built into the protocol.

For traditional search, this was rarely a problem. Googlebot and Bingbot respect robots.txt reliably. Rogue scrapers have always ignored it, and that was accepted as a cost of being on the open web.

What robots.txt controls

  • Which URL paths a specific crawler can access
  • Which URL paths are off-limits to all crawlers via wildcard rules
  • Sitemap location (informational, not a directive)

What robots.txt does NOT control

  • Whether content already scraped gets removed from training datasets
  • How a bot uses content it collected before your rule existed
  • Crawl rate or frequency (the Crawl-delay directive exists but not all bots honor it)
  • Access from bots that do not identify themselves or spoof their user-agent
  • Your content on third-party sites, social media, or syndicated feeds

robots.txt is forward-looking, not retroactive. If GPTBot crawled your site last month, adding a Disallow today stops future visits. It does not delete what was already collected. For retroactive removal, you need to contact the AI company directly.

This distinction matters. Many site owners block AI crawlers expecting their content to vanish from ChatGPT or Perplexity responses. It will not. The block only prevents new crawl visits going forward.


AI Crawler User-Agents: The Complete Reference

Each AI company uses one or more user-agent strings to identify its crawlers. You need these strings to write targeted robots.txt rules. Here is every major AI crawler active in 2026.

OpenAI

Bot | User-Agent String | Purpose
GPTBot | GPTBot | Training data + live retrieval for ChatGPT
OAI-SearchBot | OAI-SearchBot | Real-time web search for ChatGPT search feature
ChatGPT-User | ChatGPT-User | Browsing mode (user-initiated URL visits)

GPTBot is the primary crawler. OAI-SearchBot handles real-time search queries within ChatGPT. ChatGPT-User activates when someone explicitly asks ChatGPT to browse a specific page. Blocking GPTBot alone does not block all OpenAI access. You need to address all three user-agents separately.
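You can sanity-check this behavior with Python's standard-library robots.txt parser. The sketch below uses a hypothetical robots.txt that blocks only GPTBot, and shows the other two OpenAI user-agents falling through to the permissive wildcard group:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks GPTBot but says nothing
# about OpenAI's other two user-agents.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# GPTBot is blocked, but OAI-SearchBot and ChatGPT-User match
# no named group and inherit the wildcard Allow.
for agent in ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]:
    print(agent, "allowed:", rp.can_fetch(agent, "/article"))
```

This prints `allowed: False` for GPTBot and `allowed: True` for the other two, which is exactly the partial block described above.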

Anthropic

Bot | User-Agent String | Purpose
ClaudeBot | ClaudeBot | Training data collection for Claude models
anthropic-ai | anthropic-ai | Older Anthropic crawler identifier

ClaudeBot is the current primary crawler. The anthropic-ai identifier is older and appears less frequently in logs, but still shows up on some sites.

Google

Bot | User-Agent String | Purpose
Google-Extended | Google-Extended | AI training data for Gemini, separate from search
Googlebot | Googlebot | Traditional search indexing + AI Overviews

This pair is the most misunderstood. Googlebot handles both traditional search indexing and Google AI Overviews. Google-Extended handles AI model training only. Blocking Google-Extended does not affect your search rankings or AI Overviews appearance. Blocking Googlebot kills your entire Google search presence. Know which one you mean.

Perplexity

Bot | User-Agent String | Purpose
PerplexityBot | PerplexityBot | Real-time retrieval for Perplexity answers

PerplexityBot crawls for live retrieval, not bulk training. It fetches pages when a user asks a question that matches your content.

ByteDance

Bot | User-Agent String | Purpose
Bytespider | Bytespider | Training data for ByteDance AI products

Bytespider is one of the most aggressive crawlers on the web by sheer request volume, and it has repeatedly been reported to ignore robots.txt rules. If a Disallow does not slow it down, server-level blocking is the fallback.

Other Notable Bots

Bot | User-Agent String | Purpose
CCBot | CCBot | Common Crawl dataset (used by many AI companies)
Applebot-Extended | Applebot-Extended | Apple Intelligence features
cohere-ai | cohere-ai | Cohere model training
Diffbot | Diffbot | Structured data extraction for AI products
FacebookExternalHit | FacebookExternalHit | Meta AI features
ImagesiftBot | ImagesiftBot | Image analysis for AI systems
Timpibot | Timpibot | Timpi decentralized search engine

For the full breakdown of every bot including IP ranges, crawl frequency patterns, and compliance records, see our complete AI crawler reference.


Three robots.txt Configurations for Common Scenarios

Most sites fall into one of three scenarios. Here is the right robots.txt approach for each.

Scenario 1: Maximum AI Visibility (Recommended Default)

If your goal is to appear in as many AI-generated answers as possible, allow all major AI crawlers. Block only aggressive training-only bots that consume bandwidth without providing attribution.

# Search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# AI crawlers, allowed for visibility
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

# Block aggressive training-only crawlers
User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

# Default rule
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /staging/

Sitemap: https://yoursite.com/sitemap.xml

This opens your content to every AI platform that provides direct brand visibility: ChatGPT, Claude, Perplexity, Gemini, Google AI Overviews, Apple Intelligence. Bulk training crawlers stay off your server.

Scenario 2: Selective AI Access (Retrieval Only, No Training)

You want your content cited in AI answers but not ingested for model training. The line between training and retrieval is blurry for some crawlers, but you can approximate it:

# Search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Retrieval-focused AI bots, allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Training-focused crawlers, blocked
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: *
Allow: /
Disallow: /admin/

Sitemap: https://yoursite.com/sitemap.xml

The trade-off is real: blocking GPTBot may reduce your visibility in ChatGPT over time. OpenAI uses GPTBot for both training and some retrieval. This scenario prioritizes content protection over maximum reach.

Scenario 3: Block All AI Crawlers

Valid for publishers with licensing concerns. Not recommended if you want AI visibility:

# Allow search engines only
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Block all known AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: *
Disallow: /admin/
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

Blocking all AI crawlers does not make your content invisible to AI. Your text may still surface through Common Crawl archives collected before the block, through third-party syndication, cached copies, and social media shares. Full AI invisibility is practically impossible through robots.txt alone.


Five Common Mistakes (and How to Fix Them)

Mistake 1: Wildcard Block That Catches Googlebot

The most damaging and the most common:

# DO NOT DO THIS
User-agent: *
Disallow: /

This blocks every crawler on the internet, including Googlebot. Your site disappears from search results. If you want to block AI crawlers, list them individually by user-agent name. Never use a wildcard Disallow on the root path without explicit Allow rules for the crawlers you need.

How to fix it: Give Googlebot and Bingbot their own User-agent groups with explicit Allow rules. Crawlers follow the most specific group that matches them, so a named group exempts a bot from the wildcard block regardless of where it appears in the file. Better yet, avoid wildcard root blocks entirely and name each bot individually.
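The difference can be demonstrated with Python's standard-library parser. Both file bodies below are hypothetical; the second adds the named Googlebot group that rescues search visibility:

```python
from urllib.robotparser import RobotFileParser

# The damaging version: a wildcard root block with no exemptions.
broken = """\
User-agent: *
Disallow: /
"""

# The corrected version: a named group exempts Googlebot
# while everyone else still hits the wildcard block.
fixed = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

for label, text in [("broken", broken), ("fixed", fixed)]:
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    print(f"{label}: Googlebot allowed = {rp.can_fetch('Googlebot', '/')}")
```

The broken file reports Googlebot as blocked; the fixed file reports it as allowed, with every other crawler still disallowed.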

Mistake 2: Confusing Google-Extended with Googlebot

Google-Extended controls AI training data only. Blocking it does not touch your search rankings or AI Overviews visibility. But some site owners block both Google-Extended and Googlebot, thinking they are being thorough.

Result: their site vanishes from Google search. Completely.

How to fix it: If you want to stay in Google search and AI Overviews but keep content out of Gemini training, block only Google-Extended. Leave Googlebot alone.

Mistake 3: Forgetting OAI-SearchBot

GPTBot gets all the attention, but OAI-SearchBot is a separate user-agent for ChatGPT real-time search. Block GPTBot and leave OAI-SearchBot unaddressed? ChatGPT can still pull from your pages through its search function.

How to fix it: If you want to block all OpenAI access, include rules for GPTBot, OAI-SearchBot, and ChatGPT-User. All three.

Mistake 4: Not Verifying After Deployment

You saved the file and moved on. But did the change take effect? Common failure modes: your CDN caches the old robots.txt for hours. The file has wrong encoding. It deployed to the wrong directory. A redirect loop exists on /robots.txt.

How to fix it: After every change, fetch yoursite.com/robots.txt directly in a browser. Check response headers for cache directives. Confirm in the Google Search Console robots.txt report that Googlebot has fetched the new version without errors. Monitor server logs for 48 hours.

Mistake 5: Treating robots.txt as a Security Layer

robots.txt is not access control. It is a polite request. It does not authenticate crawlers, encrypt content, or prevent any bot from reading your pages if it decides to ignore the file.

How to fix it: For sensitive content, use server-level controls: authentication, IP allowlists, WAF rules, or paywalls. robots.txt handles well-behaved bots. Firewalls handle everything else.



Testing Your Configuration

After writing or updating rules, validate before deploying.

Step 1: Syntax Validation

Use the robots.txt report in Google Search Console. It shows when Googlebot last fetched your file, whether the fetch succeeded, and any rules it could not parse. The report only reflects Google's crawlers, but the syntax errors it surfaces affect all bots.

Step 2: Manual User-Agent Simulation

Use curl to see how your server responds to different bot identifiers:

curl -A "GPTBot" https://yoursite.com/robots.txt
curl -A "ClaudeBot" https://yoursite.com/robots.txt
curl -A "PerplexityBot" https://yoursite.com/robots.txt

The file content is identical regardless of who requests it, but walking through the rules mentally for each user-agent helps you catch logic errors before they cost you visibility.
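That mental walkthrough can be automated. This sketch uses Python's standard-library parser against hypothetical Scenario 2-style rules, plus an expectations table you maintain yourself, and flags any user-agent whose treatment drifts from what you intended:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: retrieval bots allowed, training bots blocked.
robots_txt = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

# What you intend each bot's access to be (True = allowed).
expected = {
    "OAI-SearchBot": True,
    "PerplexityBot": True,
    "GPTBot": False,
    "ClaudeBot": False,
    "Googlebot": True,  # no named group, falls through to *
}

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

for agent, want in expected.items():
    got = rp.can_fetch(agent, "/blog/post")
    status = "OK" if got == want else "MISMATCH"
    print(f"{status}: {agent} allowed={got}")
```

Run it after every robots.txt edit; any MISMATCH line means a rule change had a side effect you did not intend.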

Step 3: Log Monitoring

After deployment, check your server access logs for AI crawler activity. Look for the user-agent strings listed in this article. If you blocked PerplexityBot but still see it hitting your pages 48 hours later, either your CDN is serving a stale robots.txt or the bot is not obeying your rules.

Fields to watch:

  • User-agent string in request headers
  • Requested URL paths (is the bot accessing blocked paths?)
  • HTTP response codes (200, 403, 429?)
  • Request frequency (has it changed since your update?)
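A minimal log-scan sketch along these lines is shown below. The log lines and bot list are made-up examples; in practice you would read your real access log instead:

```python
from collections import Counter

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider", "CCBot"]

# Made-up access-log lines in combined log format, for illustration only.
log_lines = [
    '1.2.3.4 - - [01/Feb/2026:10:00:00 +0000] "GET /blog/post HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '5.6.7.8 - - [01/Feb/2026:10:00:05 +0000] "GET /pricing HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '1.2.3.4 - - [01/Feb/2026:10:00:09 +0000] "GET /docs HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]

# Count requests per AI bot by case-insensitive user-agent substring match.
hits = Counter()
for line in log_lines:
    lowered = line.lower()
    for bot in AI_BOTS:
        if bot.lower() in lowered:
            hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{bot}: {count} request(s)")
```

Zero hits from a bot you allowed can be as informative as continued hits from a bot you blocked: the former suggests the crawler has not discovered you, the latter a stale CDN copy or a non-compliant bot.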

Step 4: Quarterly Review

AI companies launch new crawlers, rename existing ones, and change behavior regularly. Review your robots.txt every quarter. Check the current AI crawler list for new additions. A configuration written in January may have blind spots by June.


The Decision Framework

Not sure which approach fits? Walk through these four questions.

Do you want your brand cited in AI-generated answers? If yes, allow GPTBot, ClaudeBot, PerplexityBot, Google-Extended, OAI-SearchBot, and Applebot-Extended. This is the maximum visibility path and the right default for most brands.

Are you concerned about model training? If yes but you still want AI citations, allow retrieval bots (OAI-SearchBot, ChatGPT-User, PerplexityBot) and block training bots (GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider). Accept that the boundary is imperfect.

Are you a publisher with licensing concerns? Block all AI crawlers. Explore direct licensing agreements with AI companies. OpenAI, Google, and Apple all have publisher partnership programs that compensate content usage under negotiated terms.

Are you unsure? Start with maximum visibility. Monitor for 30 days. Check whether AI platforms cite your brand more often. If the citation value is positive, keep the configuration open. You can tighten rules later. Loosening them is harder because you lose crawl momentum while blocked.

The default position for most brands in 2026: allow AI crawlers, monitor what happens, adjust based on data. Blocking by default means opting out of a distribution channel that grows every quarter while traditional search traffic plateaus.

See also: E-E-A-T and AI Visibility: Why Google's Quality Framework Matters for GEO

Frequently Asked Questions

Do AI crawlers actually respect robots.txt?

Most major AI crawlers respect robots.txt directives. GPTBot, ClaudeBot, and Google-Extended all honor Disallow rules. However, not all AI bots are equally compliant. Monitoring your server logs is the only way to verify actual compliance.

Does blocking GPTBot remove my content from ChatGPT?

Blocking GPTBot prevents OpenAI from crawling your site for future training data and live retrieval. However, content already in the training dataset will remain. The directive is forward-looking: it stops new crawling, not retroactive data removal.

Can I block GPTBot but still appear in Google AI Overviews?

Yes. Google AI Overviews uses Googlebot, which is separate from Google-Extended. You can block GPTBot specifically while keeping Googlebot allowed. This lets your content appear in AI Overviews and traditional search while preventing OpenAI from crawling your pages.

What happens if my robots.txt does not mention an AI crawler?

If your robots.txt does not mention a specific AI crawler, the bot falls back to your general rules. If you have no wildcard Disallow, the bot can crawl everything. For many sites this is fine, since it means maximum AI visibility.

Should I block AI crawlers or allow them?

It depends on your goals. Blocking protects content from model training but removes your brand from AI-generated answers. For most brands, the visibility benefit outweighs the risk. A selective approach works best for those who want both protection and presence.

Written by

Pleqo Team

Pleqo is the AI brand visibility platform that helps businesses monitor, analyze, and improve their presence across 7 AI search engines.
