
AI Crawlers and robots.txt: How to Control GPTBot, ClaudeBot, and Other User Agents


Quick answer: AI companies like OpenAI, Google, Anthropic, and Meta now crawl websites to train and power their AI models. You can control which AI crawlers access your content using robots.txt, but each crawler uses a different user-agent — and blocking them has trade-offs for your site's visibility in AI-powered search.

What Are AI Crawlers?

AI crawlers are bots that download and index web content for use in large language models (LLMs). Unlike traditional search engine crawlers (Googlebot, Bingbot), AI crawlers may use your content to train models or to generate real-time answers that cite your site.

| Crawler | Company | Purpose | User-Agent |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training data for GPT models | GPTBot |
| ChatGPT-User | OpenAI | Real-time browsing for ChatGPT answers | ChatGPT-User |
| Google-Extended | Google | Training Gemini models | Google-Extended |
| Googlebot | Google | Search indexing + AI Overviews | Googlebot |
| ClaudeBot | Anthropic | Training Claude models | ClaudeBot / anthropic-ai |
| Meta-ExternalAgent | Meta | Training LLaMA models | Meta-ExternalAgent |
| PerplexityBot | Perplexity | Search + citation answers | PerplexityBot |
| Applebot-Extended | Apple | Apple Intelligence features | Applebot-Extended |
| Bytespider | ByteDance | Training for TikTok AI | Bytespider |
| CCBot | Common Crawl | Open dataset used by many AI companies | CCBot |

How to Control AI Crawlers With robots.txt

The robots.txt file at the root of your website tells crawlers which pages they can and cannot access. Here's how to configure it for different strategies:

Strategy 1: Block All AI Training Crawlers

If you want to prevent your content from being used to train AI models but still appear in search results:

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

# Allow real-time AI search (ChatGPT browsing, Perplexity)
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Allow regular search engines
User-agent: Googlebot
Allow: /
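Before deploying rules like these, you can sanity-check them locally with Python's standard-library `urllib.robotparser`. A minimal sketch (the URL and the abbreviated rule set are illustrative):

```python
import urllib.robotparser

# A condensed version of the Strategy 1 rules above.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Googlebot
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# The training crawler is blocked; citation and search crawlers are not.
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))        # False
print(rp.can_fetch("ChatGPT-User", "https://example.com/blog/post"))  # True
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))     # True
```

Note that `robotparser` matches user agents the same way well-behaved crawlers do: by the first `User-agent` group whose token matches, falling back to `*` if none does.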

Strategy 2: Allow Everything (Maximize AI Visibility)

If you want AI models to cite your content (recommended for content marketing and SEO):

# Welcome all crawlers
User-agent: *
Allow: /

# Explicitly welcome AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

Strategy 3: Selective Access (Recommended)

Allow AI browsing bots that cite you as a source while blocking training-only crawlers:

# Block training-only crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Allow citation crawlers (they link back to you)
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Applebot-Extended
Allow: /

The Trade-Off: Training vs. Citation

There's a key distinction between AI crawlers that train models (GPTBot, Google-Extended) and those that browse in real-time to answer user questions (ChatGPT-User, PerplexityBot):

| Type | Examples | Your Content Is… | You Get… |
| --- | --- | --- | --- |
| Training crawlers | GPTBot, Google-Extended, CCBot | Absorbed into the model | No attribution or link |
| Citation crawlers | ChatGPT-User, PerplexityBot | Quoted with a source link | Traffic + brand visibility |

TL;DR: Blocking training crawlers protects your IP. Allowing citation crawlers drives traffic. Most businesses should use Strategy 3 above.
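If you process user-agent strings programmatically, the training/citation split can be encoded as a small lookup. A sketch in Python; the `crawler_type` helper and its groupings are ours, taken from the table above:

```python
# Groupings follow the training-vs-citation table; Googlebot is
# deliberately in neither (it's a search crawler, not an AI crawler).
TRAINING = ("GPTBot", "Google-Extended", "ClaudeBot",
            "Meta-ExternalAgent", "Bytespider", "CCBot")
CITATION = ("ChatGPT-User", "PerplexityBot", "Applebot-Extended")

def crawler_type(user_agent: str) -> str:
    """Classify a raw User-Agent string as 'training', 'citation', or 'unknown'."""
    ua = user_agent.lower()
    for name in TRAINING:
        if name.lower() in ua:
            return "training"
    for name in CITATION:
        if name.lower() in ua:
            return "citation"
    return "unknown"

print(crawler_type("Mozilla/5.0 (compatible; GPTBot/1.1)"))  # training
print(crawler_type("PerplexityBot/1.0"))                     # citation
```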

GDPR and AI Crawling: Legal Considerations

Under GDPR, AI crawling raises questions about data processing purposes and legitimate interest. If an AI crawler processes personal data from your website (e.g., contact pages, team directories), this could constitute data processing under GDPR.

  • Opt-out right: Some DPAs argue that website owners should be able to opt out of AI training. The robots.txt mechanism is currently the de facto opt-out method
  • Copyright: The EU AI Act requires AI providers to respect robots.txt for training data collection (Article 53). Non-compliance could result in penalties
  • Transparency: Under GDPR Article 14, AI companies should inform data subjects (you) about how their data is being processed

How to Monitor AI Crawlers on Your Site

Check your server logs or analytics for these user-agents. You can use PrivacyChecker to scan your site and identify which third-party connections are made, including AI-related services.
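As a starting point, a short script can tally AI-crawler hits in your raw access logs. A sketch assuming any log format that embeds the User-Agent string verbatim (the sample lines are made up):

```python
from collections import Counter

# User-agent substrings for the crawlers listed in the table above.
AI_AGENTS = (
    "GPTBot", "ChatGPT-User", "Google-Extended", "ClaudeBot",
    "anthropic-ai", "Meta-ExternalAgent", "PerplexityBot",
    "Applebot-Extended", "Bytespider", "CCBot",
)

def count_ai_hits(log_lines):
    """Count requests per AI crawler across access-log lines."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for agent in AI_AGENTS:
            if agent.lower() in lowered:
                hits[agent] += 1
                break  # one crawler per request line
    return hits

sample = [
    '1.2.3.4 - - [01/Jan/2025] "GET /blog HTTP/1.1" 200 "-" "GPTBot/1.1"',
    '5.6.7.8 - - [01/Jan/2025] "GET / HTTP/1.1" 200 "-" "PerplexityBot/1.0"',
]
print(count_ai_hits(sample))
```

In practice you would feed this `open("/var/log/nginx/access.log")` (or your server's equivalent) instead of the sample list.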

Frequently Asked Questions

Does blocking GPTBot remove my content from ChatGPT?

Blocking GPTBot prevents your content from being used in future training runs; content absorbed during previous crawls remains in the model. To prevent ChatGPT from browsing your site in real time as well, block ChatGPT-User, but this means ChatGPT won't cite your site in its answers.

Does blocking Google-Extended affect my Google search rankings?

No. Google-Extended is separate from Googlebot. Blocking Google-Extended only prevents Google from using your content to train Gemini. Your search rankings remain unaffected.

Can AI crawlers bypass robots.txt?

Legally, no: the EU AI Act requires providers of general-purpose AI models to honor such opt-outs. Technically, some crawlers may not respect robots.txt. Server-level blocking (verifying published IP ranges, rate limiting) provides stronger enforcement; OpenAI and Google publish their crawler IP ranges for this purpose.
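Once you have fetched a vendor's published CIDR ranges into a list, verifying that a request really comes from that crawler is a membership check with Python's stdlib `ipaddress` module. A sketch; the ranges below are documentation placeholders, not OpenAI's real blocks:

```python
import ipaddress

def ip_in_ranges(ip: str, cidr_ranges) -> bool:
    """Return True if `ip` falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    # Mixed IPv4/IPv6 comparisons simply return False, so no filtering needed.
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidr_ranges)

# Placeholder ranges; substitute the real lists from each vendor's docs.
gptbot_ranges = ["192.0.2.0/24", "2001:db8::/32"]

print(ip_in_ranges("192.0.2.17", gptbot_ranges))   # True
print(ip_in_ranges("203.0.113.9", gptbot_ranges))  # False
```

A request claiming a crawler's user-agent but failing this check is likely a spoofed bot and a candidate for blocking.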

Should I block all AI crawlers?

It depends on your goals. If your business benefits from visibility in AI-powered search (most do), allow citation crawlers. If you're a publisher whose content is being copied without attribution, blocking training crawlers protects your intellectual property.

Check your website now — free

Run a complete privacy audit in under 60 seconds. Get your score, find issues, and learn how to fix them.

Start Free Audit