Quick answer: To block AI crawlers like GPTBot, ChatGPT-User, ClaudeBot, and others from scraping your website, add specific rules to your robots.txt file and use HTTP headers for fine-grained control. Here's the complete list of AI user agents and how to block (or allow) each one.
Why Block (or Allow) AI Crawlers?
AI companies like OpenAI, Anthropic, Google, and others send crawlers to scrape website content for training their language models. Unlike search engine bots (which index your pages for search results), AI crawlers use your content to build commercial products — often without compensation or attribution.
| Reason to block | Reason to allow |
|---|---|
| Protect proprietary content from being used in AI training | Get cited in AI answers (ChatGPT, Perplexity, etc.) |
| Reduce server load from aggressive crawling | Increase brand visibility through AI-generated recommendations |
| Copyright and licensing concerns | Drive referral traffic from AI tools that link to sources |
| Competitive advantage — don't feed competitor AI models | Participate in AI Search (Google AI Overview, Bing Chat) |
Complete List of AI Crawlers (2026)
| User Agent | Company | Purpose | Respects robots.txt |
|---|---|---|---|
| GPTBot | OpenAI | Training data for GPT models | Yes |
| ChatGPT-User | OpenAI | Real-time browsing (ChatGPT with browsing) | Yes |
| OAI-SearchBot | OpenAI | SearchGPT / ChatGPT Search | Yes |
| ClaudeBot | Anthropic | Training data for Claude | Yes |
| anthropic-ai | Anthropic | Legacy crawler token (superseded by ClaudeBot) | Yes |
| Google-Extended | Google | Training Gemini / Bard (robots.txt control token) | Yes |
| Googlebot | Google | Search indexing + AI Overviews | Yes (don't block) |
| PerplexityBot | Perplexity AI | AI search engine | Yes |
| Applebot-Extended | Apple | Apple Intelligence / Siri training | Yes |
| Bytespider | ByteDance | TikTok AI training | Partially |
| CCBot | Common Crawl | Open dataset (used by many AI labs) | Yes |
| FacebookBot | Meta | AI training for Llama | Yes |
| meta-externalagent | Meta | Meta AI browsing | Yes |
| cohere-ai | Cohere | Enterprise AI training | Yes |
| Diffbot | Diffbot | Web data extraction for AI | Partially |
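If you handle requests in application code, the same tokens can be matched against the incoming User-Agent header. A minimal Python sketch (the token list mirrors the table above; matching is substring-based because real User-Agent headers embed these names in longer strings):

```python
import re

# Tokens from the table above. Google-Extended is omitted because it is a
# robots.txt control token only and never appears as a literal user agent.
AI_BOT_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "Applebot-Extended", "Bytespider", "CCBot",
    "FacebookBot", "meta-externalagent", "cohere-ai", "Diffbot",
]

_AI_BOT_RE = re.compile("|".join(re.escape(t) for t in AI_BOT_TOKENS), re.IGNORECASE)

def is_ai_crawler(user_agent: str) -> bool:
    """Return True if a User-Agent header matches a known AI crawler token."""
    return bool(_AI_BOT_RE.search(user_agent))

print(is_ai_crawler("Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"))  # True
print(is_ai_crawler("Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"))                       # False
```

Substring matching deliberately errs on the side of catching version suffixes like `GPTBot/1.2`; tighten the patterns if you see false positives.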
Option 1: Block All AI Crawlers via robots.txt
Add this to your robots.txt file (usually at yoursite.com/robots.txt):
```
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Diffbot
Disallow: /
```

Option 2: Block Training, Allow AI Search
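You can sanity-check a robots.txt rule set before deploying it with Python's standard-library parser. A minimal sketch using one stanza of the configuration above:

```python
from urllib.robotparser import RobotFileParser

# One stanza from the blocking configuration above.
rules = """\
User-agent: GPTBot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# GPTBot is blocked everywhere; unlisted agents fall through to "allowed".
print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True
```

To test your live file instead, call `parser.set_url("https://yoursite.com/robots.txt")` followed by `parser.read()`.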
If you want to appear in AI search results (ChatGPT Search, Perplexity, Google AI Overview) but don't want your content used for training, use this selective configuration:
```
# Block AI TRAINING crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Applebot-Extended governs AI-training use, so it belongs here
User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: FacebookBot
Disallow: /

# ALLOW AI search/browsing bots (so you appear in AI answers)
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```

Option 3: HTTP Headers (More Control)
For page-level control, use the X-Robots-Tag HTTP header. This is useful when you want to block AI crawlers from specific pages (like premium content) while allowing them on others.
In your server config (Nginx example):
```nginx
# Ask AI crawlers not to use premium content for training.
# noai/noimageai are informal signals, not honored by all crawlers.
location /premium/ {
    add_header X-Robots-Tag "noai, noimageai" always;
}
```

Google also supports the nosnippet and max-snippet:0 rules to prevent content from appearing in AI Overviews:
```html
<meta name="robots" content="max-snippet:0">
```

How to Verify Your Blocks Are Working
- Test robots.txt: Visit yoursite.com/robots.txt and verify the rules are present
- Use Google Search Console: the robots.txt report shows whether Google can fetch and parse your file
- Check server logs: search for AI bot user agents in your access logs to see if they're still crawling
- Use PrivacyChecker: our scanner checks your robots.txt configuration and flags AI crawlers that aren't blocked (or that are allowed)
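The log check above is easy to script. A minimal Python sketch using hypothetical combined-format log lines (in practice you would read your real access log, e.g. a file under /var/log/nginx/):

```python
from collections import Counter

# User-agent tokens worth watching for (a subset of the earlier table).
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider"]

# Hypothetical combined-format access log lines, for illustration only.
sample_log = [
    '203.0.113.7 - - [10/Jan/2026:12:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0; GPTBot/1.2"',
    '198.51.100.2 - - [10/Jan/2026:12:00:02 +0000] "GET /a HTTP/1.1" 200 743 "-" "Mozilla/5.0 Chrome/120"',
    '203.0.113.9 - - [10/Jan/2026:12:00:03 +0000] "GET /b HTTP/1.1" 200 101 "-" "CCBot/2.0"',
]

# Count how many requests each AI bot made.
hits = Counter()
for line in sample_log:
    for bot in AI_BOTS:
        if bot in line:
            hits[bot] += 1

print(dict(hits))  # {'GPTBot': 1, 'CCBot': 1}
```

Hits that persist after you publish a robots.txt block indicate a bot that is ignoring it, which is your cue to escalate to server-level blocking.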
The Copyright Angle
As of 2026, several legal developments affect AI crawling:
- EU AI Act (2024): Requires AI providers to document training data sources and respect copyright opt-outs
- EU Copyright Directive (Article 4): Text and data mining for commercial AI requires an opt-out mechanism — robots.txt is the de facto standard
- NYT v. OpenAI (US, filed 2023): A closely watched test of whether large-scale scraping for AI training constitutes copyright infringement
- TDM Reservation Protocol: Some publishers use the tdm-reservation: 1 header to explicitly reserve text/data mining rights
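In Nginx, the TDM headers can be emitted on every response. A minimal sketch (the tdm-policy header is optional, and the policy URL shown is a placeholder):

```nginx
# Advertise a TDM opt-out (TDM Reservation Protocol) on all responses.
add_header tdm-reservation "1" always;

# Optional: point to a machine-readable licensing policy (placeholder URL).
add_header tdm-policy "https://yoursite.com/tdm-policy.json" always;
```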
Frequently Asked Questions
Does blocking GPTBot prevent my site from appearing in ChatGPT?
Not exactly. Blocking GPTBot prevents OpenAI from using your content for training future models. But ChatGPT-User is a separate bot used for real-time browsing — if you allow ChatGPT-User, your content can still appear when users ask ChatGPT to browse the web.
Will blocking AI crawlers hurt my Google SEO ranking?
No. Blocking Google-Extended only prevents Google from using your content for Gemini/AI training. It does not affect Googlebot (the search index crawler), so your search rankings are unaffected. However, blocking Googlebot will remove you from search results entirely; never block Googlebot.
Is robots.txt legally binding?
Not directly, but it's increasingly recognized in court. The EU Copyright Directive recognizes robots.txt as a valid machine-readable opt-out. OpenAI, Anthropic, and Google have all publicly committed to respecting robots.txt. Ignoring a robots.txt block could strengthen a copyright infringement claim.
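Because compliance with robots.txt is ultimately voluntary, some sites also enforce blocks at the web-server level. A minimal Nginx sketch (the bot list is illustrative; extend it to match the table above):

```nginx
# map blocks must be declared in the http {} context.
map $http_user_agent $is_ai_bot {
    default                                               0;
    "~*(GPTBot|ClaudeBot|PerplexityBot|Bytespider|CCBot)" 1;
}

server {
    # ... existing listen / server_name / root directives ...

    # Refuse matched AI crawlers regardless of robots.txt.
    if ($is_ai_bot) {
        return 403;
    }
}
```

Returning 403 (rather than silently serving content) also makes ignored robots.txt rules visible in your access logs.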