
How to Block AI Crawlers from Your Website (Complete 2026 Guide)


Quick answer: To block AI crawlers like GPTBot, ChatGPT-User, ClaudeBot, and others from scraping your website, add specific rules to your robots.txt file and use HTTP headers for fine-grained control. Here's the complete list of AI user agents and how to block (or allow) each one.

Why Block (or Allow) AI Crawlers?

AI companies like OpenAI, Anthropic, Google, and others send crawlers to scrape website content for training their language models. Unlike search engine bots (which index your pages for search results), AI crawlers use your content to build commercial products — often without compensation or attribution.

| Reason to block | Reason to allow |
| --- | --- |
| Protect proprietary content from being used in AI training | Get cited in AI answers (ChatGPT, Perplexity, etc.) |
| Reduce server load from aggressive crawling | Increase brand visibility through AI-generated recommendations |
| Copyright and licensing concerns | Drive referral traffic from AI tools that link to sources |
| Competitive advantage: don't feed competitor AI models | Participate in AI search (Google AI Overviews, Bing Chat) |

Complete List of AI Crawlers (2026)

| User agent | Company | Purpose | Respects robots.txt |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training data for GPT models | Yes |
| ChatGPT-User | OpenAI | Real-time browsing (ChatGPT with browsing) | Yes |
| OAI-SearchBot | OpenAI | SearchGPT / ChatGPT Search | Yes |
| ClaudeBot | Anthropic | Training data for Claude | Yes |
| anthropic-ai | Anthropic | Web browsing for Claude | Yes |
| Google-Extended | Google | Training Gemini / Bard | Yes |
| Googlebot | Google | Search indexing + AI Overviews | Yes (don't block) |
| PerplexityBot | Perplexity AI | AI search engine | Yes |
| Applebot-Extended | Apple | Apple Intelligence / Siri | Yes |
| Bytespider | ByteDance | TikTok AI training | Partially |
| CCBot | Common Crawl | Open dataset (used by many AI labs) | Yes |
| FacebookBot | Meta | AI training for Llama | Yes |
| meta-externalagent | Meta | Meta AI browsing | Yes |
| cohere-ai | Cohere | Enterprise AI training | Yes |
| Diffbot | Diffbot | Web data extraction for AI | Partially |

Option 1: Block All AI Crawlers via robots.txt

Add this to your robots.txt file (usually at yoursite.com/robots.txt):

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Diffbot
Disallow: /

Option 2: Block Training, Allow AI Search

If you want to appear in AI search results (ChatGPT Search, Perplexity, Google AI Overview) but don't want your content used for training, use this selective configuration:

# Block AI TRAINING crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: FacebookBot
Disallow: /

# ALLOW AI search/browsing bots (so you appear in AI answers)
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Applebot-Extended
Allow: /
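Before deploying a selective configuration, you can sanity-check it with Python's standard-library robots.txt parser. This sketch uses a condensed version of the rules above (the example.com URL is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# A condensed version of the selective configuration above
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The training crawler should be blocked, the browsing bot allowed
print(rp.can_fetch("GPTBot", "https://example.com/article"))        # False
print(rp.can_fetch("ChatGPT-User", "https://example.com/article"))  # True
```

This catches ordering and typo mistakes (e.g. misspelling a user-agent token) before a crawler ever sees the file.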

Option 3: HTTP Headers (More Control)

For page-level control, use the X-Robots-Tag HTTP header. This is useful when you want to block AI crawlers from specific pages (like premium content) while allowing them on others.

In your server config (Nginx example):

# Signal "no AI use" on premium content
# (noai/noimageai are non-standard directives; support varies by crawler)
location /premium/ {
    add_header X-Robots-Tag "noai, noimageai" always;
}
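Keep in mind that robots.txt and X-Robots-Tag are advisory. If a crawler ignores them, you can refuse its requests outright at the server. A minimal Nginx sketch (user-agent strings can be spoofed, so treat this as a deterrent, not a guarantee):

```nginx
# Return 403 to known AI crawlers (place inside the server block)
if ($http_user_agent ~* "(GPTBot|ClaudeBot|CCBot|PerplexityBot|Bytespider)") {
    return 403;
}
```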

Google also supports the nosnippet and max-snippet:0 meta tags to prevent content from appearing in AI Overviews:

<meta name="robots" content="max-snippet:0">

How to Verify Your Blocks Are Working

  1. Test robots.txt: Visit yoursite.com/robots.txt and verify the rules are present
  2. Use Google Search Console: The robots.txt tester shows whether specific user agents are blocked
  3. Check server logs: Search for AI bot user agents in your access logs to see if they're still crawling
  4. Use PrivacyChecker: Our scanner checks your robots.txt configuration and flags AI crawlers that aren't blocked (or that are allowed)
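Step 3 is easy to script. A small Python sketch that scans access-log lines for the user-agent tokens from the table above (the log lines here are illustrative; in practice, read your real access log):

```python
import re

# User-agent tokens from the crawler table above
AI_BOTS = re.compile(
    r"GPTBot|ClaudeBot|CCBot|PerplexityBot|Bytespider|Google-Extended",
    re.IGNORECASE,
)

# Illustrative log lines; in practice, iterate over /var/log/nginx/access.log
log_lines = [
    '203.0.113.5 - - [10/Jan/2026:12:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "GPTBot/1.2"',
    '198.51.100.7 - - [10/Jan/2026:12:00:02 +0000] "GET /about HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

hits = [line for line in log_lines if AI_BOTS.search(line)]
print(f"AI crawler requests: {len(hits)}")  # AI crawler requests: 1
```

If blocked bots keep appearing in your logs after a robots.txt change, they are either caching the old file or ignoring it, and a server-level block is the next step.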

The Copyright Angle

As of 2026, several legal precedents affect AI crawling:

  • EU AI Act (2024): Requires AI providers to document training data sources and respect copyright opt-outs
  • EU Copyright Directive (Article 4): Text and data mining for commercial AI requires an opt-out mechanism — robots.txt is the de facto standard
  • NYT v. OpenAI (US, filed 2023): A closely watched test of whether large-scale scraping for AI training constitutes copyright infringement
  • TDM Reservation Protocol: Some publishers use the tdm-reservation: 1 header to explicitly reserve text/data mining rights
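If you want to send the TDM reservation signal yourself, it is a one-line response header. A sketch for Nginx (matching the server config examples above):

```nginx
# Reserve text-and-data-mining rights sitewide (TDM Reservation Protocol)
add_header tdm-reservation 1 always;
```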

Frequently Asked Questions

Does blocking GPTBot prevent my site from appearing in ChatGPT?

Not exactly. Blocking GPTBot prevents OpenAI from using your content for training future models. But ChatGPT-User is a separate bot used for real-time browsing — if you allow ChatGPT-User, your content can still appear when users ask ChatGPT to browse the web.

Will blocking AI crawlers hurt my Google SEO ranking?

No. Blocking Google-Extended only prevents Google from using your content for Gemini/AI training. It does not affect Googlebot (the search index crawler), so your search rankings are unaffected. However, blocking Googlebot will remove you from search results entirely — never block Googlebot.

Is robots.txt legally binding?

Not directly, but it's increasingly recognized in court. The EU Copyright Directive recognizes robots.txt as a valid machine-readable opt-out. OpenAI, Anthropic, and Google have all publicly committed to respecting robots.txt. Ignoring a robots.txt block could strengthen a copyright infringement claim.

Check your website now — free

Run a complete privacy audit in under 60 seconds. Get your score, find issues, and learn how to fix them.

Start Free Audit