
AI Crawling and GDPR: Is AI Training on Your Website Data Legal?


Quick answer: When AI companies crawl your website to train their models, they may be processing personal data — which triggers GDPR obligations. The legal landscape around AI crawling is evolving rapidly, with data protection authorities across Europe issuing new guidance in 2025 and 2026.

The Legal Problem With AI Crawling

Every time an AI crawler like GPTBot or ClaudeBot visits your website, it downloads your content — including any personal data that appears on your pages. This includes names on "About Us" pages, email addresses on contact pages, employee directories, testimonials with real names, and user-generated content.

Under GDPR, downloading pages that contain personal data constitutes processing. The AI company acts as a data controller for that processing and must comply with GDPR requirements, including having a lawful basis for it.

What Legal Basis Do AI Companies Use?

Most AI companies claim legitimate interest (GDPR Article 6(1)(f)) as their legal basis for crawling and training. But this claim is increasingly challenged:

Company | Claimed Legal Basis | DPA Response | Status
--- | --- | --- | ---
OpenAI (GPTBot) | Legitimate interest | Italian DPA banned ChatGPT temporarily in 2023 | Under ongoing scrutiny
Google (Google-Extended) | Legitimate interest | Multiple complaints filed to DPAs | Pending decisions
Meta (Meta-ExternalAgent) | Legitimate interest + consent | Paused EU AI training after DPC pushback | Restricted in EU
Anthropic (ClaudeBot) | Legitimate interest | Honors robots.txt opt-out | Lower regulatory profile
Common Crawl (CCBot) | Public interest / research | Debated as training data source | Legal gray area

Key GDPR Principles at Stake

1. Purpose Limitation (Article 5(1)(b))

When you publish content on your website, the purpose is to inform visitors. AI companies repurpose this content for an entirely different purpose — training machine learning models. This arguably violates the purpose limitation principle, as the data is being used in a way the data subjects never anticipated.

2. Right to Object (Article 21)

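Under GDPR, data subjects have the right to object to processing based on legitimate interest. For AI crawling, the robots.txt file has become the de facto objection mechanism, although it is set by the website owner rather than by each data subject. Separately, the EU AI Act (Article 53(1)(c)) requires providers of general-purpose AI models to have a policy for honoring machine-readable copyright opt-outs, for which robots.txt is commonly used.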

3. Transparency (Article 14)

When AI companies collect data from websites (not directly from data subjects), they must provide information about the processing under Article 14. Most AI companies fail to individually notify website owners or the people whose data appears on crawled pages.

4. Data Minimization (Article 5(1)(c))

AI crawlers typically download entire pages, including content unrelated to their training purpose. This "vacuum everything" approach conflicts with the data minimization principle.

What the EU AI Act Says About Web Crawling

The EU AI Act, which took effect in phases starting August 2024, includes specific provisions relevant to AI crawling:

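  • Article 53(1)(c): Providers of general-purpose AI models must put in place a policy to comply with EU copyright law, including identifying and honoring machine-readable rights reservations made under Article 4(3) of the DSM Directive (Directive (EU) 2019/790). robots.txt is a common way to express such a reservation
  • Article 53(1)(d): Providers must draw up and make publicly available a sufficiently detailed summary of the content used for training
  • Recital 106: The copyright policy obligation applies to any provider placing a model on the EU market, regardless of where the training took place. The Act does not name a specific technical format; robots.txt is widely treated as one such mechanism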

How to Protect Your Website

Step 1: Audit Your Current AI Crawler Exposure

Use PrivacyChecker to scan your website. The audit identifies third-party connections and external services that may include AI-related data collection. Check which AI crawlers are currently accessing your site by reviewing your server access logs.
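As a rough illustration of the log review, the sketch below counts requests per AI crawler in access-log lines. It assumes the common "combined" log format, where the User-Agent is the last quoted field; the sample lines and the simplified user-agent strings are made up for the example, and the token list is illustrative rather than exhaustive. Google-Extended is left out because it is a robots.txt control token and, as far as we know, does not appear as a separate user agent in logs.

```python
import re
from collections import Counter

# Tokens to look for in the User-Agent field (illustrative, not exhaustive).
AI_CRAWLER_TOKENS = ["GPTBot", "ClaudeBot", "Meta-ExternalAgent", "CCBot"]

# Assumption: combined log format, where the User-Agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_ai_crawler_hits(log_lines):
    """Return a Counter mapping crawler token -> number of matching requests."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # line does not look like a combined-format entry
        user_agent = match.group(1).lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
                break
    return counts


# Made-up sample lines; real user-agent strings are longer.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Jan/2026:10:00:00 +0000] "GET /about HTTP/1.1" 200 5120 "-" "GPTBot/1.0 (sample)"',
    '203.0.113.5 - - [10/Jan/2026:10:00:02 +0000] "GET /contact HTTP/1.1" 200 2048 "-" "GPTBot/1.0 (sample)"',
    '198.51.100.7 - - [10/Jan/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "CCBot/2.0 (sample)"',
    '192.0.2.10 - - [10/Jan/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (sample browser)"',
]

hits = count_ai_crawler_hits(SAMPLE_LOG)
for token, count in hits.most_common():
    print(f"{token}: {count} request(s)")
```

To run it against a real log, pass an open file object instead of `SAMPLE_LOG`, since the function only needs an iterable of lines. User-agent strings can be spoofed, so treat the counts as an indication rather than proof.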

Step 2: Configure robots.txt

Add explicit directives for AI crawlers in your robots.txt file. See our detailed guide: AI Crawlers and robots.txt: Complete Guide.
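A minimal sketch of what such directives could look like, generated and then checked with Python's standard `urllib.robotparser`. The crawler tokens are the ones named in this article; the helper name `build_robots_txt` and the `example.com` URL are placeholders for illustration, and the "block everything for these agents" policy is an assumption you may want to relax.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens mentioned in this article (illustrative, not exhaustive).
AI_CRAWLER_TOKENS = ["GPTBot", "ClaudeBot", "Google-Extended", "Meta-ExternalAgent", "CCBot"]


def build_robots_txt(blocked_agents):
    """Build robots.txt text that disallows the given agents and allows everyone else."""
    lines = []
    for agent in blocked_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(AI_CRAWLER_TOKENS)
print(robots_txt)

# Sanity-check the rules the way a compliant crawler would read them.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
gptbot_allowed = parser.can_fetch("GPTBot", "https://example.com/about")
browser_allowed = parser.can_fetch("Mozilla/5.0", "https://example.com/about")
```

Keep in mind that robots.txt is a request, not an access control: it only affects crawlers that choose to honor it.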

Step 3: Add Machine-Readable Rights Statements

Consider declaring a text and data mining (TDM) reservation using the TDM Reservation Protocol (TDMRep). Article 4(3) of the EU DSM Directive lets rights holders reserve their rights against TDM in machine-readable form:

<!-- Add to your HTML <head> -->
<meta name="tdm-reservation" content="1">

<!-- Or via HTTP header -->
TDM-Reservation: 1
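Once the reservation is deployed, it is worth confirming that your pages actually send it. The sketch below checks a response's headers and HTML for the two signals shown above. The function name `has_tdm_reservation` is hypothetical, and checking the header before the meta tag is a choice made for this example, not something taken from the protocol text.

```python
from html.parser import HTMLParser


class _TdmMetaFinder(HTMLParser):
    """Records the content of a <meta name="tdm-reservation"> tag, if present."""

    def __init__(self):
        super().__init__()
        self.reservation = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "tdm-reservation":
            self.reservation = attr_map.get("content", "").strip()


def has_tdm_reservation(headers, html):
    """Return True if the header or the meta tag declares a reservation ("1")."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    for name, value in headers.items():
        if name.lower() == "tdm-reservation":
            return value.strip() == "1"
    finder = _TdmMetaFinder()
    finder.feed(html)
    return finder.reservation == "1"


page = '<html><head><meta name="tdm-reservation" content="1"></head><body></body></html>'
print(has_tdm_reservation({}, page))
```

The `headers` argument can be any mapping of header names to values, such as the headers returned by your HTTP client of choice.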

Step 4: Update Your Privacy Policy

Your privacy policy should address AI crawling if you're aware of it. Include a statement about automated data collection by third parties and your position on AI training data.

Recent Enforcement Actions

  • Italy (March 2023): Garante temporarily banned ChatGPT for GDPR violations related to data collection and lack of age verification
  • France (2024): CNIL launched investigations into AI companies' data scraping practices under GDPR
  • Ireland (2024): Meta paused its plans to use EU user data for AI training at the request of the DPC
  • EDPB (December 2024): Published Opinion 28/2024 on AI models, clarifying when legitimate interest can serve as a legal basis for training
  • Worldwide (2025-2026): Multiple class-action lawsuits filed against AI companies for unauthorized data scraping

What Website Owners Should Do Now

Action | Difficulty | Impact | Timeline
--- | --- | --- | ---
Add AI crawler rules to robots.txt | Easy | High | Today
Scan your site for AI-related services | Easy | Medium | Today
Add TDM Reservation headers | Easy | Medium | This week
Update privacy policy | Medium | High | This week
Review server logs for AI crawlers | Medium | High | Monthly
Implement server-level IP blocking | Hard | Very high | If needed
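For the last row, the core of IP blocking is a membership check against a list of network ranges. The sketch below shows only that check. The ranges are placeholders taken from the reserved documentation blocks, not real crawler addresses; in practice you would load the ranges each crawler operator publishes and enforce the rule in your firewall or web server rather than in application code.

```python
import ipaddress

# Placeholder ranges from the documentation-only address blocks.
# Assumption: replace these with ranges published by the crawler operators.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip):
    """Return True if client_ip falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


for ip in ["192.0.2.44", "203.0.113.9"]:
    verdict = "block" if is_blocked(ip) else "allow"
    print(f"{ip}: {verdict}")
```

Unlike robots.txt, this approach does not rely on the crawler's cooperation, which is why the table rates it as harder to set up but higher in impact.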

Frequently Asked Questions

Can I sue an AI company for crawling my website?

Potentially, yes. Under GDPR, you can lodge a complaint with your local DPA and seek compensation under Article 82. Several class-action lawsuits are underway in the EU and US. The strength of your case depends on whether the AI company violated your robots.txt directives and processed personal data without a valid legal basis.
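If you want to see whether a crawler ignored your directives, one approach is to replay the requests found in your logs against your robots.txt rules. The sketch below does that with Python's standard `urllib.robotparser`. The function name `find_robots_violations`, the sample rules, and the observed requests are all made up for illustration, and the result is a starting point for documentation, not legal evidence in itself.

```python
from urllib.robotparser import RobotFileParser


def find_robots_violations(robots_txt, requests):
    """Return the (agent, path) pairs that the given robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [(agent, path) for agent, path in requests if not parser.can_fetch(agent, path)]


# Sample rules: block GPTBot everywhere, allow everyone else.
ROBOTS_TXT = """User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Made-up requests, as (crawler token, requested path) pairs extracted from logs.
observed = [("GPTBot", "/team"), ("ClaudeBot", "/team"), ("GPTBot", "/contact")]

violations = find_robots_violations(ROBOTS_TXT, observed)
for agent, path in violations:
    print(f"{agent} requested {path} despite a Disallow rule")
```

Make sure to compare each request against the robots.txt version that was live at the time of the request, since rules added later do not apply retroactively.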

Does the GDPR apply to AI crawlers from non-EU companies?

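Generally, yes. Under Article 3(2), GDPR applies to organizations based outside the EU when they process personal data of people in the EU in connection with offering them goods or services or monitoring their behaviour. OpenAI (US), Anthropic (US), and others that offer their services in the EU are therefore expected to comply with GDPR when their crawlers collect personal data from EU websites.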

Is blocking AI crawlers enough to comply with GDPR?

Not on its own. Blocking AI crawlers protects your content and your visitors' data, but GDPR compliance requires broader measures. Using PrivacyChecker helps identify all privacy gaps on your site, not just AI-related ones.
