
robots.txt and AI Crawlers: GPTBot, ClaudeBot & More

A comprehensive guide to configuring robots.txt for AI crawlers — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and more.

15 January 2026 · 10 min read

The landscape of web crawling has fundamentally changed. Whilst traditional search engine crawlers have been respectful guests for decades, AI companies have introduced a new generation of bots that consume web content for large language model training and AI-powered search. Understanding how to control these crawlers through your robots.txt file has become essential for website owners who want control over how their content is used.

Search Crawlers vs AI Training Crawlers

Not all crawlers serve the same purpose. Traditional search engine crawlers like Googlebot and Bingbot index content to make it discoverable through search. Users search, find your site, and you gain traffic. It's a mutually beneficial relationship.

AI crawlers operate with different goals. Some, like PerplexityBot, power AI search engines that cite sources and send traffic. Others, like Google-Extended or CCBot, collect content primarily for training large language models. Your content becomes part of the training data, but you may not receive traffic or attribution in return.

This distinction matters when configuring your robots.txt file. You might want to allow AI search crawlers whilst blocking AI training bots, or you might have different preferences for different companies.

The Major AI Crawlers

As of early 2026, several AI crawlers actively traverse the web. Here are the most significant ones:

GPTBot (OpenAI)

GPTBot is OpenAI's web crawler, used to gather content for training its models. (OpenAI operates separate user agents, such as OAI-SearchBot, for search features.) OpenAI respects robots.txt directives.

User-agent string: GPTBot

ClaudeBot (Anthropic)

ClaudeBot crawls content for Anthropic's Claude models. Anthropic respects robots.txt exclusions and provides documentation on how to control their crawler.

User-agent string: ClaudeBot

PerplexityBot (Perplexity AI)

Perplexity operates an AI search engine that cites sources and can drive traffic back to your site. Their crawler gathers content for real-time search answers.

User-agent string: PerplexityBot

Google-Extended (Google)

Google-Extended is separate from Googlebot. Whilst Googlebot indexes content for traditional search, Google-Extended collects data for training Google's generative AI models including Gemini. You can block Google-Extended whilst still allowing Googlebot.

User-agent string: Google-Extended

CCBot (Common Crawl)

Common Crawl is a non-profit that creates publicly available web crawl data. Whilst not directly an AI company, their datasets are widely used for training language models, making CCBot a significant contributor to AI training data.

User-agent string: CCBot

Bytespider (ByteDance)

ByteDance, the company behind TikTok, operates Bytespider to collect data for AI model training and other purposes.

User-agent string: Bytespider

FacebookBot (Meta)

Meta uses FacebookBot for various purposes, including training their Llama models and other AI initiatives.

User-agent string: FacebookBot

Amazonbot (Amazon)

Amazonbot is Amazon's crawler, supporting both their search capabilities and AI initiatives. Amazon respects robots.txt directives.

User-agent string: Amazonbot

Configuring robots.txt for AI Crawlers

Your robots.txt file sits at the root of your domain (e.g. https://example.com/robots.txt) and instructs crawlers which parts of your site they can access.
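You can check how a given set of directives will be interpreted using Python's built-in robots.txt parser. This is a minimal sketch: the rules are a hypothetical two-line file, parsed directly from a string rather than fetched from a live site.

```python
# Check whether a given crawler may fetch a URL, using Python's
# built-in robots.txt parser from the standard library.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; parse() accepts the file's lines.
rules = """\
User-agent: GPTBot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# GPTBot is disallowed everywhere; agents with no matching group
# fall through to "allowed" when there is no User-agent: * block.
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))    # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))  # True
```

To test a live site instead, call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()`.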

Blocking All AI Crawlers

If you want to prevent all major AI bots from accessing your content:

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Amazonbot
Disallow: /
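Maintaining one group per bot by hand gets tedious as new crawlers appear. A small script can generate the file from a single list — a sketch, assuming the crawler list above reflects your policy:

```python
# Generate a "block all AI crawlers" robots.txt from a list of
# user-agents, so adding a new bot is a one-line change.
AI_CRAWLERS = [
    "GPTBot", "ClaudeBot", "Google-Extended", "CCBot",
    "PerplexityBot", "Bytespider", "FacebookBot", "Amazonbot",
]

def block_all(agents):
    """Return robots.txt text disallowing every listed agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "# Block AI training crawlers\n" + "\n\n".join(groups) + "\n"

print(block_all(AI_CRAWLERS))
```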

Allowing Search Crawlers, Blocking Training

You can maintain your search engine presence whilst opting out of AI training:

# Allow traditional search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Allow AI search (sends traffic)
User-agent: PerplexityBot
Allow: /

# Block AI training crawlers
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

Selective Path Blocking

You might want to allow AI crawlers to access certain sections whilst protecting others:

# Allow AI crawlers for marketing pages, block premium content
User-agent: GPTBot
Disallow: /blog/premium/
Disallow: /members/
Allow: /

User-agent: ClaudeBot
Disallow: /blog/premium/
Disallow: /members/
Allow: /
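When Allow and Disallow rules overlap, crawlers following the Robots Exclusion Protocol (RFC 9309) apply the most specific (longest) matching rule, so `Disallow: /members/` overrides the broad `Allow: /` for those paths. Python's `urllib.robotparser` instead applies rules in file order (first match wins), which agrees with the intended result here because the specific Disallow lines come before the broad Allow. A sketch sanity-checking the GPTBot group above:

```python
# Verify the selective-path rules: specific Disallow lines should
# override the broad Allow for the protected sections.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /blog/premium/
Disallow: /members/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("GPTBot", "https://example.com/about"))         # True
print(parser.can_fetch("GPTBot", "https://example.com/members/area"))  # False
```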

Limitations of robots.txt

Whilst robots.txt is the standard method for controlling crawler access, it has important limitations:

It's Voluntary

robots.txt is a voluntary protocol, not an enforcement mechanism. Reputable companies honour these directives, but nothing technically prevents a crawler from ignoring them.

No Retroactive Effect

Blocking a crawler today doesn't remove data already collected. If an AI company crawled your site before you implemented blocks, that content may already be part of their training data.

Discovery Still Happens

Even blocked crawlers access your robots.txt file to read the directives. They know your site exists — they simply won't crawl the disallowed sections.

Monitoring and Verification

After configuring your robots.txt, verify it works correctly:

Test Your Configuration

Use online validators to check your robots.txt syntax. Google Search Console includes a robots.txt report that shows how Googlebot reads and caches your file.

Analyse Server Logs

Review your web server logs to see which user-agents are accessing your site. Look for the AI crawler user-agent strings listed above.
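A short script can tally those hits for you. This is a minimal sketch assuming the common "combined" log format, where the user-agent is the final quoted field; the sample line is illustrative.

```python
# Count AI-crawler requests in web server access-log lines
# (combined log format: the user-agent is the last quoted field).
import re
from collections import Counter

AI_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
             "CCBot", "Bytespider", "FacebookBot", "Amazonbot")

def count_ai_hits(lines):
    """Tally requests per AI crawler across access-log lines."""
    hits = Counter()
    for line in lines:
        match = re.search(r'"([^"]*)"\s*$', line)  # last quoted field = UA
        if not match:
            continue
        user_agent = match.group(1)
        for bot in AI_AGENTS:
            if bot in user_agent:
                hits[bot] += 1
    return hits

sample = ['1.2.3.4 - - [15/Jan/2026] "GET / HTTP/1.1" 200 512 "-" '
          '"Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"']
print(count_ai_hits(sample))   # Counter({'GPTBot': 1})
```

Run it over your real log with `count_ai_hits(open("/var/log/nginx/access.log"))` (path will vary by server).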

Use Automated Tools

Tools like GEO Lantern's AI crawler scanner can analyse your site's configuration and help you understand which AI crawlers can currently access your content. The AI readiness score evaluates your configuration and offers recommendations.

For more granular analysis, the AI crawler access feature shows exactly which crawlers are allowed or blocked based on your current robots.txt configuration.

Best Practices

Be Explicit

Don't rely on default behaviour. Explicitly allow or disallow each crawler type. This creates a clear record of your intentions.

Differentiate Between Use Cases

Consider treating search crawlers (discovery) differently from training crawlers (consumption). You might want to be found whilst still protecting your content from being used as training data.

Review Regularly

The landscape changes frequently. New AI companies launch, existing ones introduce new crawlers, and policies evolve. Review your robots.txt configuration quarterly.

Document Your Reasoning

Keep internal documentation about why you've configured robots.txt in a particular way. This helps when team members or stakeholders question the decisions.

Making Your Decision

Your robots.txt configuration for AI crawlers is both a technical implementation and a business decision:

  • Marketing sites might welcome AI crawlers for increased visibility in AI search results
  • Publishers with premium content might block training crawlers whilst allowing search crawlers
  • E-commerce sites might allow product pages to be crawled for AI shopping features
  • Technical blogs might allow crawling to increase citation frequency

There's no universally correct answer. Your decision should reflect your content's value, your business model, and how you want AI systems to interact with your work.

The configuration itself is straightforward — add user-agent blocks to your robots.txt file. The harder decisions involve which crawlers to block, which to allow, and how to balance search visibility with content protection.