Learn how to use robots.txt to control which AI crawlers can access your site, and understand the strategic decisions around AI crawler access in 2026.
When OpenAI launched GPTBot — the web crawler used to gather training data for GPT models — it simultaneously published documentation explaining how to block it using robots.txt. Almost overnight, a technical file that most marketers never thought about became the centre of a strategic debate: should you let AI companies crawl your content? And if so, which ones?
In 2026, that question has expanded significantly. There are now a dozen or more AI crawlers regularly hitting web servers, each serving different purposes. Understanding what they do, which ones you should allow, and how to configure your robots.txt accordingly is a practical necessity for any site with a meaningful online presence.
The major AI crawlers currently operating fall into two categories:
Training crawlers. These crawlers collect content to train or fine-tune AI models. Allowing them means your content may appear in the training data of the next model version; blocking them means it will not.
Retrieval crawlers. These crawlers collect content to answer real-time queries. Allowing them means your content can appear in AI-generated answers; blocking them means it cannot.
Each legitimate AI crawler identifies itself via its User-agent string and publishes the IP ranges it crawls from in its documentation. Before blocking or allowing specific bots, verify that the crawlers hitting your server are legitimate: bad actors can spoof user-agent strings, so compare against the published IP ranges if you need certainty.
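That verification can be automated. As a minimal sketch, Python's standard-library `ipaddress` module can check whether a request's source IP falls inside a crawler's published ranges. The CIDR ranges below are placeholders, not real GPTBot ranges; substitute the values from each operator's own documentation:

```python
import ipaddress

# Placeholder CIDRs for illustration only -- NOT real GPTBot ranges.
# Look up the operator's published ranges before relying on this.
GPTBOT_RANGES = [
    "192.0.2.0/24",
    "198.51.100.0/24",
]

def is_legitimate(ip: str, published_ranges: list[str]) -> bool:
    """Return True if `ip` falls inside any published CIDR range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in published_ranges)

# A request claiming to be GPTBot from an IP outside the published
# ranges is likely a spoofed user-agent string.
print(is_legitimate("192.0.2.77", GPTBOT_RANGES))   # True: inside a listed range
print(is_legitimate("203.0.113.5", GPTBOT_RANGES))  # False: not in any range
```

For high-traffic sites, run this check against log data in batch rather than per request.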
This is where organisations need to think carefully, because the right answer depends on your business model and your goals.
AI visibility. If you block PerplexityBot, you cannot appear in Perplexity answers. If you block ClaudeBot, Claude cannot access your content for web searches. Blocking retrieval crawlers is a direct opt-out from AI-generated answers.
Competitive disadvantage. Your competitors who allow these crawlers will appear in AI answers; you will not. In markets where AI-generated recommendations are increasingly influential in purchase decisions, that is a compounding disadvantage.
Training data influence. If you allow training data crawlers, your content shapes how future models understand your industry and your brand. This is a long-term brand investment.
Content protection. If you produce premium, high-value content — journalism, research, creative writing — AI companies are effectively using that content to build commercial products without compensation. Many publishers have taken the position that this is unacceptable without a licensing agreement.
Competitive intelligence concerns. Some businesses worry that AI systems trained on their content could help competitors or reduce the unique value of their proprietary knowledge.
Paywalled content. If your content is behind a paywall, AI crawlers accessing it and distributing it for free undermines your business model.
The emerging consensus in the content industry is nuanced: many publishers block training crawlers (GPTBot, Google-Extended) while allowing retrieval crawlers (PerplexityBot, ClaudeBot) — reasoning that retrieval at least results in citations and referral traffic, while training data collection does not.
The robots.txt file lives at the root of your domain (e.g. https://yourdomain.com/robots.txt) and uses a simple directive syntax.
If you want to block all AI crawlers while maintaining Googlebot access:
```
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: DuckAssistBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```
A common compromise — allow bots that provide citations and referral traffic, block those that only collect training data:
```
# Block training-only crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow retrieval crawlers (Perplexity, Claude web access)
User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /
```
You do not have to make an all-or-nothing decision. Block AI crawlers from specific sections:
```
User-agent: GPTBot
Disallow: /premium-content/
Disallow: /research/
Allow: /blog/
Allow: /guides/
```
This approach lets you protect high-value proprietary content while making your marketing content and public guides available for training and retrieval.
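Before deploying a granular policy like this, you can sanity-check it locally. Python's standard-library `urllib.robotparser` evaluates robots.txt rules the way a well-behaved crawler would; here is a quick sketch using the example rules above (yourdomain.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# The granular policy from the example above, parsed locally.
rules = """\
User-agent: GPTBot
Disallow: /premium-content/
Disallow: /research/
Allow: /blog/
Allow: /guides/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Premium content is blocked for GPTBot...
print(parser.can_fetch("GPTBot", "https://yourdomain.com/premium-content/report"))  # False
# ...while public marketing content remains crawlable.
print(parser.can_fetch("GPTBot", "https://yourdomain.com/blog/some-post"))          # True
# Paths with no matching rule default to allowed.
print(parser.can_fetch("GPTBot", "https://yourdomain.com/about"))                   # True
```

Running this kind of check in CI whenever robots.txt changes catches accidental lockouts before they reach production.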
Beyond robots.txt, a new convention is emerging: llms.txt. Proposed by answer.ai, the llms.txt file is a structured document placed at your domain root that tells AI systems what your site is about and which pages matter most.
Think of it as a curated index for AI systems: not just what they can access, but what deserves their attention. While adoption is still growing and AI systems do not yet universally parse it, implementing llms.txt is a low-effort forward investment.
A basic llms.txt looks like this:
```
# Surfaceable
> AI visibility and SEO tracking platform. Track your brand's presence
> in ChatGPT, Claude, Gemini, and Perplexity.

## Key pages

- [About](https://surfaceable.io/about): What Surfaceable is and who it's for
- [Blog](https://surfaceable.io/blog): Guides on AI visibility and SEO
- [Documentation](https://surfaceable.io/docs): How to use Surfaceable

## Optional

- [Pricing](https://surfaceable.io/pricing): Subscription plans and pricing
```
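Because the format is simple, a few lines of code can sanity-check a draft before you publish it. The sketch below encodes the basic structure from the answer.ai proposal (an H1 title, an optional blockquote summary, H2 sections of link lists); it is a rough structural check, not an official validator:

```python
import re

def check_llms_txt(text: str) -> list[str]:
    """Flag basic structural problems in an llms.txt draft.

    Rough check based on the answer.ai proposal: an H1 title,
    an optional blockquote summary, then H2 sections of link lists.
    """
    problems = []
    lines = [l for l in text.splitlines() if l.strip()]
    if not lines or not lines[0].startswith("# "):
        problems.append("first line should be an H1 title ('# Site name')")
    if not any(l.startswith("## ") for l in lines):
        problems.append("no H2 sections ('## Key pages') found")
    link = re.compile(r"^- \[[^\]]+\]\([^)]+\)")
    for l in lines:
        if l.startswith("- ") and not link.match(l):
            problems.append(f"list item is not a markdown link: {l!r}")
    return problems

sample = """\
# Surfaceable
> AI visibility and SEO tracking platform.

## Key pages
- [About](https://surfaceable.io/about): What Surfaceable is
"""
print(check_llms_txt(sample))  # [] -- structurally valid
```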
Regardless of your allow/block decisions, monitoring which AI crawlers are hitting your site is good practice. Check your server access logs or analytics for user-agent strings associated with known AI crawlers:
- GPTBot
- PerplexityBot
- ClaudeBot
- anthropic-ai
- Google-Extended
- meta-externalagent

Understanding traffic volumes from each crawler helps you make informed decisions about which ones to prioritise and validates that your robots.txt directives are being respected.
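A short script can tally those user-agent strings from your access logs. This is a sketch under assumptions: the log lines below are fabricated examples, and real log formats vary by server and host:

```python
from collections import Counter

# User-agent substrings to look for (matched case-insensitively).
AI_CRAWLERS = ["GPTBot", "PerplexityBot", "ClaudeBot",
               "anthropic-ai", "Google-Extended", "meta-externalagent"]

def tally_ai_crawlers(log_lines):
    """Count access-log lines per AI crawler, keyed by crawler name."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for crawler in AI_CRAWLERS:
            if crawler.lower() in lowered:
                counts[crawler] += 1
    return counts

# Fabricated log lines for illustration; in practice, read your
# server's access log (the path varies by server and host).
sample = [
    '203.0.113.9 - - [01/Feb/2026] "GET /blog/ HTTP/1.1" 200 "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '198.51.100.4 - - [01/Feb/2026] "GET /docs/ HTTP/1.1" 200 "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
]
print(tally_ai_crawlers(sample))  # Counter({'GPTBot': 1, 'PerplexityBot': 1})
```

Pair the counts with the IP-range verification discussed earlier to separate genuine crawler traffic from spoofed user agents.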
Most major AI companies commit to honouring robots.txt directives. OpenAI, Anthropic, Google, and Perplexity have all published documentation confirming this. However, robots.txt is a convention, not an enforcement mechanism: compliance is voluntary, and not every crawler honours it.
If you discover a crawler ignoring your robots.txt, the recourse is to contact the operator directly (most publish a contact for this purpose) and, if necessary, use your server's firewall to block the crawler's IP ranges.
Your robots.txt file is now doing more work than ever. In 2026, it is not just about Googlebot — it is a policy document for how your content is used by an expanding ecosystem of AI systems.
The strategic decision — allow, partially allow, or block — depends on your business model and goals. For most brands with public content and an interest in AI visibility, allowing retrieval crawlers while being selective about training data crawlers is a sensible default. Implement llms.txt as a complementary signal. And monitor your AI crawler traffic regularly to ensure your policies are being respected and to inform future decisions.
This is one of those technical details that has outsized strategic consequences. Get your AI crawler policy right, and document it in your robots.txt deliberately — not by accident or default.