Learn how to use robots.txt to control which AI crawlers can access your site, and understand the strategic decisions around AI crawler access in 2026.
When OpenAI launched GPTBot — the web crawler used to gather training data for GPT models — it simultaneously published documentation explaining how to block it using robots.txt. Almost overnight, a technical file that most marketers never thought about became the centre of a strategic debate: should you let AI companies crawl your content? And if so, which ones?
In 2026, that question has expanded significantly. There are now a dozen or more AI crawlers regularly hitting web servers, each serving different purposes. Understanding what they do, which ones you should allow, and how to configure your robots.txt accordingly is a practical necessity for any site with a meaningful online presence.
The major AI crawlers currently operating fall into two categories:
Training crawlers. These crawlers collect content to train or fine-tune AI models. Allowing them means your content may appear in the training data of the next model version; blocking them means it will not.
Retrieval crawlers. These crawlers collect content to answer real-time queries. Allowing them means your content can appear in AI-generated answers; blocking them means it cannot.
Each legitimate AI crawler identifies itself via its User-agent string and publishes the IP ranges it crawls from in its documentation. Before blocking or allowing specific bots, verify that the crawlers hitting your server are legitimate: bad actors can spoof user-agent strings, so compare against the published IP ranges if you need certainty.
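That verification can be automated. As a minimal sketch, Python's standard-library `ipaddress` module can check whether a request's source IP falls inside a crawler's published ranges. The CIDR ranges below are placeholders, not real GPTBot ranges; substitute the values from each operator's own documentation:

```python
import ipaddress

# Placeholder CIDRs for illustration only -- NOT real GPTBot ranges.
# Look up the operator's published ranges before relying on this.
GPTBOT_RANGES = [
    "192.0.2.0/24",
    "198.51.100.0/24",
]

def is_legitimate(ip: str, published_ranges: list[str]) -> bool:
    """Return True if `ip` falls inside any published CIDR range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in published_ranges)

# A request claiming to be GPTBot from an IP outside the published
# ranges is likely a spoofed user-agent string.
print(is_legitimate("192.0.2.77", GPTBOT_RANGES))   # True: inside a listed range
print(is_legitimate("203.0.113.5", GPTBOT_RANGES))  # False: not in any range
```

For high-traffic sites, run this check against log data in batch rather than per request.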
This is where organisations need to think carefully, because the right answer depends on your business model and your goals.
AI visibility. If you block PerplexityBot, you cannot appear in Perplexity answers. If you block ClaudeBot, Claude cannot access your content for web searches. Blocking retrieval crawlers is a direct opt-out from AI-generated answers.
Competitive disadvantage. Your competitors who allow these crawlers will appear in AI answers; you will not. In markets where AI-generated recommendations are increasingly influential in purchase decisions, that is a compounding disadvantage.
Training data influence. If you allow training data crawlers, your content shapes how future models understand your industry and your brand. This is a long-term brand investment.
Content protection. If you produce premium, high-value content — journalism, research, creative writing — AI companies are effectively using that content to build commercial products without compensation. Many publishers have taken the position that this is unacceptable without a licensing agreement.
Competitive intelligence concerns. Some businesses worry that AI systems trained on their content could help competitors or reduce the unique value of their proprietary knowledge.
Paywalled content. If your content is behind a paywall, AI crawlers accessing it and distributing it for free undermines your business model.
The emerging consensus in the content industry is nuanced: many publishers block training crawlers (GPTBot, Google-Extended) while allowing retrieval crawlers (PerplexityBot, ClaudeBot) — reasoning that retrieval at least results in citations and referral traffic, while training data collection does not.
The robots.txt file lives at the root of your domain (e.g. https://yourdomain.com/robots.txt) and uses a simple directive syntax.
If you want to block all AI crawlers while maintaining Googlebot access:
```
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: DuckAssistBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```
A common compromise — allow bots that provide citations and referral traffic, block those that only collect training data:
```
# Block training-only crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow retrieval crawlers (Perplexity, Claude web access)
User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /
```
You do not have to make an all-or-nothing decision. Block AI crawlers from specific sections:
```
User-agent: GPTBot
Disallow: /premium-content/
Disallow: /research/
Allow: /blog/
Allow: /guides/
```
This approach lets you protect high-value proprietary content while making your marketing content and public guides available for training and retrieval.
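Before deploying a granular policy like this, you can sanity-check it locally. Python's standard-library `urllib.robotparser` evaluates robots.txt rules the way a well-behaved crawler would; here is a quick sketch using the example rules above (yourdomain.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# The granular policy from the example above, parsed locally.
rules = """\
User-agent: GPTBot
Disallow: /premium-content/
Disallow: /research/
Allow: /blog/
Allow: /guides/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Premium content is blocked for GPTBot...
print(parser.can_fetch("GPTBot", "https://yourdomain.com/premium-content/report"))  # False
# ...while public marketing content remains crawlable.
print(parser.can_fetch("GPTBot", "https://yourdomain.com/blog/some-post"))          # True
# Paths with no matching rule default to allowed.
print(parser.can_fetch("GPTBot", "https://yourdomain.com/about"))                   # True
```

Running this kind of check in CI whenever robots.txt changes catches accidental lockouts before they reach production.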
Beyond robots.txt, a new convention is emerging: llms.txt. Proposed by answer.ai, the llms.txt file is a structured document placed at your domain root that tells AI systems what your site is about and which pages matter most.
Think of it as a curated index for AI systems: not just what they can access, but what deserves their attention. While adoption is still growing and AI systems do not yet universally parse it, implementing llms.txt is a low-effort forward investment.
A basic llms.txt looks like this:
```
# Surfaceable
> AI visibility and SEO tracking platform. Track your brand's presence
> in ChatGPT, Claude, Gemini, and Perplexity.

## Key pages

- [About](https://surfaceable.io/about): What Surfaceable is and who it's for
- [Blog](https://surfaceable.io/blog): Guides on AI visibility and SEO
- [Documentation](https://surfaceable.io/docs): How to use Surfaceable

## Optional

- [Pricing](https://surfaceable.io/pricing): Subscription plans and pricing
```
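Because the format is simple, a few lines of code can sanity-check a draft before you publish it. The sketch below encodes the basic structure from the answer.ai proposal (an H1 title, an optional blockquote summary, H2 sections of link lists); it is a rough structural check, not an official validator:

```python
import re

def check_llms_txt(text: str) -> list[str]:
    """Flag basic structural problems in an llms.txt draft.

    Rough check based on the answer.ai proposal: an H1 title,
    an optional blockquote summary, then H2 sections of link lists.
    """
    problems = []
    lines = [l for l in text.splitlines() if l.strip()]
    if not lines or not lines[0].startswith("# "):
        problems.append("first line should be an H1 title ('# Site name')")
    if not any(l.startswith("## ") for l in lines):
        problems.append("no H2 sections ('## Key pages') found")
    link = re.compile(r"^- \[[^\]]+\]\([^)]+\)")
    for l in lines:
        if l.startswith("- ") and not link.match(l):
            problems.append(f"list item is not a markdown link: {l!r}")
    return problems

sample = """\
# Surfaceable
> AI visibility and SEO tracking platform.

## Key pages
- [About](https://surfaceable.io/about): What Surfaceable is
"""
print(check_llms_txt(sample))  # [] -- structurally valid
```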
Regardless of your allow/block decisions, monitoring which AI crawlers are hitting your site is good practice. Check your server access logs or analytics for user-agent strings associated with known AI crawlers:
- GPTBot
- PerplexityBot
- ClaudeBot
- anthropic-ai
- Google-Extended
- meta-externalagent

Understanding traffic volumes from each crawler helps you make informed decisions about which ones to prioritise and validates that your robots.txt directives are being respected.
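A short script can tally those user-agent strings from your access logs. This is a sketch under assumptions: the log lines below are fabricated examples, and real log formats vary by server and host:

```python
from collections import Counter

# User-agent substrings to look for (matched case-insensitively).
AI_CRAWLERS = ["GPTBot", "PerplexityBot", "ClaudeBot",
               "anthropic-ai", "Google-Extended", "meta-externalagent"]

def tally_ai_crawlers(log_lines):
    """Count access-log lines per AI crawler, keyed by crawler name."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for crawler in AI_CRAWLERS:
            if crawler.lower() in lowered:
                counts[crawler] += 1
    return counts

# Fabricated log lines for illustration; in practice, read your
# server's access log (the path varies by server and host).
sample = [
    '203.0.113.9 - - [01/Feb/2026] "GET /blog/ HTTP/1.1" 200 "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '198.51.100.4 - - [01/Feb/2026] "GET /docs/ HTTP/1.1" 200 "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
]
print(tally_ai_crawlers(sample))  # Counter({'GPTBot': 1, 'PerplexityBot': 1})
```

Pair the counts with the IP-range verification discussed earlier to separate genuine crawler traffic from spoofed user agents.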
Most major AI companies commit to honouring robots.txt directives. OpenAI, Anthropic, Google, and Perplexity have all published documentation confirming this. However, robots.txt is a convention, not an enforcement mechanism: compliance is voluntary, and not every crawler honours it.
If you discover a crawler ignoring your robots.txt, the recourse is to contact the operator directly (most publish a contact for this purpose) and, if necessary, use your server's firewall to block the crawler's IP ranges.
Your robots.txt file is now doing more work than ever. In 2026, it is not just about Googlebot — it is a policy document for how your content is used by an expanding ecosystem of AI systems.
The strategic decision — allow, partially allow, or block — depends on your business model and goals. For most brands with public content and an interest in AI visibility, allowing retrieval crawlers while being selective about training data crawlers is a sensible default. Implement llms.txt as a complementary signal. And monitor your AI crawler traffic regularly to ensure your policies are being respected and to inform future decisions.
This is one of those technical details that has outsized strategic consequences. Get your AI crawler policy right, and document it in your robots.txt deliberately — not by accident or default.