
Robots.txt for AI: How to Let Search Bots In While Blocking Training Crawlers

By Jack · April 5, 2026 · 12 min read

Block training crawlers, allow search crawlers. That's the one-sentence robots.txt strategy for AI in 2026. Your robots.txt file is the only thing standing between your content and the dozens of AI bots now hitting every public website. Some of those bots are scraping your pages to train language models. Others are fetching your content in real time so AI search tools can cite you. You want to stop the first group and welcome the second.

Most site owners don't realize there's a difference. They either block everything (and tank their AI search visibility) or block nothing (and hand their content to training pipelines for free). Both are bad. This guide breaks down every AI crawler you need to know about, which ones to block, which to allow, and the exact robots.txt rules to copy into your site today.

The Two Types of AI Crawlers (and Why It Matters)

Before you touch your robots.txt, you need to understand a distinction that didn't exist two years ago. AI companies now operate two fundamentally different types of web crawlers, and confusing them is the most common mistake we see.

Training crawlers scrape your content to feed it into model training datasets. Your blog post, product descriptions, and FAQ content get ingested, tokenized, and baked into the model's weights. You get zero credit, zero traffic, and zero compensation. The model learns from your content and then competes with you. CCBot (used by Common Crawl, which feeds most open-source LLMs), GPTBot (OpenAI's training crawler), and Google-Extended (Gemini training) are the big ones.

Search crawlers work differently. When someone asks ChatGPT "what's the best project management tool for small teams," ChatGPT-User goes out, fetches relevant pages in real time, reads them, and generates an answer with citations. Your page shows up as a source. You get visibility and potentially a click. PerplexityBot does the same thing for Perplexity. Googlebot feeds Google's AI Overviews.

Training crawlers take your content. Search crawlers send you traffic. Your robots.txt needs to reflect that difference. If you're unsure how well your site currently performs in AI search, the AI Authority Checker will show you exactly where you stand across ChatGPT, Perplexity, Gemini, and Claude.

Every AI Crawler You Need to Know About

Here's the full landscape as of March 2026. This changes frequently because AI companies keep spinning up new bots, but these are the ones that matter right now:

| Bot Name | Operator | Type | Purpose | Recommendation |
| --- | --- | --- | --- | --- |
| GPTBot | OpenAI | Training | Scrapes content for model training and fine-tuning | Block |
| ChatGPT-User | OpenAI | Search | Real-time web browsing during ChatGPT conversations | Allow |
| OAI-SearchBot | OpenAI | Search | Powers ChatGPT's search feature and link previews | Allow |
| Google-Extended | Google | Training | Feeds content to Gemini model training (separate from Googlebot) | Block |
| Googlebot | Google | Search | Standard web indexing (also feeds AI Overviews) | Allow |
| PerplexityBot | Perplexity | Search | Real-time RAG retrieval for Perplexity answers | Allow |
| CCBot | Common Crawl | Training | Massive web-scrape dataset used to train most open-source LLMs | Block |
| ClaudeBot | Anthropic | Training | Scrapes content for Claude model training | Block |
| anthropic-ai | Anthropic | Training | Older Anthropic training-crawler user-agent | Block |
| cohere-ai | Cohere | Training | Training data collection for Cohere models | Block |
| Bytespider | ByteDance | Training | Scrapes content for TikTok/ByteDance AI models | Block |
| Applebot-Extended | Apple | Training | Feeds content to Apple Intelligence training | Block |
| Applebot | Apple | Search | Standard Apple search (Siri, Spotlight, Safari suggestions) | Allow |
| Meta-ExternalAgent | Meta | Training | Scrapes content for Meta AI model training | Block |

Notice the pattern. Every major AI company has split their crawlers into separate user-agents for training vs. search. OpenAI has GPTBot (training) and ChatGPT-User (search). Google has Google-Extended (training) and Googlebot (search). Apple has Applebot-Extended (training) and Applebot (search). This split exists specifically because they knew publishers would want to block one and allow the other. Use it.

The Recommended Robots.txt Configuration

Here's a production-ready robots.txt block you can add to your site. It blocks every known training crawler while keeping all search crawlers open:

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Allow AI search crawlers (do NOT block these)
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Googlebot
Allow: /

User-agent: Applebot
Allow: /

A few important notes. Match each User-agent string's capitalization exactly; some crawlers match it case-sensitively, and path rules are case-sensitive regardless. Rule order matters less than many guides claim: under the Robots Exclusion Protocol (RFC 9309), a compliant crawler obeys the most specific group that matches its user-agent, wherever that group sits in the file. The real gotcha is that a bot named in its own group ignores your User-agent: * group entirely, so if your catch-all block disallows paths like /admin/, repeat those rules inside each bot-specific group you add.

I think this configuration is the right default for 95% of ecommerce sites. You protect your content from being used as free training data while keeping the door open to every AI search channel that can send you traffic. The only sites that might want a different approach are those actively trying to get into training data on purpose (more on that below).
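One way to sanity-check the configuration before you deploy it is to run it through Python's standard-library robots.txt parser, which simulates how a compliant crawler reads the rules. This is a minimal sketch; the example.com URL is a placeholder:

```python
from urllib import robotparser

# The same rules as above, abbreviated to a few representative bots
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

page = "https://example.com/blog/some-post"
for bot in ["GPTBot", "CCBot", "ChatGPT-User", "PerplexityBot", "Googlebot"]:
    # Googlebot has no rule here, so it is allowed by default
    verdict = "allowed" if rp.can_fetch(bot, page) else "blocked"
    print(f"{bot}: {verdict}")
```

To audit the live file after deployment, call `rp.set_url("https://yourdomain.com/robots.txt")` followed by `rp.read()` instead of `parse()`.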

How to Implement This on Different Platforms

Where you edit your robots.txt depends on your platform. Here's the breakdown:

| Platform | How to Edit robots.txt | Notes |
| --- | --- | --- |
| Shopify | Edit robots.txt.liquid in your theme templates | Create the file if it doesn't exist. Changes go live immediately on save. |
| WordPress | Yoast SEO > Tools > File editor, or edit the physical file in your root directory | Some hosts block direct file editing. Use FTP/SFTP if the plugin method fails. |
| Next.js / Vercel | Add robots.txt to the public/ folder, or use the robots.ts route handler | The route handler approach lets you dynamically generate rules per environment. |
| Wix | SEO settings in the Wix dashboard | Limited customization. Wix added AI bot controls in late 2025. |
| Custom / Static | Edit robots.txt at your site root directly | Full control. Make sure your server serves it at /robots.txt with text/plain content type. |

For Shopify specifically: go to your admin panel, navigate to Online Store > Themes > Actions > Edit code, and look for robots.txt.liquid in the Templates folder. If it doesn't exist, create it. Shopify's default robots.txt is fine for traditional SEO but has zero AI crawler rules. You need to add them manually.
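For reference, here's a sketch of what a custom robots.txt.liquid might look like: it keeps Shopify's default groups and appends training-bot blocks at the end. The Liquid objects used (robots.default_groups, group.user_agent, group.rules, group.sitemap) follow Shopify's documented robots template; treat this as a starting point rather than a drop-in file.

```liquid
{%- comment -%} Preserve Shopify's default robots.txt rules {%- endcomment -%}
{% for group in robots.default_groups %}
  {{- group.user_agent }}
  {%- for rule in group.rules %}
    {{ rule }}
  {%- endfor %}
  {%- if group.sitemap != blank %}
    {{ group.sitemap }}
  {%- endif %}
{% endfor %}

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Add the remaining training-bot groups from the recommended configuration above in the same way.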

What Happens When You Block vs. Allow AI Crawlers

Let's be concrete about the consequences of each approach, because I see a lot of confusion here:

| Scenario | Training Crawlers | Search Crawlers | Result |
| --- | --- | --- | --- |
| Block all AI bots | Blocked | Blocked | Content protected from training, but invisible to AI search. ChatGPT and Perplexity can't cite you. |
| Allow all AI bots | Allowed | Allowed | Full AI search visibility, but your content feeds training pipelines for free. |
| Block training, allow search (recommended) | Blocked | Allowed | Content protected from training. Full visibility in AI search results. Best of both worlds. |
| No robots.txt rules for AI | Allowed (default) | Allowed (default) | Same as "allow all." No robots.txt rule means open access. |

The third option is what we recommend. It's also what most major publishers have adopted. The New York Times, for example, blocks GPTBot and CCBot but allows Googlebot and other search crawlers.

One nuance worth understanding: blocking a training crawler today doesn't retroactively remove your content from models that already trained on it. GPT-4, Claude, and Gemini all trained on web data collected before most sites had AI-specific robots.txt rules. Blocking GPTBot now prevents future training runs from ingesting your new content, but the old stuff is already baked in. That's not a reason to skip the block. It's a reason to implement it sooner rather than later.

The "But I Want to Be in Training Data" Argument

Some site owners deliberately allow training crawlers because they believe being in the training data increases the chance AI recommends them. There's a kernel of truth here, and I think it's worth taking seriously.

If your brand appears frequently in training data, the AI model "knows" about you in a fundamental way. It can reference your brand, products, and reputation without needing to fetch your site in real time. This is why brands like Nike and Apple get mentioned by ChatGPT constantly even when ChatGPT doesn't browse the web. They're in the training data.

But here's the counterpoint: for 99% of ecommerce brands, training data presence doesn't translate to recommendations. AI models are trained on billions of pages. Your 200-page Shopify store is a rounding error. Unless you're already a household name with thousands of third-party mentions, your training data presence won't meaningfully move the needle. What will move it is showing up in real-time AI search results with good content, strong schema markup, and legitimate authority signals.

My opinion: block training crawlers unless you're a major brand with a content licensing strategy. For everyone else, the risk-reward doesn't justify it. Your content gets used for free, and you get nothing measurable in return.

Beyond Robots.txt: Additional Protection Layers

Robots.txt is the first line of defense, but it's not the only one. Here are the additional layers you should consider:

1. HTTP Headers (X-Robots-Tag)

You can add X-Robots-Tag headers to HTTP responses for more granular control. Unlike robots.txt (which operates at the path level), HTTP headers let you set per-page or per-content-type rules. For example, you could allow AI bots to access your blog but block them from product pages via different header configurations.

X-Robots-Tag: noai, noimageai

The noai directive tells compliant bots not to use the page for AI training, and noimageai extends that protection to images. Both are newer, informal directives without universal support, so treat them as a supplementary opt-out signal layered on top of robots.txt rather than a guarantee.
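If your site sits behind nginx, for example, a location block can attach the header to only the paths you want to protect. A minimal sketch (the /blog/ path is illustrative):

```nginx
# Attach AI-training opt-out headers to blog responses only;
# product pages and the rest of the site are left untouched.
location /blog/ {
    add_header X-Robots-Tag "noai, noimageai" always;
}
```

The `always` flag makes nginx emit the header on error responses too, not just 200s.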

2. TDM Reservation Protocol

The W3C's Text and Data Mining (TDM) Reservation Protocol is a machine-readable rights declaration. You add a tdm-reservation meta tag or HTTP header that explicitly states your content isn't available for text and data mining. It's primarily used in the EU where the DSM Directive gives publishers legal teeth to enforce these claims.
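Per the current TDM-Rep draft, a minimal declaration is a pair of meta tags in your page head; the policy URL below is a placeholder you'd replace with your own document:

```html
<!-- Reserve text-and-data-mining rights for this page (TDM-Rep draft) -->
<meta name="tdm-reservation" content="1">
<!-- Optional: machine-readable policy stating your licensing terms -->
<meta name="tdm-policy" content="https://example.com/tdm-policy.json">
```

The same reservation can be sent server-wide as an HTTP response header (`tdm-reservation: 1`) if you'd rather not touch your templates.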

3. AI.txt

Some sites are adopting an ai.txt file (similar to robots.txt) that specifically addresses AI usage rights. It's not an official standard yet, but it's gaining traction. The file sits at your root and declares which AI companies can use your content, for what purposes, and under what terms.

For most sites, robots.txt plus X-Robots-Tag headers cover 95% of what you need. The other layers are worth adding if you're in a content-heavy industry (publishing, media, education) where training data usage is a bigger concern.

How Robots.txt Affects Your AI Visibility Score

Your robots.txt configuration directly impacts how AI search tools can access and cite your content. If you block the wrong bots, your AI visibility score drops because AI search tools literally can't reach your pages during real-time retrieval.

We've seen sites block all AI bots thinking they were "protecting their content" and then wonder why ChatGPT never mentions them. The problem wasn't their content quality or their GEO strategy. It was a three-line robots.txt rule that made them invisible.

To check where you currently stand, run your domain through the AI Authority Checker. It tests your brand across all major AI models and shows you whether you're being cited, ignored, or actively recommended. If your score is low despite having strong content, your robots.txt might be the problem.

Common Mistakes That Kill AI Visibility

These are the robots.txt errors we see most often. Every one of them is fixable in five minutes, and every one costs real traffic when left in place.

Blocking ChatGPT-User when you meant to block GPTBot. These are different bots. GPTBot is training. ChatGPT-User is search. Blocking ChatGPT-User means ChatGPT can't browse your site during conversations, which kills your visibility in the most widely used AI search tool on the planet.

Using User-agent: * with Disallow: /. This blocks everything, including Googlebot, Bingbot, and all AI search crawlers. It's the nuclear option. We've seen stores deploy this during a redesign and forget to remove it. Months of invisible pages.

Not having any AI-specific rules at all. If your robots.txt was last updated in 2023, it probably has rules for Googlebot, Bingbot, and maybe some spam bots. No AI-specific rules means all AI crawlers get full access by default. Your content is being scraped for training right now, and you don't even know it.

Blocking crawlers at the CDN or WAF level instead of robots.txt. Some site owners use Cloudflare or other WAFs to block known AI crawler IPs. This works for enforcement, but it doesn't give AI companies the signal that you're intentionally opting out. Robots.txt is the recognized standard. Use both if you want belt-and-suspenders protection.

Forgetting to test after deploying. Robots.txt changes can have caching delays. After you update your file, verify the live version at yourdomain.com/robots.txt and use Google's robots.txt tester in Search Console to confirm the rules parse correctly.

How This Connects to Your Broader GEO Strategy

Robots.txt is one piece of a larger AI visibility puzzle. It controls access. But access without good content, proper structured data markup, and off-site authority signals won't get you cited. Think of it as the front door: you need to make sure the right bots can get in, but what they find once they're inside determines whether they cite you.

The full AI visibility stack looks like this:

  1. Robots.txt (this guide) controls which bots access your content
  2. Schema markup helps AI parse and understand what your content is about
  3. Content depth gives AI enough material to extract authoritative answers
  4. Off-site signals (Reddit, YouTube, reviews) build the third-party authority AI models weight heavily
  5. Brand consistency across all channels reduces AI "confusion" about who you are and what you sell

If you're just getting started with Generative Engine Optimization, robots.txt is a great first step because it's fast and binary. You either have the right rules or you don't. Fix it once and move on to the harder stuff.

For brands building an active Reddit presence for AI citations, getting robots.txt right is especially important. Reddit content already shows up heavily in AI responses. If AI search crawlers can't access your own site to corroborate what's being said about you on Reddit, you miss the compounding effect of on-site plus off-site authority.

Check your AI visibility right now

Fixing your robots.txt is step one. Step two is knowing whether AI models actually recommend your brand. The AI Authority Checker runs your brand through ChatGPT, Perplexity, Gemini, and Claude with real purchase-intent queries and shows you exactly how often you get cited. Free, instant, no signup required.

Frequently Asked Questions

Does blocking AI crawlers in robots.txt stop AI from mentioning my brand?

No. Blocking training crawlers prevents future content ingestion, but AI models can still mention your brand based on existing training data and third-party sources like Reddit, YouTube, and review sites. If you block search crawlers too, AI search tools won't be able to access your pages during real-time retrieval, which does hurt live citation chances.

What's the difference between an AI search crawler and an AI training crawler?

A search crawler (like ChatGPT-User or PerplexityBot) fetches your content in real time to generate answers with citations. A training crawler (like GPTBot or CCBot) scrapes your content to add to datasets that train or fine-tune language models. Search crawlers drive visibility. Training crawlers take your content for model improvement. Block training, allow search.

How do I edit robots.txt on Shopify?

Go to Online Store > Themes > Edit code, then find or create robots.txt.liquid in the Templates folder. Add your custom User-agent and Disallow rules there. Changes are live immediately on save. Shopify's default robots.txt has no AI-specific rules, so you'll need to add all of them manually.

Should I block all AI bots from my ecommerce store?

No. Blocking all AI bots means ChatGPT, Perplexity, and Google's AI Overviews can't access your content to generate real-time answers. That kills your AI search visibility entirely. Block training crawlers (GPTBot, CCBot, Google-Extended, ClaudeBot, Bytespider). Allow search crawlers (ChatGPT-User, PerplexityBot, Googlebot, Applebot).

Does robots.txt actually prevent AI companies from using my content?

Robots.txt is voluntary. Legitimate crawlers from OpenAI, Google, Anthropic, and Common Crawl respect it. But there's no technical enforcement. Some smaller scraping operations may ignore it entirely. For stronger protection, combine robots.txt with X-Robots-Tag HTTP headers and the TDM Reservation Protocol. Robots.txt is still the first and most widely respected layer.

Will blocking GPTBot hurt my visibility in ChatGPT search?

Only if you also block ChatGPT-User. OpenAI uses separate bots: GPTBot for training, ChatGPT-User for real-time search. Block GPTBot and allow ChatGPT-User to stop training scraping while keeping full visibility in ChatGPT's web-browsing results.
