Robots.txt for AI crawlers: who to allow and why.

The user-agents that decide whether engines can read your PDPs, what each one actually controls, and a starter policy for ecommerce stores.

eCommerce Insights research team · 8 min read


A surprising number of stores are invisible to AI shopping engines by accident. Somewhere between a theme migration, a bot-protection rollout, and a 2023-era "block the AI scrapers" decision, the robots.txt ended up denying the crawlers that feed ChatGPT and Perplexity their product answers. The brand then wonders why competitors get cited and it does not. The fix costs one file edit; knowing what to put in the file requires knowing what each user-agent does.

Three kinds of AI bot

The agents hitting your robots.txt fall into three classes, and the right policy differs per class.

  1. Training crawlers collect content to train future models: GPTBot (OpenAI), Google-Extended (Google), CCBot (Common Crawl), Applebot-Extended (Apple). Blocking them is a philosophical choice about training data; it has little effect on whether today's engines cite you.
  2. Search and answer crawlers build the indexes that shopping answers retrieve from: OAI-SearchBot (ChatGPT search and shopping), PerplexityBot, and ordinary Googlebot, which feeds AI Overviews. Blocking these removes you from the answers directly.
  3. User-action agents fetch a page live when a user or shopping agent needs it: ChatGPT-User, Perplexity-User, Claude-User. Blocking these fails the request at the exact moment a draft-cart agent tries to read your price and availability.

The classes matter because the trade-offs differ. A publisher whose content is the product may rationally block class one. An ecommerce brand whose PDP is an advertisement wants classes two and three reading everything.
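
To make the three classes concrete, here is a minimal Python sketch that sorts a user-agent string into one of them, for example when reading access logs. The token list simply mirrors the agents named above; the `BotClass` enum, the `classify` helper, and the sample header string are illustrative assumptions, not anything an operator publishes.

```python
from enum import Enum


class BotClass(Enum):
    TRAINING = "training"
    ANSWER_INDEX = "answer-index"
    USER_ACTION = "user-action"
    UNKNOWN = "unknown"


# Tokens taken from the three classes described in the article.
# Note: Google-Extended is a robots.txt control token; Google documents it
# as having no HTTP user-agent string of its own, so it will not appear in logs.
BOT_CLASSES = {
    "gptbot": BotClass.TRAINING,
    "google-extended": BotClass.TRAINING,
    "ccbot": BotClass.TRAINING,
    "applebot-extended": BotClass.TRAINING,
    "oai-searchbot": BotClass.ANSWER_INDEX,
    "perplexitybot": BotClass.ANSWER_INDEX,
    "googlebot": BotClass.ANSWER_INDEX,
    "chatgpt-user": BotClass.USER_ACTION,
    "perplexity-user": BotClass.USER_ACTION,
    "claude-user": BotClass.USER_ACTION,
}


def classify(user_agent: str) -> BotClass:
    """Return the class of the first known token found in a user-agent string."""
    ua = user_agent.lower()
    # Check longer tokens first so a short token never shadows a longer one.
    for token in sorted(BOT_CLASSES, key=len, reverse=True):
        if token in ua:
            return BOT_CLASSES[token]
    return BotClass.UNKNOWN


# Illustrative header only; real strings vary by operator and version.
sample = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
print(classify(sample).value)
```

Substring matching is enough for a first pass over logs. User-agent strings can be spoofed, so anything that gates access should also check the operator's published IP ranges.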

The user-agents that matter, as of mid-2026

| User-agent | Operator | Feeds | Ecommerce call |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Model training | allow |
| OAI-SearchBot | OpenAI | ChatGPT search + shopping index | allow |
| ChatGPT-User | OpenAI | Live fetches for users and agents | allow |
| PerplexityBot | Perplexity | Perplexity index and shopping | allow |
| ClaudeBot | Anthropic | Claude training and retrieval | allow |
| Google-Extended | Google | Gemini training (not AI Overviews) | allow |
| Applebot | Apple | Siri and Spotlight answers | allow |
| CCBot | Common Crawl | Open datasets many models train on | allow |
| Bytespider | ByteDance | Aggressive crawling, no shopping surface | block |

Names per each operator's published documentation, mid-2026. OpenAI's bot documentation is the canonical reference for its three agents and their IP ranges.

The Google-Extended confusion

The single most common misreading the research team encounters: blocking Google-Extended in the belief that it controls AI Overviews. It does not. Google-Extended governs whether content trains Gemini models. AI Overviews and AI Mode are Search features built on Googlebot's ordinary index — the only way out of them is out of Search itself. Google documents the split in its crawler documentation. Decide the training-data question and the visibility question separately; conflating them is how stores end up with a policy that does nothing they intended.
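
The split is easy to see with Python's standard-library `urllib.robotparser`. This is a sketch under assumptions: the robots.txt text and the product URL are made up for illustration, and Python's parser only approximates how Google's own parser matches groups.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: opts out of Gemini training, leaves everything else open.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://yourstore.com/products/example"  # placeholder URL

# The Google-Extended group applies only to the Google-Extended token.
training_allowed = parser.can_fetch("Google-Extended", url)

# Googlebot matches no named group here, so it falls back to the * group.
googlebot_allowed = parser.can_fetch("Googlebot", url)

print("Google-Extended:", training_allowed)
print("Googlebot:", googlebot_allowed)
```

The Disallow under Google-Extended leaves Googlebot untouched, which is why that block has no bearing on AI Overviews.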

A starter policy for ecommerce

# AI answer + shopping crawlers — admit
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Applebot
User-agent: CCBot
Allow: /

# Aggressive, no commerce surface — block
User-agent: Bytespider
Disallow: /

Sitemap: https://yourstore.com/sitemap.xml

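Repeat the usual exclusions (cart, checkout, account paths) inside the AI group as well as in your general rules: a crawler that matches a named group follows only that group and ignores the User-agent: * rules, so the bare Allow: / above would otherwise open those paths to it. AI crawlers have no business in them, and excluding them keeps crawl budget on PDPs. And remember robots.txt is a request, not a wall: a bot that ignores the convention needs rate limiting at the CDN or WAF, not another Disallow line.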
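
One detail worth checking before shipping a policy like this: under the Robots Exclusion Protocol (RFC 9309), a crawler that matches a named group obeys only that group and does not inherit the `User-agent: *` rules. The sketch below shows the effect using Python's standard `urllib.robotparser`. The `*` group, the `/cart` and `/checkout` paths, `SomeOtherBot`, and the product URL are hypothetical, and Python's parser only approximates how each real crawler matches rules.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the starter policy plus a hypothetical general group.
POLICY = """\
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
Allow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /cart
Disallow: /checkout
"""


def load_policy(text: str) -> RobotFileParser:
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


policy = load_policy(POLICY)
base = "https://yourstore.com"  # placeholder origin

pdp_for_named = policy.can_fetch("OAI-SearchBot", base + "/products/example")
cart_for_named = policy.can_fetch("OAI-SearchBot", base + "/cart")
cart_for_other = policy.can_fetch("SomeOtherBot", base + "/cart")
pdp_for_blocked = policy.can_fetch("Bytespider", base + "/products/example")

print("named bot, PDP:", pdp_for_named)
print("named bot, cart:", cart_for_named)  # the * exclusions do not apply here
print("other bot, cart:", cart_for_other)
print("Bytespider, PDP:", pdp_for_blocked)
```

If the named agents should stay out of cart and checkout, the Disallow lines have to be repeated inside their own group.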

Verify what your store actually serves

Shopify generates a default robots.txt that you can override with a robots.txt.liquid template; apps and bot-protection layers sometimes inject rules above it. Do not trust the theme — fetch /robots.txt and read it. The free Agentic Readiness Grader checks the major AI user-agents against your live file as part of its score, and the AI Agent Lens feature runs the same evaluation across 15 bots continuously. Pair the robots check with an llms.txt at the root — admittance and a curated summary are complementary, as the llms.txt for Shopify guide lays out.
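
Reading the live file by hand works, and a short script makes the check repeatable. The following is a sketch, not a replacement for the tools above: `fetch_robots`, `audit_robots`, the agent list, and the audit user-agent string are hypothetical names chosen for illustration, and the verdicts come from Python's `urllib.robotparser`, which may differ from a given crawler's own matching.

```python
import urllib.request
from urllib.robotparser import RobotFileParser

# Agents from the article's table; extend as needed.
AI_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
    "ClaudeBot", "Google-Extended", "Applebot", "CCBot", "Bytespider",
]


def fetch_robots(origin: str, timeout: float = 10.0) -> str:
    """Download /robots.txt from an origin such as https://yourstore.com."""
    request = urllib.request.Request(
        origin.rstrip("/") + "/robots.txt",
        headers={"User-Agent": "robots-audit-sketch/0.1"},  # hypothetical UA
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def audit_robots(robots_text: str, agents: list[str], path: str = "/") -> dict[str, bool]:
    """Map each agent to whether the given robots.txt text lets it fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


# Offline demonstration with a made-up file.
demo_text = "User-agent: PerplexityBot\nDisallow: /\n\nUser-agent: *\nDisallow: /cart\n"
demo_report = audit_robots(demo_text, ["PerplexityBot", "OAI-SearchBot"], "/products/example")
print(demo_report)
```

Against a real store the call would look like `audit_robots(fetch_robots("https://yourstore.com"), AI_AGENTS, "/products/example")`, with a real product path substituted in.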

Why this is worth an hour

Crawler admittance is one of the five inputs to the agent-readability score, and it is the only one that can zero out everything else: perfect Product JSON-LD behind a Disallow is invisible. It is also the cheapest input to fix — one file, one deploy, no copywriting. In the audits the research team ran in the first half of 2026, blocked or partially blocked AI crawlers showed up on a meaningful minority of otherwise well-optimized stores, almost always as a leftover from a 2023–2024 blanket policy. Check yours before spending a dollar on content. Then let eCommerce Insights watch it weekly, because app updates and WAF rules regress robots.txt silently — and the product AI visibility guide covers where admittance sits in the larger stack.
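
A silent regression can also be caught by comparing the current file against a saved known-good copy. This is a minimal sketch under assumptions: `regressions` is a hypothetical helper, both robots.txt snippets are invented, and how the baseline is stored and how often the check runs is left to the reader.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_text: str, agent: str, path: str) -> bool:
    """Return True if robots_text lets agent fetch path (per Python's parser)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, path)


def regressions(baseline: str, current: str, agents: list[str], path: str = "/") -> list[str]:
    """List agents that the baseline admitted but the current file turns away."""
    return [
        agent
        for agent in agents
        if is_allowed(baseline, agent, path) and not is_allowed(current, agent, path)
    ]


# Invented snapshots: an update has added a GPTBot block.
baseline_text = "User-agent: *\nAllow: /\n"
current_text = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"

newly_blocked = regressions(baseline_text, current_text, ["GPTBot", "PerplexityBot"])
print(newly_blocked)
```

Comparing per-agent verdicts rather than raw file text avoids false alarms when an app only reorders lines or edits comments.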

Key takeaways

  • AI bots come in three classes: training, answer-index, and live user-action. Policy per class, not per vibe.
  • For ecommerce, allow the answer and action agents — they are how your PDPs get cited and carted.
  • Google-Extended controls Gemini training, not AI Overviews. Googlebot controls AI Overviews.
  • Block Bytespider; handle rogue scrapers at the WAF, not in robots.txt.
  • Fetch your live /robots.txt and verify — admittance gaps zero out every other optimization.

Frequently asked questions

Should an ecommerce store block AI crawlers in robots.txt?
For most stores, no. Blocking the answer-index crawlers, such as OAI-SearchBot or PerplexityBot, removes your PDPs from the retrieval sets those engines cite from, which means competitors get the shopping answers your products should be in. Blocking training crawlers such as GPTBot is a separate choice with little effect on citations. The publisher calculus (content is the product, so protect it) runs the other way for commerce, where the PDP is an ad you want machines to read.
Does blocking Google-Extended remove me from Google AI Overviews?
No, and this is the most common confusion. Google-Extended controls whether your content trains Gemini models. AI Overviews and AI Mode are Search features fed by Googlebot — blocking Google-Extended does not remove you from them, and blocking Googlebot removes you from Search entirely. Decide the training question separately from the visibility question.
What is the difference between GPTBot, OAI-SearchBot, and ChatGPT-User?
Per OpenAI's published bot documentation: GPTBot crawls for model training. OAI-SearchBot indexes for ChatGPT search results and shopping answers. ChatGPT-User fetches a page live when a user or agent action requires it. For commerce visibility, OAI-SearchBot and ChatGPT-User matter most; many brands allow all three.
Which AI bot should ecommerce stores block?
Bytespider, ByteDance's crawler, is the usual block: it has been widely reported to ignore robots.txt conventions and crawl aggressively, and it feeds no shopping surface a D2C brand wins from as of mid-2026. Aggressive unknown scrapers belong at the WAF level instead, since a bot that ignores robots.txt is not stopped by it.
