A surprising number of stores are invisible to AI shopping engines by accident. Somewhere between a theme migration, a bot-protection rollout, and a 2023-era "block the AI scrapers" decision, the robots.txt ended up denying the crawlers that feed ChatGPT and Perplexity their product answers. The brand then wonders why competitors get cited and it does not. The fix costs one file edit; knowing what to put in the file requires knowing what each user-agent does.
Three kinds of AI bot
The agents hitting your robots.txt fall into three classes, and the right policy differs per class.
- Training crawlers collect content to train future models: GPTBot (OpenAI), Google-Extended (Google), CCBot (Common Crawl), and Applebot-Extended (Apple). Blocking them is a philosophical choice about training data; it has little effect on whether today's engines cite you.
- Search and answer crawlers build the indexes that shopping answers retrieve from: OAI-SearchBot (ChatGPT search and shopping), PerplexityBot, and ordinary Googlebot, which feeds AI Overviews. Blocking these removes you from those answers directly.
- User-action agents fetch a page live when a user or shopping agent needs it: ChatGPT-User, Perplexity-User, and Claude-User. Blocking these causes a failure at the exact moment a draft-cart agent tries to read your price and availability.
The classes matter because the trade-offs differ. A publisher whose content is the product may rationally block class one. An ecommerce brand, whose product detail page (PDP) is effectively an advertisement, wants classes two and three reading everything.
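The per-class reasoning above can be sketched as a small lookup. This is a minimal illustration, not an official taxonomy: the `BotClass` enum, the `AGENT_CLASS` table, and the `decide` helper are hypothetical names, and the agent list only covers the tokens named in this section.

```python
from enum import Enum


class BotClass(Enum):
    TRAINING = "training"        # class one: feeds future models
    SEARCH = "search"            # class two: builds answer indexes
    USER_ACTION = "user-action"  # class three: live fetches for a user


# Agents as grouped in the list above (illustrative, not exhaustive).
AGENT_CLASS = {
    "GPTBot": BotClass.TRAINING,
    "Google-Extended": BotClass.TRAINING,
    "CCBot": BotClass.TRAINING,
    "Applebot-Extended": BotClass.TRAINING,
    "OAI-SearchBot": BotClass.SEARCH,
    "PerplexityBot": BotClass.SEARCH,
    "Googlebot": BotClass.SEARCH,
    "ChatGPT-User": BotClass.USER_ACTION,
    "Perplexity-User": BotClass.USER_ACTION,
    "Claude-User": BotClass.USER_ACTION,
}


def decide(agent: str, block_training: bool) -> str:
    """Return 'allow', 'block', or 'unknown' for a robots.txt agent token.

    Only the training class is ever blocked here; search and user-action
    agents are always allowed, matching the ecommerce reasoning above.
    """
    bot_class = AGENT_CLASS.get(agent)
    if bot_class is None:
        return "unknown"
    if bot_class is BotClass.TRAINING and block_training:
        return "block"
    return "allow"


# A publisher might set block_training=True; a store would leave it False.
for agent in ("GPTBot", "OAI-SearchBot", "ChatGPT-User"):
    print(agent, decide(agent, block_training=True))
```

Keeping the class as data rather than hard-coding allow/block per agent means the training-data decision can change without touching the visibility decision.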
The user-agents that matter, as of mid-2026
| User-agent | Operator | Feeds | Ecommerce call |
|---|---|---|---|
| GPTBot | OpenAI | Model training | allow |
| OAI-SearchBot | OpenAI | ChatGPT search + shopping index | allow |
| ChatGPT-User | OpenAI | Live fetches for users and agents | allow |
| PerplexityBot | Perplexity | Perplexity index and shopping | allow |
| ClaudeBot | Anthropic | Claude training and retrieval | allow |
| Google-Extended | Google | Gemini training (not AI Overviews) | allow |
| Applebot | Apple | Siri and Spotlight answers | allow |
| CCBot | Common Crawl | Open datasets many models train on | allow |
| Bytespider | ByteDance | Aggressive crawling, no shopping surface | block |
Names per each operator's published documentation, mid-2026. OpenAI's bot documentation is the canonical reference for its three agents and their IP ranges.
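As a rough sketch, the calls in the table could be rendered into a robots.txt and sanity-checked with Python's standard `urllib.robotparser`. The `POLICY` list, the `render_robots_txt` helper, and the `shop.example` URL are illustrative assumptions, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# (user-agent token, ecommerce call) pairs copied from the table.
POLICY = [
    ("GPTBot", "allow"),
    ("OAI-SearchBot", "allow"),
    ("ChatGPT-User", "allow"),
    ("PerplexityBot", "allow"),
    ("ClaudeBot", "allow"),
    ("Google-Extended", "allow"),
    ("Applebot", "allow"),
    ("CCBot", "allow"),
    ("Bytespider", "block"),
]


def render_robots_txt(policy):
    """Emit one User-agent group per row, separated by blank lines."""
    groups = []
    for agent, call in policy:
        rule = "Allow: /" if call == "allow" else "Disallow: /"
        groups.append(f"User-agent: {agent}\n{rule}\n")
    return "\n".join(groups)


def can_fetch(robots_txt, agent, url="https://shop.example/products/widget"):
    """Parse the text in memory and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


robots_txt = render_robots_txt(POLICY)
print(robots_txt)
print("OAI-SearchBot allowed:", can_fetch(robots_txt, "OAI-SearchBot"))
print("Bytespider allowed:", can_fetch(robots_txt, "Bytespider"))
```

Explicit per-agent groups are more verbose than a single wildcard rule, but they make each decision visible in the file, which helps when a later migration or bot-protection change rewrites it.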
The Google-Extended confusion
The single most common misreading the research team encounters: blocking Google-Extended in the belief that it controls AI Overviews. It does not. Google-Extended governs whether content trains Gemini models. AI Overviews and AI Mode are Search features built on Googlebot's ordinary index, so the only way out of them is to leave Search itself. Google documents the split in its crawler documentation. Decide the training-data question and the visibility question separately; conflating them is how stores end up with a policy that does nothing they intended.
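The separation can be illustrated with a small check using Python's standard `urllib.robotparser`. This is a sketch under assumptions: the robots.txt text and the `shop.example` URL are made up for the example, and the parser only models how robots.txt groups are matched, not how Google's systems behave internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: opt out of Gemini training, leave everything else open.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt, agent, url="https://shop.example/products/widget"):
    """Return True if `agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The Google-Extended group does not apply to Googlebot, so the ordinary
# Search index (and the AI features built on it) is unaffected.
print("Google-Extended:", allowed(ROBOTS_TXT, "Google-Extended"))
print("Googlebot:", allowed(ROBOTS_TXT, "Googlebot"))
```

Under this file the training token is denied while Googlebot still falls through to the wildcard group, which is why a Google-Extended block alone leaves a store in AI Overviews.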