Find out where AI engines get their facts about your products.

ChatGPT just told a shopper your jacket isn't waterproof. It is. Before you can correct the record, you need to know where the claim came from — because the engine didn't invent it, it retrieved it. Every AI answer about your products is assembled from a source graph you can map: your PDPs, review platforms, Reddit threads, and publisher roundups.

Quick answer

Run buyer-style prompts on the engines that cite sources, capture every cited URL, and classify them into four classes: own PDPs, review platforms, community threads, publishers. Then check what each engine can read on your side — a blocked crawler or missing schema guarantees third parties narrate your product. eCommerce Insights stores every answer and citation verbatim in prompt runs, so the graph builds itself.

The two layers: training data and live retrieval

"Where does AI get its information" has a two-part answer, and the part that matters for commerce is the second one. Training data gives a model broad knowledge with a cutoff date — useful for what a category is, useless for this week's price or last month's product launch. Live retrieval is what shopping answers actually run on: the engine searches the web at answer time, reads a handful of pages, and synthesizes. That is why the cited sources, not the answer prose, are the diagnostic. OpenAI's own crawler documentation splits the two layers explicitly: GPTBot feeds training, OAI-SearchBot feeds the live search index that answers draw on.

For product queries, retrieval draws on four source classes, each with its own failure mode for a brand:

Your PDPs (you control): the only class you edit directly; ignored when crawlers are blocked or schema is missing
Review platforms (you influence): engines treat review volume and ratings as trust signals and cite review pages outright
Community threads (you participate): Reddit is heavily represented in retrieval; old threads outrank new PDPs on "is X actually good" queries
Publisher roundups (you pitch): "best X" listicles often hold the citations brands want; stale ones spread stale specs
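
The four-way split above can be sketched as a small lookup on the cited URL's hostname. This is a minimal illustration, not how eCommerce Insights classifies citations: the domain sets, the `classify` function, and the `unclassified` fallback are all assumptions, and `example-outdoor.com` and `example-gear-reviews.com` are placeholder names.

```python
from urllib.parse import urlparse

# Hypothetical domain lists for illustration only; replace them with the
# domains that actually appear in your own citation data.
OWN_DOMAINS = {"example-outdoor.com"}
REVIEW_PLATFORMS = {"trustpilot.com"}
COMMUNITY = {"reddit.com"}
PUBLISHERS = {"example-gear-reviews.com"}

SOURCE_CLASSES = [
    ("own_pdp", OWN_DOMAINS),
    ("review_platform", REVIEW_PLATFORMS),
    ("community", COMMUNITY),
    ("publisher", PUBLISHERS),
]


def host_of(url: str) -> str:
    """Return the lowercased hostname of a URL, or '' if it has none."""
    return (urlparse(url).hostname or "").lower()


def matches(host: str, domains: set[str]) -> bool:
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def classify(url: str) -> str:
    """Map a cited URL to a source class, or 'unclassified' if unknown."""
    host = host_of(url)
    for label, domains in SOURCE_CLASSES:
        if matches(host, domains):
            return label
    return "unclassified"


if __name__ == "__main__":
    for cited in [
        "https://example-outdoor.com/products/jacket",
        "https://www.reddit.com/r/hiking/comments/abc",
        "https://unknown.example/post",
    ]:
        print(classify(cited), cited)
```

Anything that lands in `unclassified` is worth a manual look before it gets a class; a new domain showing up in citations is itself a signal.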

Where does ChatGPT get its information?

The most-asked version of this job is engine-specific, so here is the ChatGPT answer in full. ChatGPT gets its information from two places: the training data behind the model — general knowledge, frozen at a cutoff date — and live web search, which engages whenever the question needs current facts: prices, availability, anything phrased "best X right now." OpenAI runs separate crawlers for each layer: GPTBot collects training data, OAI-SearchBot builds the index behind ChatGPT search, and ChatGPT-User fetches pages on demand when a question triggers a live lookup. So when people ask where ChatGPT gets its info from in a product answer, the practical answer is: from whatever PDPs, review platforms, Reddit threads, and publisher roundups its search layer could retrieve and parse at answer time.

The same logic answers "where does ChatGPT get its data" about your specific products. If your store admits OAI-SearchBot and ChatGPT-User and your PDPs carry complete Product schema, ChatGPT can read you directly; if not, it reads about you — from third parties, stale details included. The free ChatGPT product visibility checker shows which of the two is happening for any product in about a minute.
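
A quick way to test the crawler half of that condition is to evaluate your robots.txt against the three OpenAI user agents named above. The sketch below uses Python's standard `urllib.robotparser`; the sample robots.txt and the product URL are made up for illustration, and the parser's matching may differ in edge cases from how the crawlers themselves interpret the file.

```python
from urllib.robotparser import RobotFileParser

# The three OpenAI user agents named in this article.
OPENAI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]


def crawler_access(robots_txt: str, page_url: str) -> dict[str, bool]:
    """Report, per crawler, whether robots.txt allows fetching page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, page_url) for agent in OPENAI_CRAWLERS}


# Hypothetical robots.txt: training crawler blocked, everything else
# allowed apart from the cart.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /cart
"""

if __name__ == "__main__":
    result = crawler_access(
        SAMPLE_ROBOTS, "https://example-outdoor.com/products/jacket"
    )
    for agent, allowed in result.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With this sample file, GPTBot is blocked while OAI-SearchBot and ChatGPT-User are allowed, so the page stays readable to the search layer even though it is excluded from training.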

The slow way: prompt, screenshot, classify, repeat

The manual version works for a handful of products. Open Perplexity logged out — it attaches 3–7 cited sources to every shopping answer, making it the most legible engine to start with. Run your buyer prompts, copy every cited URL into a spreadsheet, and tag each by source class. Repeat on ChatGPT with web search and on Google, where AI Overviews cites inline. Two warnings from doing this honestly: never ask the model where it got its information after the fact — self-reports are plausible narratives, not retrieval logs — and never trust a single run, because retrieval is non-deterministic and the source mix shifts between runs.
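
Because a single run is not reliable, one way to read the spreadsheet is per domain across runs: in what share of runs did each domain get cited at least once? The sketch below assumes each run is simply a list of cited URLs copied from one answer; the function name and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_run_share(runs: list[list[str]]) -> dict[str, float]:
    """Fraction of runs in which each domain was cited at least once."""
    counts: Counter[str] = Counter()
    for cited_urls in runs:
        # A set, so a domain cited twice in one answer counts once per run.
        domains = {(urlparse(u).hostname or "").lower() for u in cited_urls}
        domains.discard("")
        counts.update(domains)
    total = len(runs)
    return {d: n / total for d, n in counts.items()} if total else {}


if __name__ == "__main__":
    # Hypothetical citations from four runs of the same buyer prompt.
    runs = [
        ["https://www.reddit.com/r/a", "https://example-outdoor.com/p/jacket"],
        ["https://www.reddit.com/r/b"],
        ["https://www.reddit.com/r/a", "https://www.reddit.com/r/c"],
        ["https://example-gear-reviews.com/best-jackets"],
    ]
    for domain, share in sorted(domain_run_share(runs).items()):
        print(f"{domain}: cited in {share:.0%} of runs")
```

A domain that appears in most runs is a stable part of the source graph; one that appears once may just be retrieval noise.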

The classification is the valuable part and the part that decays first. Ten products, three engines, ten prompts each, weekly — that is 300 answers to capture and several hundred citations to tag. By week three the spreadsheet is behind, and the question that started the exercise ("why does ChatGPT keep saying our jacket isn't waterproof?") is still open.


The eCommerce Insights way

  1. Capture every answer with its citations, verbatim. Prompt runs store the full response and every cited URL for each product, each engine, each refresh — the raw material of the source graph, with history from day one.
  2. Classify citations automatically. Each cited domain is tagged — own site, review platform, community, publisher, competitor — so the mix per product is a chart, not an afternoon. The citation analysis glossary entry covers why a citation and a mention are different signals.
  3. Check what each engine can read on your side. The agent-readability score cross-checks robots.txt admittance per crawler and Product JSON-LD per page. When an engine describes your product from third parties only, this check usually names the reason — start with the free ChatGPT checker and AI Overviews checker for a one-page version.
  4. Work the graph by class. Fix the first-party layer first (schema, crawl access, answer coverage), then build the third-party grounding that engines weight: review depth on the platforms they actually cite, refreshed publisher coverage, honest community presence. When a competitor holds the citations you want, the companion job is "find out why ChatGPT recommends a competitor".
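
The schema half of step 3 can be approximated by hand with the standard library: pull the JSON-LD blocks out of a PDP's HTML and see whether a Product node is present. This is a rough sketch, not the agent-readability score itself. The three checked fields are an illustrative subset of schema.org Product properties chosen for this example, not an official requirement list, and the sample page is invented.

```python
import json
from html.parser import HTMLParser

# Illustrative subset of schema.org Product properties (an assumption,
# not a definitive list of what any engine requires).
CHECKED_FIELDS = ("name", "description", "offers")


class JsonLdExtractor(HTMLParser):
    """Collects parsed <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD is treated as absent


def product_nodes(blocks):
    """Find Product nodes at top level, in lists, or inside @graph."""
    found = []
    for block in blocks:
        for item in block if isinstance(block, list) else [block]:
            if not isinstance(item, dict):
                continue
            for node in item.get("@graph", [item]):
                if not isinstance(node, dict):
                    continue
                node_type = node.get("@type")
                types = node_type if isinstance(node_type, list) else [node_type]
                if "Product" in types:
                    found.append(node)
    return found


def missing_fields(html: str):
    """None if no Product JSON-LD exists, else the checked fields it lacks."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    products = product_nodes(extractor.blocks)
    if not products:
        return None
    return [f for f in CHECKED_FIELDS if not products[0].get(f)]


if __name__ == "__main__":
    # Hypothetical PDP markup with a Product node that has no description.
    page = """<html><head><script type="application/ld+json">
    {"@context": "https://schema.org", "@type": "Product",
     "name": "Trail Jacket",
     "offers": {"@type": "Offer", "price": "199.00", "priceCurrency": "USD"}}
    </script></head><body></body></html>"""
    print(missing_fields(page))
```

A return value of `None` means the page carries no Product JSON-LD at all, which is the case to fix first; a non-empty list names the gaps in an existing block.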

What "good" looks like

Directional reads for a mid-market D2C product, from eCommerce Insights tracking as of mid-2026 (illustrative):

Your PDP appears among cited sources, corroborated by fresh third-party coverage: healthy graph
Brand mentioned, but every citation is third-party: first-party gap
Citations dominated by one stale roundup or thread: fragile, refresh it
Engine answers without citing anything you recognize: map before fixing

The pattern worth internalizing: an engine that cannot read your page does not say so — it answers anyway, from whatever it could read. Mapping the source graph turns "the AI is wrong about us" from a complaint into a fix list with owners.

Frequently asked questions

Where does AI get its information about products?
Two layers. Training data gives the model general knowledge with a cutoff date, and live retrieval — web search at answer time — supplies the current facts shopping answers are built from. For products, retrieval draws on four source classes: the brand's own PDPs, review platforms, community threads (Reddit is heavily represented), and publisher roundups. Which class dominates varies by engine and by query, which is why the citation list, not the answer text, is the thing to study.
Why does ChatGPT describe my product with details that aren't on my site?
Because it is reading someone else. When an engine cannot retrieve or parse your PDP — blocked crawler, missing Product schema, thin prose — it assembles the product from third-party sources: an old review, a Reddit thread, a publisher's roundup of a previous model year. The fix is making your page the easiest source to retrieve and quote — start with fixing the product schema; until then, third parties narrate, mistakes included.
Which engines actually show their sources?
Perplexity is the most citation-forward, attaching 3–7 sources to a shopping answer. ChatGPT cites when web search engages. Google AI Overviews cites sources inline above the classic results. Gemini, Claude, and Copilot cite with web grounding to varying degrees. Engines answering the same question can disagree because they retrieved different sources, so comparing citation lists across engines is the fastest way to see each one's bias.
Can I just ask ChatGPT where it got its information?
Only the citations are trustworthy. Asked to explain itself after the fact, a model produces a plausible narrative, not a retrieval log — self-reports about sources are unreliable by construction. Use the cited URLs captured at answer time; that is exactly what prompt runs store, verbatim, so the source graph is built from evidence rather than the model's memory of itself.
How much of this source graph can I actually control?
Directly: your PDPs — crawler access, Product JSON-LD, answer coverage — and your presence on review platforms. Indirectly: publisher coverage and community sentiment, which respond to outreach and product quality but cannot be edited. The practical split is to fix the first-party layer this quarter (it is the cheapest and the most commonly broken) and treat third-party grounding as the ongoing program.
Where does ChatGPT get its information from?
From model training data and, for anything current, live web search. OpenAI operates three relevant crawlers: GPTBot (training), OAI-SearchBot (the search index), and ChatGPT-User (on-demand fetches). For product questions the search layer dominates, drawing on your PDPs, review platforms, community threads, and publisher roundups — whichever it can retrieve and parse. The cited URLs in an answer are the reliable record of which sources it used.
How does ChatGPT get its information?
By retrieval at answer time. When a question needs current facts, ChatGPT issues search queries, fetches a handful of pages through OAI-SearchBot's index or live ChatGPT-User requests, and composes the answer from what those pages say. If your product pages are blocked or unparseable, it retrieves competitors and third parties instead — which is why crawler admittance and Product schema are the first two checks in this job.

The engines cite their sources. Read them.

Every answer, every citation, stored verbatim per product — the source graph without the spreadsheet.