Find out where AI engines get their facts about your products.
ChatGPT just told a shopper your jacket isn't waterproof. It is. Before you can correct the record, you need to know where the claim came from — because the engine didn't invent it, it retrieved it. Every AI answer about your products is assembled from a source graph you can map: your PDPs, review platforms, Reddit threads, and publisher roundups.
Run buyer-style prompts on the engines that cite sources, capture every cited URL, and classify them into four classes: own PDPs, review platforms, community threads, publishers. Then check what each engine can read on your side — a blocked crawler or missing schema guarantees third parties narrate your product. eCommerce Insights stores every answer and citation verbatim in prompt runs, so the graph builds itself.
The two layers: training data and live retrieval
"Where does AI get its information" has a two-part answer, and the part that matters for commerce is the second one. Training data gives a model broad knowledge with a cutoff date — useful for what a category is, useless for this week's price or last month's product launch. Live retrieval is what shopping answers actually run on: the engine searches the web at answer time, reads a handful of pages, and synthesizes. That is why the cited sources, not the answer prose, are the diagnostic. OpenAI's own crawler documentation splits the two layers explicitly: GPTBot feeds training, OAI-SearchBot feeds the live search index that answers draw on.
For product queries, retrieval draws on four source classes, each with its own failure mode for a brand: your own PDPs, review platforms, community threads, and publisher roundups.
Where does ChatGPT get its information?
The most-asked version of this job is engine-specific, so here is the ChatGPT answer in full. ChatGPT gets its information from two places: the training data behind the model — general knowledge, frozen at a cutoff date — and live web search, which engages whenever the question needs current facts: prices, availability, anything phrased "best X right now." OpenAI runs separate crawlers for each layer: GPTBot collects training data, OAI-SearchBot builds the index behind ChatGPT search, and ChatGPT-User fetches pages on demand when a question triggers a live lookup. So when people ask where ChatGPT gets its info from in a product answer, the practical answer is: from whatever PDPs, review platforms, Reddit threads, and publisher roundups its search layer could retrieve and parse at answer time.
The same logic answers "where does ChatGPT get its data" about your specific products. If your store admits OAI-SearchBot and ChatGPT-User and your PDPs carry complete Product schema, ChatGPT can read you directly; if not, it reads about you — from third parties, stale details included. The free ChatGPT product visibility checker shows which of the two is happening for any product in about a minute.
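The admittance half of that check can be sketched with Python's standard-library robots.txt parser. This is a minimal sketch, not how any checker tool actually works: the robots.txt content, the `shop.example.com` URL, and the `crawler_access` helper are all hypothetical, and a robots.txt rule only tells you what a well-behaved crawler is permitted to fetch, not what it has indexed.

```python
from urllib.robotparser import RobotFileParser

# OpenAI crawler tokens named in the text above.
OPENAI_AGENTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]


def crawler_access(robots_txt: str, page_url: str, agents=OPENAI_AGENTS) -> dict:
    """Return {agent: allowed?} for page_url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, page_url) for agent in agents}


# Hypothetical robots.txt: training crawler blocked, search crawlers admitted.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Disallow: /cart
"""

if __name__ == "__main__":
    url = "https://shop.example.com/products/storm-jacket"  # hypothetical PDP
    for agent, allowed in crawler_access(SAMPLE_ROBOTS, url).items():
        print(f"{agent:15} {'allowed' if allowed else 'BLOCKED'}")
```

In this sample the store opts out of training (GPTBot) while staying readable to the two retrieval crawlers, which is the split the two-layer model makes possible.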
The slow way: prompt, screenshot, classify, repeat
The manual version works for a handful of products. Open Perplexity logged out — it attaches 3–7 cited sources to every shopping answer, making it the most legible engine to start with. Run your buyer prompts, copy every cited URL into a spreadsheet, and tag each by source class. Repeat on ChatGPT with web search and on Google, where AI Overviews cites inline. Two warnings from doing this honestly: never ask the model where it got its information after the fact — self-reports are plausible narratives, not retrieval logs — and never trust a single run, because retrieval is non-deterministic and the source mix shifts between runs.
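The tagging step, and the advice to pool several runs instead of trusting one, can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the domain lists, the `shop.example.com` store, and the rule that any unlisted domain counts as a publisher are placeholders to replace with what your own citation logs show.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical domain lists; replace with the domains that actually appear
# in your own citation logs.
OWN_DOMAINS = {"shop.example.com"}
REVIEW_PLATFORMS = {"trustpilot.com"}
COMMUNITY = {"reddit.com"}


def _matches(host: str, domains: set) -> bool:
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def classify(url: str) -> str:
    """Tag a cited URL with one of the four source classes."""
    host = (urlparse(url).hostname or "").lower()
    if _matches(host, OWN_DOMAINS):
        return "own"
    if _matches(host, REVIEW_PLATFORMS):
        return "review_platform"
    if _matches(host, COMMUNITY):
        return "community"
    return "publisher"  # fallback assumption: everything else is editorial


def source_mix(runs: list) -> dict:
    """Share of citations per class, pooled over repeated runs of one prompt."""
    counts = Counter(classify(url) for run in runs for url in run)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()} if total else {}


if __name__ == "__main__":
    # Two hypothetical runs of the same buyer prompt.
    runs = [
        ["https://shop.example.com/products/storm-jacket",
         "https://www.reddit.com/r/hiking/comments/abc"],
        ["https://www.reddit.com/r/camping/comments/def",
         "https://gearmag.example.org/best-rain-jackets"],
    ]
    print(source_mix(runs))
```

Pooling runs before computing shares is the design choice that matters: a class that appears in one run out of five is a weaker signal than one that appears in all five, and a single-run spreadsheet cannot tell them apart.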
The classification is the valuable part and the part that decays first. Ten products on three engines with ten prompts each is 300 answers to capture every week, plus several hundred citations to tag. By week three the spreadsheet is behind, and the question that started the exercise ("why does ChatGPT keep saying our jacket isn't waterproof?") is still open.
The eCommerce Insights way
- Capture every answer with its citations, verbatim. Prompt runs store the full response and every cited URL for each product, each engine, each refresh — the raw material of the source graph, with history from day one.
- Classify citations automatically. Each cited domain is tagged — own site, review platform, community, publisher, competitor — so the mix per product is a chart, not an afternoon. The citation analysis glossary entry covers why a citation and a mention are different signals.
- Check what each engine can read on your side. The agent-readability score cross-checks robots.txt admittance per crawler and Product JSON-LD per page. When an engine describes your product from third parties only, this check usually names the reason — start with the free ChatGPT checker and AI Overviews checker for a one-page version.
- Work the graph by class. Fix the first-party layer first (schema, crawl access, answer coverage), then build the third-party grounding that engines weight: review depth on the platforms they actually cite, refreshed publisher coverage, honest community presence. When a competitor holds the citations you want, the companion job is finding out why ChatGPT recommends a competitor.
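For the first-party layer, the Product JSON-LD part of a readability check can be approximated with the Python standard library alone. This is a rough sketch under stated assumptions, not the agent-readability score itself: `EXPECTED_FIELDS` is an arbitrary choice of schema.org Product properties, and the sample page and the `missing_product_fields` helper are hypothetical.

```python
import json
from html.parser import HTMLParser

# Assumed "complete enough" field set; adjust to the properties you rely on.
EXPECTED_FIELDS = ("name", "description", "offers")


class _JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def _is_product(node: dict) -> bool:
    types = node.get("@type", [])
    return "Product" in ([types] if isinstance(types, str) else types)


def missing_product_fields(html: str):
    """Missing EXPECTED_FIELDS for the first Product node, or None if no Product."""
    collector = _JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD is skipped, as a strict parser would
        if isinstance(data, dict):
            nodes = data.get("@graph", [data])
        elif isinstance(data, list):
            nodes = data
        else:
            continue
        for node in nodes:
            if isinstance(node, dict) and _is_product(node):
                return [f for f in EXPECTED_FIELDS if f not in node]
    return None


# Hypothetical PDP: Product schema present, but no price or availability.
SAMPLE_PDP = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Storm Jacket", "description": "Waterproof 3-layer shell."}
</script>
</head><body></body></html>
"""

if __name__ == "__main__":
    print(missing_product_fields(SAMPLE_PDP))
```

A result of `None` (no Product node at all) and a non-empty list (Product present but incomplete) point to different fixes, which is why the sketch keeps them distinct.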
What "good" looks like
Directional reads for a mid-market D2C product, from eCommerce Insights tracking as of mid-2026 (illustrative):
The pattern worth internalizing: an engine that cannot read your page does not say so — it answers anyway, from whatever it could read. Mapping the source graph turns "the AI is wrong about us" from a complaint into a fix list with owners.
Frequently asked questions
Where does AI get its information about products?
Why does ChatGPT describe my product with details that aren't on my site?
Which engines actually show their sources?
Can I just ask ChatGPT where it got its information?
How much of this source graph can I actually control?
Where does ChatGPT get its information from?
How does ChatGPT get its information?
The engines cite their sources. Read them.
Every answer, every citation, stored verbatim per product — the source graph without the spreadsheet.