A familiar scene from mid-2026: the CEO asks ChatGPT for the best products in the brand's category and the brand is the first recommendation. The head of growth runs the identical prompt an hour later and the brand is absent. Both screenshots land in Slack; an argument follows about which one is "right." The answer is neither — and understanding why is the difference between measuring AI visibility and collecting anecdotes.
Five mechanisms make answers vary
1. Sampling: the model rolls dice on every word
Language models generate text probabilistically — at each step the model picks among likely next words rather than always taking one fixed choice. Identical prompt, identical user, identical moment: different runs still produce different answers, and in a recommendation list that difference is often which brand fills slot three. This is by design, not a bug, and it alone guarantees that "the" ChatGPT answer to a shopping question does not exist.
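To make the dice-rolling concrete, here is a minimal sketch of temperature-based sampling over a handful of candidates for "slot three". The brand names, scores, and the `sample_slot` and `tally` helpers are all invented for illustration; real models sample over tokens from a far larger vocabulary, and nothing here reflects any vendor's actual decoding settings.

```python
import math
import random

# Hypothetical scores (logits) for the brand that fills slot three.
# Names and numbers are made up for illustration only.
LOGITS = {"BrandA": 2.0, "BrandB": 1.7, "BrandC": 1.5, "BrandD": 0.4}


def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = {name: score / temperature for name, score in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {name: math.exp(value - top) for name, value in scaled.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def sample_slot(logits, rng, temperature=1.0):
    """Draw one brand according to the softmax probabilities."""
    probs = softmax(logits, temperature)
    names = list(probs)
    weights = [probs[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def tally(runs=1000, seed=0):
    """Count how often each brand wins the slot across repeated runs."""
    rng = random.Random(seed)
    counts = {name: 0 for name in LOGITS}
    for _ in range(runs):
        counts[sample_slot(LOGITS, rng)] += 1
    return counts


print(softmax(LOGITS))  # BrandA is the most likely pick, not the only one
print(tally())          # every run uses the same "prompt"; the winner still varies
```

Even with one brand clearly ahead, the other candidates win the slot in a meaningful share of runs, which is the behavior the screenshots in Slack are sampling.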
2. Personalization and memory: your history is in the prompt
ChatGPT's memory carries facts from earlier conversations — budget, sizes, brands mentioned, a stated preference for natural fibers — into later answers, per OpenAI's Memory FAQ, and custom instructions add explicit standing preferences on top. Two shoppers asking the identical question are, from the model's side, asking different questions. A loyal customer's ChatGPT may keep recommending you for reasons that have nothing to do with what new shoppers see — which makes testing your own visibility from your own account roughly as reliable as Googling yourself while logged in circa 2012.
3. Live retrieval: the sources change under the answer
For buying-intent queries, ChatGPT searches the web and composes from what it retrieves. Retrieval is its own moving part: which sub-queries the engine issues (query fan-out), which pages the index serves at that moment, which fetches succeed. Run the same prompt during a competitor's press cycle and the retrieved set shifts. The selection step is narrow, with only a handful of sources feeding each answer, so small retrieval shifts swing who gets named.
4. Geography and language
Shopping answers skew toward regionally available retailers, local-language sources, and market-specific pricing. A US team and an EU team comparing screenshots are sampling two different geographies of the same distribution, before any of the other four mechanisms apply.
5. Model routing and experiments
"ChatGPT" is several models behind one text box: plan tiers differ, requests route to different family members by load and complexity, and vendors run experiments continuously. Which model answered is invisible to the user and material to the answer. The same applies across the other engines — Perplexity routes across frontier models explicitly — which is one more reason per-engine measurement beats cross-engine anecdotes.
What this means for brands: visibility is a distribution
Put the five together and the question "does my product show up in ChatGPT?" has no yes/no answer. The honest object is a rate: across N sampled answers to relevant buying-intent prompts, the product appeared in K, at an average position of P. A screenshot is one draw from that distribution. It can demo the problem to a board; it cannot measure anything, and acting on single draws produces the failure mode where teams "fix" pages that were never broken and celebrate wins that were noise.
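The rate described above can be sketched in a few lines. This assumes each sampled answer has already been parsed into an ordered list of recommended brands; the parsing step, the `visibility` helper, the `VisibilityStats` type, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class VisibilityStats:
    n: int                         # N: answers sampled
    k: int                         # K: answers that named the product
    rate: float                    # K / N
    avg_position: Optional[float]  # P: mean 1-based rank when present


def visibility(answers: Sequence[Sequence[str]], brand: str) -> VisibilityStats:
    """Each answer is the ordered list of brands one sampled run recommended."""
    positions = [answer.index(brand) + 1 for answer in answers if brand in answer]
    n, k = len(answers), len(positions)
    rate = k / n if n else 0.0
    avg_position = sum(positions) / k if k else None
    return VisibilityStats(n, k, rate, avg_position)


# Hypothetical parsed runs of one buying-intent prompt.
SAMPLE_RUNS = [
    ["Acme", "Borealis", "Cobalt"],
    ["Borealis", "Cobalt", "Dune"],
    ["Borealis", "Acme", "Cobalt"],
    ["Cobalt", "Dune", "Acme"],
]

stats = visibility(SAMPLE_RUNS, "Acme")
print(stats)  # n=4, k=3, rate=0.75, avg_position=2.0
```

The first run alone would read as "we are the top recommendation" and the second as "we are absent"; the four together say 75% at an average position of 2. Four runs is still far too few to trust, and is used here only to keep the example readable.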
The measurement discipline follows directly. Sample repeatedly: the same prompts, many runs, on a schedule. Hold the prompt set constant so that movement means something (prompt tracking). Read per engine, since each engine has its own variance profile. And separate drift from trend: a citation rate moving from 70% to 64% for a single week is, at modest sample sizes, within sampling noise; three consecutive weeks of decline on one product and one engine is a signal worth a product detail page (PDP) review. This is the reading discipline covered under LLM visibility.
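One way to check the drift-versus-trend distinction is a standard two-proportion z-test plus a simple run-of-declines rule. The sketch below assumes 100 sampled runs per week, a number chosen for illustration and not taken from any measurement, and it reads "three consecutive weeks of decline" as three week-over-week drops in a row; both helper names are hypothetical.

```python
import math


def two_proportion_z(k1, n1, k2, n2):
    """z statistic for the change between two observed rates (pooled standard error)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se if se else 0.0


def consecutive_declines(rates, weeks=3):
    """True if the last `weeks` week-over-week changes were all decreases."""
    if len(rates) < weeks + 1:
        return False
    tail = rates[-(weeks + 1):]
    return all(later < earlier for earlier, later in zip(tail, tail[1:]))


# Assumed: 100 runs per week, 70 citations last week, 64 this week.
z = two_proportion_z(70, 100, 64, 100)
print(round(z, 2))  # about -0.9, well inside the usual +/-1.96 band

print(consecutive_declines([0.70, 0.68, 0.66, 0.63]))  # True: three drops in a row
print(consecutive_declines([0.70, 0.64, 0.69, 0.66]))  # False: the dip recovered
```

The verdict depends on sample size: at 1,000 runs per week the same six-point drop would give a z statistic near -2.9 and would no longer be dismissible as noise, which is why the number of runs behind each weekly rate should be recorded alongside it.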