A new SparkToro study is the clearest warning yet that “AI visibility” is not a stable thing you can measure with a single prompt and a neat little ranking. The outputs you get from ChatGPT, Google’s AI experiences and similar systems are probabilistic. Same question. Same model. Different list. Different order. Different length.
SparkToro ran thousands of real-world tests (600 volunteers, 12 prompts, 2,961 runs) and the results were chaotic. In their data, the chance of getting the same list twice across 100 runs is under 1 in 100 for some systems. Getting the same order is closer to 1 in 1,000.
[Full report here](https://sparktoro.com/blog/new-research-ais-are-highly-inconsistent-when-recommending-brands-or-products-marketers-should-take-care-when-tracking-ai-visibility/)
If you are using a tracking tool that implies “you rank #3 in ChatGPT for X”, treat that claim with suspicion.
**Why this happens: LLMs are not a search results page**
Even when a model is grounded with retrieval, the final response is still a statistical selection process. You are seeing a “distribution” of plausible answers, not a deterministic ordering of the best answers. SparkToro likens it to a lottery of candidates and argues that the interfaces should disclose that lists are randomised.
That maps to what we see operationally in AI commerce and Rufus-style discovery: one run is a sample, not the truth.
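To make that concrete, here is a toy simulation of weighted sampling from a candidate pool. It is not any vendor's actual ranking logic, just an illustration of why the same prompt, with the same underlying preferences, still yields a different list on every run. The brand names and weights are invented.

```python
import random

# Toy illustration only: a pool of candidate brands with relevance weights.
# Each "run" samples a short list from that pool, so the same prompt and the
# same weights still produce different lists in different orders.
CANDIDATES = {
    "Brand A": 0.30,
    "Brand B": 0.25,
    "Brand C": 0.20,
    "Brand D": 0.15,
    "Brand E": 0.10,
}

def one_run(k: int = 3) -> list[str]:
    """Sample k distinct brands, weighted by relevance, without replacement."""
    pool = dict(CANDIDATES)
    picks = []
    for _ in range(k):
        brands, weights = zip(*pool.items())
        choice = random.choices(brands, weights=weights, k=1)[0]
        picks.append(choice)
        del pool[choice]
    return picks

for i in range(3):
    print(f"Run {i + 1}: {one_run()}")
```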
**The practical implication: volume beats screenshots**
If outputs are probabilistic, then measurement has to look more like market research and less like SEO rank tracking.
**What matters most is:**
- High volume of queries per topic, because you need enough samples to smooth the randomness. One run tells you almost nothing.
- Trendlines over time, because direction is the signal. If your share of appearance is rising week over week, that is meaningful. If you are obsessing over a single day’s position, that is noise.
- Topic coverage, because prompt choice bias is huge. SparkToro also highlights how diverse real prompts are, which means a tiny prompt set can mislead you fast.
**Why individual model tracking often misleads**
A lot of teams are currently doing some version of this: pick 10 prompts, run them once, record brand mentions, make decisions.
**That approach breaks in three ways:**
- Sampling error: one run might exclude you purely due to randomness.
- False precision: ordered lists feel like rankings, but the ordering is often unstable.
- Over-interpretation: teams change budgets or content based on tiny movements that are not statistically real.
This is how you end up paying for dashboards that produce confident-looking charts built on shaky foundations.
**A better way to measure AI visibility**
If you want metrics that actually correlate with performance, build them around “share” and “momentum”.
**Build a topic library, not a prompt list**
Group prompts into intent clusters. Think categories, use cases and attributes. Example: “electrolyte drink” splits into hydration, endurance, low sugar, taste, stomach-friendly, caffeine-free, vegan, travel, pregnancy-safe, kid-safe.
Measure across the cluster, not one prompt.
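As a rough illustration, a topic library can be as simple as a nested mapping of topic, to intent cluster, to prompt variants. The cluster names and prompts below are hypothetical placeholders, not a recommended set.

```python
# Hypothetical topic library: topic -> intent cluster -> prompt variants.
TOPIC_LIBRARY = {
    "electrolyte drink": {
        "hydration": [
            "best electrolyte drink for everyday hydration",
            "what to drink to stay hydrated besides water",
        ],
        "endurance": [
            "best electrolyte drink for marathon training",
            "electrolytes for long bike rides",
        ],
        "low sugar": [
            "electrolyte drinks without added sugar",
        ],
        "caffeine free": [
            "caffeine-free electrolyte drink recommendations",
        ],
    },
}

def prompts_for_topic(topic: str) -> list[str]:
    """Flatten every cluster so runs cover the whole topic, not one prompt."""
    return [p for cluster in TOPIC_LIBRARY[topic].values() for p in cluster]
```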
**Run enough samples to get stability**
Repeated runs are not “gaming the system”. They are how you estimate probability. The more volatile the topic, the more runs you need.
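How many runs is “enough” depends on the precision you need. One rough way to size it, assuming you are estimating a mention rate (a simple proportion), is the standard sample-size formula for a proportion. The figures below are illustrative, not thresholds from the SparkToro study.

```python
import math

def runs_needed(expected_rate: float = 0.5, margin: float = 0.05,
                z: float = 1.96) -> int:
    """Approximate runs needed to estimate a mention rate within +/- margin
    at ~95% confidence (z = 1.96), using the normal approximation.
    expected_rate = 0.5 is the worst case and gives the largest n."""
    n = (z ** 2) * expected_rate * (1 - expected_rate) / (margin ** 2)
    return math.ceil(n)

print(runs_needed(margin=0.10))  # ~97 runs for +/- 10 percentage points
print(runs_needed(margin=0.05))  # ~385 runs for +/- 5 percentage points
```

In practice you may run fewer than the worst case and accept wider error bars, but the point stands: single-digit run counts cannot separate signal from noise.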
**Track share of appearance, not rank**
For each topic cluster, track:
- Mention rate: the % of runs where the brand appears
- Position banding: top 3 vs top 10 vs not mentioned (coarse bands only)
- Co-mention context: which competitors appear alongside you (a computation sketch follows this list)
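Here is a minimal sketch of how those three metrics might be computed from raw run data. The brand names and runs are made up for illustration.

```python
from collections import Counter

# Hypothetical raw data: each run is the ordered list of brands returned.
runs = [
    ["Brand A", "Brand B", "Brand C", "Brand D"],
    ["Brand B", "Brand A", "Brand E"],
    ["Brand C", "Brand D", "Brand B"],
]

def visibility_metrics(runs: list[list[str]], brand: str) -> dict:
    """Share-of-appearance metrics for one brand across many runs."""
    total = len(runs)
    appearances = [r for r in runs if brand in r]
    top3 = sum(1 for r in appearances if r.index(brand) < 3)
    co_mentions = Counter(b for r in appearances for b in r if b != brand)
    return {
        "mention_rate": len(appearances) / total,   # % of runs where brand appears
        "top3_rate": top3 / total,                  # coarse position band
        "co_mentions": co_mentions.most_common(5),  # who shows up alongside you
    }

print(visibility_metrics(runs, "Brand B"))
# {'mention_rate': 1.0, 'top3_rate': 1.0, 'co_mentions': [('Brand A', 2), ...]}
```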
**Only trust changes that persist**
Treat any single-week spike as suspect. Look for sustained movement across multiple periods, ideally with the same methodology and the same topic set.
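On top of requiring persistence across periods, one way to filter noise is a simple two-proportion z-test between periods. This is an added technique, not something the SparkToro study prescribes, and the run counts and rates below are invented for illustration.

```python
import math

def change_is_beyond_noise(hits_a: int, n_a: int, hits_b: int, n_b: int,
                           z_threshold: float = 1.96) -> bool:
    """Two-proportion z-test: is the shift in mention rate between two
    periods larger than sampling noise at roughly 95% confidence?"""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return se > 0 and abs(p_a - p_b) / se >= z_threshold

# 42% -> 49% mention rate on 200 runs per period: within noise.
print(change_is_beyond_noise(84, 200, 98, 200))   # False
# 42% -> 58% on 200 runs per period: likely a real shift, still worth
# confirming it persists into the following periods.
print(change_is_beyond_noise(84, 200, 116, 200))  # True
```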
**Separate measurement from optimisation**
Measurement tells you where you are winning or losing. Optimisation is then about increasing the probability you are selected.
**So what actually improves the probability of being recommended**
This study is a measurement wake-up call, but it also reinforces the direction of travel for optimisation.
If models are choosing from a “candidate pool”, your job is to be a high-confidence candidate.
That usually means:
- Clear entity signals: consistent naming, consistent product taxonomy and clean structured data on your site and major retailers.
- Repeatable descriptors: the phrases you want associated with you need to exist across third-party coverage, partner sites, product detail pages and expert content. SparkToro has written previously about how consistent bios and repeated descriptors show up in AI answers.
- Trusted corroboration: reviews, comparisons and credible publisher mentions matter because they shape the corpus and the retrieval layer.
In other words, you do not win by chasing one magic prompt. You win by building an information footprint that makes you the obvious choice across many prompts.
**What Lmo7 takes from this**
If you are a brand leader, the takeaway is not “AI visibility is pointless”. The takeaway is “AI visibility needs proper measurement”.
We treat AI answers as probabilistic surfaces. That means our tracking is designed around high volume sampling and trend detection, not fragile ranking claims. Then we connect those trends back to the levers brands can actually pull: product data, retail content, authoritative coverage and semantic alignment across the web.
If you want to pressure-test your current AI visibility reporting, ask one question: would this still be true if we re-ran it 100 times next week?