LLM Optimisation
AEO Needs Testing, Not Hopes: Why Control-vs-Test Is Non-Negotiable
Short version: In Answer Engine Optimisation (AEO), guesses don’t move revenue—experiments do. A hundred “answers” you publish without a control tell you nothing. A hundred you publish with a proper control, repeated across cycles, tell you what actually causes lifts in Share of Model, traffic from AI, and sell-through.
2 November 2025
15 min read
Why AEO must be tested like a product change
1) Models change daily.
LLMs and retailer assistants (Rufus, ChatGPT, Gemini, Perplexity) ship silent updates. Without controls, you’ll attribute a platform-wide uplift (or drop) to your content tweaks.
2) Seasonality overwhelms anecdotes.
Category demand swings with weather, events, and payday cycles. If you don’t hold out similar SKUs/queries as a control, you’re measuring the season—not your work.
3) Answer inclusion is binary.
In AEO the win is getting named inside the answer. Small structural tweaks (attributes, proofs, policy clarity) can tip inclusion. You need evidence that your change, not noise, flipped the outcome.
4) Confounders are everywhere.
Retail media, price changes, stock, competitor edits, PR hits—each can move results. Controls help isolate the content/signal effect from everything else.
5) Reproducibility beats luck.
One win could be randomness. Wins that repeat across prompt sets, models, and weeks are strong evidence of a causal, scalable effect.
The experiment you actually need
Define the unit
For discovery assistants: the prompt (question) is the unit.
For Amazon Rufus/PDP: the SKU (or SKU-prompt pair) is the unit.
Create two comparable groups
Control: no change.
Test: apply your AEO changes (attributes clarified, proofs added, FAQ/Q&A seeded, on-image spec overlays, policy blocks).
Match by price band, review count, lifecycle, and current visibility.
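To make the matching concrete, here is a minimal sketch in Python of a stratified 50/50 assignment. It assumes each unit is a simple record carrying the matching fields above; the bucket thresholds, field names, and the assign_arms helper are illustrative, not a specific tool's API.

```python
import random
from collections import defaultdict

def assign_arms(units, seed=42):
    """Stratified 50/50 split: units are grouped by the matching fields
    (price band, review-count bucket, lifecycle) before being randomised
    into control vs test, so the two arms stay comparable."""
    random.seed(seed)
    strata = defaultdict(list)
    for u in units:
        # Bucket by the matching fields; the 50-review threshold is illustrative.
        review_bucket = "low" if u["review_count"] < 50 else "high"
        strata[(u["price_band"], review_bucket, u["lifecycle"])].append(u)

    control, test = [], []
    for members in strata.values():
        random.shuffle(members)
        half = len(members) // 2
        control.extend(members[:half])
        test.extend(members[half:])
    return control, test

# Illustrative SKU records (hypothetical identifiers and fields):
skus = [
    {"sku": "SUN-50-200ML", "price_band": "mid", "review_count": 120, "lifecycle": "core"},
    {"sku": "SUN-30-200ML", "price_band": "mid", "review_count": 35,  "lifecycle": "core"},
    # ... the rest of the pool
]
control, test = assign_arms(skus)
```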
Pre-register the outcomes
Primary: Share of Model (answer inclusion/mention), answer position (primary vs “also consider”).
Secondary: assistant-referred traffic, marketplace add-to-cart, ordered units, CVR.
Guardrails: returns rate, policy flags, content compliance.
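Pre-registration can be as lightweight as a config committed to your repo before the test ships. The structure below is a sketch only; the field names mirror the outcomes listed above and are not a prescribed format.

```python
# Sketch of a pre-registered analysis plan, written down (and version-controlled)
# before any data comes in, so the outcomes can't be cherry-picked afterwards.
PREREGISTRATION = {
    "primary": ["share_of_model", "answer_position"],
    "secondary": ["assistant_referred_sessions", "add_to_cart", "ordered_units", "cvr"],
    "guardrails": ["returns_rate", "policy_flags", "content_compliance"],
    "analysis": "difference_in_differences",
    "baseline_weeks": 2,
    "measurement_weeks": 4,
}
```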
Run enough samples
Don’t test “a few” prompts. Ship 100 in control vs 100 in test (or as many as makes sense in your niche) to get a stable read.
Repeat in multiple cycles (e.g., weekly for 4–6 weeks). Model drift is real; durability matters.
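Because inclusion is binary, a quick two-proportion power calculation tells you whether 100 prompts per arm is enough for the lift you care about. The sketch below uses the standard formula; the baseline inclusion rate and target lift are illustrative numbers you would replace with your own.

```python
from scipy.stats import norm

def prompts_per_arm(p_control, p_test, alpha=0.05, power=0.8):
    """Two-proportion sample size: how many prompts per arm are needed to
    detect a lift from p_control to p_test in answer-inclusion rate."""
    z_a = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_b = norm.ppf(power)           # desired statistical power
    variance = p_control * (1 - p_control) + p_test * (1 - p_test)
    n = ((z_a + z_b) ** 2 * variance) / (p_control - p_test) ** 2
    return int(n) + 1

# Illustrative inputs: 20% baseline inclusion, aiming to detect a lift to 35%.
print(prompts_per_arm(0.20, 0.35))  # ≈ 136 prompts per arm with these inputs
```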
Randomise & stagger
Randomly assign prompts/SKUs to control vs test.
Stagger rollout across days to blunt day-of-week effects.
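A sketch of the staggering step, assuming the test arm is the list of units from the assignment above; the five-day round-robin and the stagger_rollout helper are illustrative.

```python
import random
from datetime import date, timedelta

def stagger_rollout(test_units, start: date, days: int = 5, seed: int = 7):
    """Spread test-arm launches across several consecutive days (round-robin
    after shuffling) so day-of-week effects are diluted, not concentrated."""
    random.seed(seed)
    units = list(test_units)
    random.shuffle(units)
    schedule = {}
    for i, unit in enumerate(units):
        launch_day = start + timedelta(days=i % days)
        schedule.setdefault(launch_day, []).append(unit)
    return schedule

# e.g. stagger_rollout(test, date(2025, 11, 3)) -> {launch date: [units shipping that day], ...}
```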
Hold prices and ads steady (where possible)
If media must change, change it equally for both arms or record it precisely and adjust in analysis.
Measure pre → post
Baseline 1–2 weeks, ship changes, measure 2–4 weeks.
Focus on delta vs control (difference-in-differences), not raw lifts.
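Here is a minimal difference-in-differences calculation on inclusion rates, assuming you log one row per prompt (or SKU) per period with a 0/1 included flag; the column names are illustrative.

```python
import pandas as pd

def did(df: pd.DataFrame) -> float:
    """Difference-in-differences on answer inclusion.
    Expects columns: arm ('control'/'test'), period ('pre'/'post'),
    included (0/1: was the brand/SKU named in the answer?)."""
    rates = df.groupby(["arm", "period"])["included"].mean()
    test_delta = rates[("test", "post")] - rates[("test", "pre")]
    control_delta = rates[("control", "post")] - rates[("control", "pre")]
    return test_delta - control_delta

# Toy log of prompt-level observations, purely to show the shape of the data:
log = pd.DataFrame({
    "arm":      ["test", "test", "control", "control"] * 50,
    "period":   ["pre", "post"] * 100,
    "included": [0, 1, 0, 0] * 50,
})
print(did(log))  # positive = lift attributable to the AEO change, net of drift
```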
What to change in the “test” arm (AEO specifics)
Attributes made explicit: units, thresholds, certifications (e.g., SPF 50 UVA/UVB; ESD ASTM level; noise ≤ 60 dB).
Proof objects: lab/test durations, standards, third-party citations (where allowed).
Policy clarity: warranty/returns, safety, allergens—crawlable and consistent.
FAQ/Q&A: seed top objections from Lexym-style clusters; keep answers short and quote-ready.
On-image callouts: 3–5 decisive specs mirrored in bullets (avoid contradictions).
Consistency pass: align D2C ↔ marketplace ↔ press to remove conflicts.
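The consistency pass is easy to script if you keep a canonical spec per SKU. The sketch below assumes per-channel copies exported as simple records; the channel names, fields, and values are illustrative.

```python
def find_conflicts(canonical: dict, channels: dict) -> list:
    """Compare a canonical spec sheet against per-channel copies and return
    any attribute whose value diverges (a likely source of contradictory
    answers or dropped citations)."""
    conflicts = []
    for channel, spec in channels.items():
        for attr, value in canonical.items():
            if spec.get(attr) != value:
                conflicts.append((channel, attr, value, spec.get(attr)))
    return conflicts

# Illustrative data: the marketplace listing disagrees on pack size.
canonical = {"spf": "SPF 50", "protection": "UVA/UVB", "volume": "200 ml"}
channels = {
    "d2c":         {"spf": "SPF 50", "protection": "UVA/UVB", "volume": "200 ml"},
    "marketplace": {"spf": "SPF 50", "protection": "UVA/UVB", "volume": "150 ml"},
}
print(find_conflicts(canonical, channels))
# [('marketplace', 'volume', '200 ml', '150 ml')]
```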
How to know it “worked”
Answer presence rises in test, not control.
Your brand/SKU is mentioned more often across the test prompt set.
Assistant-referred traffic improves in test.
Even if the volume is small, the intent should be higher (better CVR).
Marketplace sell-through ticks up on test SKUs.
Add-to-cart and ordered units improve vs control after content ships.
Language echo in reviews/Q&A.
Shoppers repeat your terms and proof—evidence your framing is landing.
If these don’t move vs control, revert, iterate, and retest.
Reproduce it (or it didn’t happen)
Across models: rerun in ChatGPT, Gemini, Claude, Perplexity, Rufus.
Across categories: take the same play to a second line (e.g., sunscreen → aftersun).
Across time: re-measure a few weeks later to confirm durability.
Consistent wins across surfaces and weeks = a playbook, not a fluke.
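A sketch of the reproduction loop. It assumes a query_assistant(model, prompt) helper that returns an answer string for a given assistant; that helper is hypothetical and stands in for whichever API client or monitoring tool you use, and the brand terms shown are placeholders.

```python
def share_of_model(prompts, brand_terms, models, query_assistant):
    """Re-run the same prompt set across several assistants and record how
    often the brand is named in the answer. query_assistant(model, prompt)
    is a placeholder for your own API client or monitoring tool."""
    results = {}
    for model in models:
        included = 0
        for prompt in prompts:
            answer = query_assistant(model, prompt).lower()
            if any(term.lower() in answer for term in brand_terms):
                included += 1
        results[model] = included / len(prompts)
    return results

# e.g. share_of_model(test_prompts, ["AcmeSun", "Acme SPF 50"],
#                     ["chatgpt", "gemini", "claude", "perplexity", "rufus"],
#                     query_assistant)
```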
Common testing mistakes (and how to avoid them)
Shipping multiple changes at once → Bundle related edits, but log each change so you can replicate the winners.
Changing prices/ads mid-test → If unavoidable, mirror changes across arms or exclude those windows.
Too few prompts/SKUs → Underpowered tests produce false negatives (or false positives).
Declaring victory on screenshots → Track structured metrics; store evidence.
Not versioning content → Keep a canonical spec sheet and change log to roll back quickly.
A simple AEO test plan you can copy
Build a 200-prompt set (or 200 SKU-prompt pairs).
Randomly split 50/50 control vs test.
Baseline for 2 weeks.
Ship one AEO package to test (attributes + proofs + FAQ + images + policy).
Measure 4 weeks; compute difference-in-differences on Share of Model, assistant traffic, and sell-through.
Replicate on a second category or model the next month.
Bottom line
AEO is a science problem, not a vibes problem. 100 untested answers teach you nothing.
100 tested answers, reproduced multiple times, become a growth engine.
Control vs test is how you separate signal from noise, turn learnings into a playbook, and compound results across AI answers and retail sales.