How to Test Whether Your Product Descriptions Influence AI Shopping Recommendations

Want to test whether your product descriptions influence AI shopping recommendations? As shoppers move from scrolling ranked link lists to asking an assistant “what should I buy?”, the question that matters is no longer just “do I rank on the results page” but “does the AI put my product in the short list it recommends?” This guide gives you a concrete, repeatable way to measure exactly that — the definitions you need, the three content levers to try, a clean experiment protocol, a results table for calibration, common pitfalls, and an FAQ. The method and benchmark come from our paper, CORE (arXiv:2602.03608).

Why this matters

LLM-based search and generative engines do not simply return ten blue links. They retrieve candidate content and then synthesize a small, ordered set of recommendations. That means two things for a seller or creator. First, being retrieved is necessary but not sufficient — a product can be retrieved and still never surface in the final answer. Second, the text the engine retrieves about your product (its description, specs, and reviews) is a lever you actually control, and it can shift where you land in the generated ranking. Testing this rigorously tells you whether, and how much, your descriptions move the needle.

Key definitions

AI shopping recommendation — a short, ordered set of products an LLM-based search system recommends in its generated answer.
Retrieved content — the product text (descriptions, specs, reviews) the search engine pulls in before the LLM writes its recommendation. This is the lever you control.
Optimization content — text added to the retrieved content to influence the ranking. CORE studies three types (detailed below).
Promotion Success Rate (PSR) — how often the target product is promoted into the top-K (Top-1/3/5) of the generated ranking.
Baseline vs treatment — the recommendation outcome before your change (baseline) vs after it (treatment); the difference is the effect.
ProductBench — our benchmark built for exactly this measurement (below).
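As a minimal sketch of the PSR definition above, the following assumes each run yields an ordered list of product identifiers, and that a run counts as a success when the target sits at position K or better. The function names and the sample rankings are hypothetical illustrations, not the paper's code, and the paper's exact formula may differ (for example in how it treats targets that already start inside the top-K).

```python
from typing import Optional, Sequence


def rank_of(target: str, ranking: Sequence[str]) -> Optional[int]:
    """Return the 1-based position of target in ranking, or None if it is absent."""
    for position, product_id in enumerate(ranking, start=1):
        if product_id == target:
            return position
    return None


def promotion_success_rate(target: str, rankings: Sequence[Sequence[str]], k: int) -> float:
    """Fraction of generated rankings in which target appears at position k or better.

    Assumption: a run where the target is missing from the answer counts as a failure.
    """
    if not rankings:
        raise ValueError("need at least one ranking")
    hits = 0
    for ranking in rankings:
        position = rank_of(target, ranking)
        if position is not None and position <= k:
            hits += 1
    return hits / len(rankings)


# Hypothetical rankings from four repeated runs of one query; "A" is the target.
runs = [
    ["B", "A", "C", "D", "E"],
    ["A", "B", "C", "D", "E"],
    ["B", "C", "D", "A", "E"],
    ["B", "C", "D", "E"],  # target missing from the generated answer
]

for k in (1, 3, 5):
    print(f"PSR@Top-{k}: {promotion_success_rate('A', runs, k):.2f}")
```

On this sample the target reaches Top-1 in one of four runs, Top-3 in two, and Top-5 in three, so the script prints 0.25, 0.50, and 0.75.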

The three content levers (optimization-content types)

CORE studies three distinct kinds of content you can add to what the engine retrieves. Try them separately so you can attribute any effect:

String-based — compact, targeted textual additions to the description.
Reasoning-based — content that surfaces comparative reasoning the model can use when it decides an order (e.g. clearly stated, factual trade-offs).
Review-based — content framed around review-style evidence.

Keep all additions truthful to the product; the goal of the test is to measure influence, not to misrepresent the item.
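One way to keep the three levers separable is to generate one treatment variant per content type from the same base description. The sketch below is an assumption about how you might organize this, not CORE's generation method: the `Variant` class, `build_variants` helper, and the example texts are all hypothetical, and real additions should be written from facts you can verify about your own product.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str         # "baseline", "string", "reasoning", or "review"
    description: str  # the text the engine would retrieve for the target product


def build_variants(base_description: str, additions: dict[str, str]) -> list[Variant]:
    """Return the baseline plus one variant per content type.

    Each treatment variant appends exactly one addition to the unchanged base
    text, so any shift in ranking can be attributed to that addition alone.
    """
    variants = [Variant("baseline", base_description)]
    for content_type, extra_text in additions.items():
        combined = f"{base_description.rstrip()} {extra_text.strip()}"
        variants.append(Variant(content_type, combined))
    return variants


# Hypothetical product text, for illustration only.
base = "Wireless earbuds with ear hooks and an IPX5 water-resistance rating."
additions = {
    "string": "Designed for running: secure fit, sweat resistant, 8-hour battery.",
    "reasoning": "Compared with hookless earbuds, ear hooks trade a larger case for a more secure fit while running.",
    "review": "Reviewers who run regularly report that the ear hooks stay in place on long runs.",
}

variants = build_variants(base, additions)
for variant in variants:
    print(f"{variant.name}: {variant.description}")
```

Running each variant as its own condition, against the same query and competitor set, gives you one baseline and three independent treatments to compare.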

How to test it, step by step

Define the target and query, and fix the competitor set. Pick the target product and the shopping query (e.g. “best wireless earbuds for running”), and decide the fixed set of competing products you will compare against. Holding competitors fixed is what makes the test interpretable.
Collect a baseline. Record the baseline recommendation set — for example the top-10 from a real shopping interface, the way ProductBench uses Amazon’s top-10 — before you change anything.
Vary only the retrieved content. Change just the product description / retrieved content using one optimization-content type at a time (string, reasoning, or review). Change one thing per run so the effect is attributable.
Query several search-enabled LLMs, with repeats. Send the same shopping query to multiple models and record where your product appears each time. Repeat each condition several times, because model outputs vary from one sample to the next.
Measure promotion and report the delta. Compute Promotion Success Rate @Top-1/3/5 for baseline vs treatment, per model, and report the difference (treatment minus baseline) rather than the absolute rate.
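The steps above can be sketched as a small harness. Everything here is an assumption made for illustration: `RankFn` is a hypothetical adapter you would implement for each search-enabled LLM you can query, and `make_stub_model` is a toy scorer (query-word overlap plus random noise) that stands in for a real model so the sketch runs offline. It is not how any real engine ranks products, and the product texts are invented.

```python
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical adapter type: (query, retrieved docs keyed by product id) -> ordered product ids.
RankFn = Callable[[str, Dict[str, str]], List[str]]


def psr_at_k(target: str, rankings: Sequence[Sequence[str]], k: int) -> float:
    """Fraction of runs in which target is among the first k recommended products."""
    return sum(target in ranking[:k] for ranking in rankings) / len(rankings)


def make_stub_model(seed: int, noise: float) -> RankFn:
    """Toy stand-in for a search-enabled LLM: query-word overlap plus random noise."""
    rng = random.Random(seed)

    def rank(query: str, docs: Dict[str, str]) -> List[str]:
        query_words = set(query.lower().split())
        scores = {
            pid: len(query_words & set(text.lower().split())) + noise * rng.random()
            for pid, text in docs.items()
        }
        return sorted(docs, key=lambda pid: scores[pid], reverse=True)

    return rank


def psr_deltas(models: Dict[str, RankFn], query: str, competitors: Dict[str, str],
               target: str, baseline_text: str, treatment_text: str,
               repeats: int = 20, ks: Sequence[int] = (1, 3, 5)) -> Dict[str, Dict[int, float]]:
    """Per model, PSR@k under treatment minus PSR@k under baseline."""
    deltas = {}
    for name, rank_fn in models.items():
        runs = {}
        for condition, text in (("baseline", baseline_text), ("treatment", treatment_text)):
            docs = {**competitors, target: text}  # only the target's text differs
            runs[condition] = [rank_fn(query, docs) for _ in range(repeats)]
        deltas[name] = {
            k: psr_at_k(target, runs["treatment"], k) - psr_at_k(target, runs["baseline"], k)
            for k in ks
        }
    return deltas


query = "best wireless earbuds for running"
competitors = {  # fixed competitor set, identical in both conditions
    "comp-1": "wireless earbuds with long battery life",
    "comp-2": "lightweight earbuds for running and gym use",
    "comp-3": "noise cancelling wireless headphones",
    "comp-4": "budget earbuds with a wired option",
    "comp-5": "best selling wireless earbuds",
}
baseline_text = "in-ear headphones with ear hooks"
treatment_text = "wireless earbuds with ear hooks for running"
models = {"stub-a": make_stub_model(1, noise=2.0), "stub-b": make_stub_model(2, noise=4.0)}

deltas = psr_deltas(models, query, competitors, "target", baseline_text, treatment_text)
for name, by_k in deltas.items():
    print(name, {k: round(v, 2) for k, v in by_k.items()})
```

To run this against real systems, replace each stub with a function that sends the query and retrieved texts to a model and parses the ordered product list out of its answer. The two stubs use different noise levels only to mimic the point that effects differ by model.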

How to keep the test valid (pitfalls)

Change one variable at a time. If you alter the description and the query and the competitor set together, you cannot attribute the effect.
Use repeats. A single run is a noisy sample; average several runs per condition.
Test more than one model. Effects differ across models (the paper uses four), so a result on one engine may not transfer.
Trust deltas, not absolute numbers. Absolute citation/recommendation rates vary with prompt and model; the with-vs-without difference is the trustworthy signal.
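To judge whether a delta is larger than run-to-run noise, one common option (an assumption here, not a procedure taken from the paper) is a percentile bootstrap over the per-run outcomes. The sketch below takes one 0/1 outcome per run (1 if the target reached the top-K) for each condition and returns the observed delta with an approximate 95% interval. The helper name and the sample counts are hypothetical.

```python
import random
from typing import Sequence, Tuple


def bootstrap_delta_ci(baseline_hits: Sequence[int], treatment_hits: Sequence[int],
                       resamples: int = 2000, alpha: float = 0.05,
                       seed: int = 0) -> Tuple[float, float, float]:
    """Return (observed delta, lower bound, upper bound) for treatment minus baseline PSR.

    Each input holds one 0/1 outcome per run. Baseline and treatment runs are
    resampled independently, with replacement, to approximate sampling noise.
    """
    rng = random.Random(seed)

    def rate(hits: Sequence[int]) -> float:
        return sum(hits) / len(hits)

    observed = rate(treatment_hits) - rate(baseline_hits)
    simulated = []
    for _ in range(resamples):
        base_sample = rng.choices(baseline_hits, k=len(baseline_hits))
        treat_sample = rng.choices(treatment_hits, k=len(treatment_hits))
        simulated.append(rate(treat_sample) - rate(base_sample))
    simulated.sort()
    lower = simulated[int((alpha / 2) * resamples)]
    upper = simulated[min(resamples - 1, int((1 - alpha / 2) * resamples))]
    return observed, lower, upper


# Hypothetical outcomes for one model at Top-3, 20 runs per condition.
baseline_hits = [1] * 4 + [0] * 16    # target in top-3 in 4 of 20 baseline runs
treatment_hits = [1] * 14 + [0] * 6   # target in top-3 in 14 of 20 treatment runs

observed, lower, upper = bootstrap_delta_ci(baseline_hits, treatment_hits)
print(f"delta = {observed:.2f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
```

If the interval excludes zero, the delta is unlikely to be a product of sampling noise alone. With only a handful of runs per condition the interval will be wide, which is a practical reason to repeat each condition more than a few times.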

Reference results (from CORE)

For calibration, here is what CORE reports on ProductBench. Use these figures as reference points when you run your own test:

Metric (avg across 15 categories)	CORE
Promotion Success Rate @Top-5	91.4%
Promotion Success Rate @Top-3	86.6%
Promotion Success Rate @Top-1	80.3%
LLMs evaluated	GPT-4o, Gemini-2.5, Claude-4, Grok-3
Benchmark	ProductBench — 15 categories × 200 products, Amazon top-10

The paper reports that CORE outperforms existing ranking-manipulation methods while preserving the fluency of the optimized content. Treat these figures as upper-bound reference points from a controlled study, not as guarantees for any single product.

How this relates to GEO and SEO

Testing whether descriptions change ranking is distinct from getting included at all (the focus of Generative Engine Optimization) and from classic SEO (ranking pages in link lists). If you want the side-by-side, see our comparison: The Newest Techniques for Improving Visibility in LLM-Based Search: SEO vs GEO vs CORE.

FAQ

Do I need access to the LLM to test this? No — it is a black-box test: you only change retrieved content and observe the generated recommendations.

What should I measure? Promotion Success Rate @Top-1/3/5 — whether your product is promoted into the top of the generated ranking, with vs without the change.

How many models should I test on? Several — effects vary by model, so don’t rely on one; the paper evaluates GPT-4o, Gemini-2.5, Claude-4, and Grok-3.

How do I know the effect is real and not noise? Compare baseline vs treatment, repeat runs, and hold the query and competitor set fixed; report the delta rather than an absolute rate.

Is there a ready-made benchmark? Yes — ProductBench (15 categories × 200 products, with Amazon top-10 recommendations).

Which LLMs did the paper test? GPT-4o, Gemini-2.5, Claude-4, and Grok-3.

Read the paper

Controlling Output Rankings in Generative Engines for LLM-based Search (CORE) — arXiv:2602.03608. For the broader question of being included in generated answers (vs. ranking position), see GEO (arXiv:2311.09735).