Reading Solver Benchmarks Like an Adversary: A Buyer's Checklist

A benchmark is a sales document until proven otherwise. That is not cynicism; it is the correct default. Every number a vendor shows you was chosen, and the interesting question is always what was not shown. Here is how to read an optimization benchmark the way a good due-diligence analyst does — assuming nothing, and asking the questions the slide was designed to skip.

The checklist, in one breath

Is it your workload? A speedup on a toy problem predicts nothing about your book.
Cold start or warm? The two measure different things and are routinely conflated.
What's the quality gap, and where are the losses? A fast wrong answer is not a win, and a benchmark with no losing cases is hiding them.
Does each number trace to a recorded run? A figure with no underlying artifact is a claim, not a measurement.

The single most useful habit when reading any performance claim is to ask what question the number actually answers, and then whether that is the question you care about. "Forty times faster" leaves three blanks open: faster at what, on what, measured how? Until those blanks are filled, the claim says nothing about whether the solver will clear your book by the open. Vendors know this, which is why the blanks are so often left artfully empty. Your job is to fill them in, out loud, and watch what happens to the number.

1. Is it a matched workload?

The first and most important question is whether the benchmark was run on a problem that resembles yours. A solver can be dazzling on a small, clean, synthetic problem and ordinary or worse on a large, messy, real one, and the gap between those two worlds is exactly where your money lives. A credible benchmark uses a realistic universe, realistic constraints, realistic costs, and realistic tax rules. A benchmark that is not credible uses whatever makes the number biggest. If you cannot tell which universe was used, assume the flattering one was chosen, because it was.

The strongest form of this test is to stop reading the vendor's benchmark entirely and run your own — your universe, your constraints, your current production system as the baseline. A matched-workload pilot is the only benchmark that cannot be gamed, because you control every variable.

2. Cold start or repeated run?

Performance on the first solve of a fresh problem and performance on the thousandth solve of a recurring one are different measurements, and conflating them is the oldest trick in the book. A system can look slow cold and fast warm, or vice versa. Both numbers can be true, and either can mislead if it does not match your actual usage. If your workflow re-solves similar problems every night, the repeated-run number is the honest one; if you face genuinely fresh problems each time, the cold-start number is. A benchmark that does not tell you which regime it measured is not yet a benchmark.
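One way to keep the two regimes apart in your own pilot is to time them separately and report both. The sketch below assumes a hypothetical `solve(problem, previous_solution)` callable, where passing `None` means a cold start. Real solvers expose warm starts in different ways, so treat the interface as an illustration, not as any product's API.

```python
import time
from statistics import median
from typing import Callable, List, NamedTuple, Optional


class Timing(NamedTuple):
    cold_seconds: float         # first solve, no prior state
    warm_median_seconds: float  # median over the repeated solves
    warm_runs: int              # how many repeated solves were timed


def time_cold_and_warm(
    solve: Callable[[dict, Optional[dict]], dict],
    problems: List[dict],
) -> Timing:
    """Time one cold solve, then time each warm re-solve separately.

    `solve` is a hypothetical interface assumed for this sketch:
    solve(problem, None) is a cold start, solve(problem, previous)
    reuses the previous solution as a warm start.
    """
    if len(problems) < 2:
        raise ValueError("need at least two problems to measure a warm run")

    start = time.perf_counter()
    solution = solve(problems[0], None)
    cold_seconds = time.perf_counter() - start

    warm_seconds = []
    for problem in problems[1:]:
        start = time.perf_counter()
        solution = solve(problem, solution)
        warm_seconds.append(time.perf_counter() - start)

    return Timing(cold_seconds, median(warm_seconds), len(warm_seconds))
```

Reporting the cold time and the warm median side by side, along with the number of warm runs, makes it obvious which regime a headline figure came from.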

Figure 1: Speed without quality is a mirage

Solver A: runtime very fast, quality gap large ✗
Solver B: runtime fast, quality gap tiny ✓

Solver A "wins" on the runtime slide and loses where it matters. Always demand the quality gap next to the speed, on the same problem.

3. What is the quality gap — and against what?

Speed means nothing without a quality bar, because the easiest way to be fast is to be wrong. Any serious benchmark reports, alongside runtime, how far the solver's answer sat from a trustworthy reference — ideally an exact solver on a problem small enough to solve exactly — and states the pass/fail threshold before interpreting the result. If a benchmark shows you speed and is silent on quality, the silence is the finding. Ask: gap against what reference, measured how, and what tolerance was agreed in advance?
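As a minimal sketch of what "gap against a reference, with a tolerance agreed in advance" can look like, the snippet below computes a relative gap for a minimization problem. The function names and the 0.1% tolerance are assumptions chosen for illustration; the right threshold is whatever you and the vendor fix before the run.

```python
def relative_gap(candidate_objective: float, reference_objective: float) -> float:
    """Relative gap of a candidate against a reference, for minimization.

    A positive value means the candidate's objective is worse (higher)
    than the reference's.
    """
    scale = max(abs(reference_objective), 1e-12)  # guard against division by zero
    return (candidate_objective - reference_objective) / scale


def passes_quality_bar(
    candidate_objective: float,
    reference_objective: float,
    tolerance: float,
) -> bool:
    """True if the candidate is within the pre-agreed tolerance of the reference."""
    return relative_gap(candidate_objective, reference_objective) <= tolerance


# Illustrative threshold only: fix this before looking at any results.
TOLERANCE = 0.001  # 0.1% relative gap

reference = 100.0  # e.g. from an exact solver on a small instance
print(passes_quality_bar(100.05, reference, TOLERANCE))  # within tolerance
print(passes_quality_bar(103.0, reference, TOLERANCE))   # outside tolerance
```

The point of declaring `TOLERANCE` before the comparison is that the pass/fail line cannot drift toward whatever the solver happened to achieve.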

4. Where are the losing cases?

This is the question that separates honest evidence from marketing, and it is the easiest to ask: show me where you lose. Every real system loses somewhere — a regime, a problem size, a constraint pattern where a competitor does better or an exact method wins. A benchmark that contains no losing cases has not discovered that its product is perfect; it has filtered them out, and a vendor who filters out the losses will filter out other inconvenient truths too. Paradoxically, the presence of honest losses is the strongest signal that the wins are real.
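If you hold per-case results from a pilot, surfacing the losses takes only a few lines. The sketch below assumes a hypothetical `CaseResult` record and a minimization objective; the three cases and their numbers are made up purely to show the filter working.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CaseResult:
    name: str
    candidate_seconds: float
    baseline_seconds: float
    candidate_objective: float  # minimization: lower is better
    baseline_objective: float


def losing_cases(results: List[CaseResult]) -> List[CaseResult]:
    """Return every case where the candidate is slower or has a worse objective."""
    return [
        r for r in results
        if r.candidate_seconds > r.baseline_seconds
        or r.candidate_objective > r.baseline_objective
    ]


# Made-up data for illustration only.
results = [
    CaseResult("small_clean", 0.2, 4.0, 10.0, 10.0),
    CaseResult("large_messy", 9.0, 6.0, 52.0, 50.0),
    CaseResult("tight_constraints", 1.0, 3.0, 31.5, 30.0),
]

for case in losing_cases(results):
    print(case.name)
```

Note that `tight_constraints` counts as a loss even though the candidate was faster there, because its objective was worse. A report that only filters on runtime would have hidden it.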

5. Does the number trace to a recorded run?

Finally, ask whether each figure traces to a single, recorded, reproducible run, or whether it is an average of averages, a best-of-many, or a rounded-up composite. Stitched and cherry-picked numbers are the norm in this industry, and they fall apart the moment you ask for the underlying artifact. "Can I see the run that produced this?" is a devastatingly effective question, because an honest benchmark has an answer and a dishonest one changes the subject.
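One possible shape for "the run that produced this" is a small record written at the time of the solve. The sketch below is an assumption about what such an artifact might contain, not a standard; the field names are hypothetical, and it assumes the problem can be serialized as JSON.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_run_record(
    problem: dict,
    solver_name: str,
    solver_version: str,
    seed: int,
    runtime_seconds: float,
    objective: float,
) -> dict:
    """Build a record that ties one reported figure to one specific run."""
    # Canonical JSON (sorted keys) so the same problem always hashes the same.
    # Assumes `problem` contains only JSON-serializable values.
    canonical = json.dumps(problem, sort_keys=True).encode("utf-8")
    return {
        "input_sha256": hashlib.sha256(canonical).hexdigest(),
        "solver": solver_name,
        "solver_version": solver_version,
        "seed": seed,
        "runtime_seconds": runtime_seconds,
        "objective": objective,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
```

With a record like this behind every number on a slide, a best-of-many or an average of averages becomes visible immediately: either one record matches the figure, or none does.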

None of these questions are hostile; they are simply the questions a benchmark should already have answered. A vendor confident in their evidence will welcome every one of them and have the artifacts ready. A vendor who gets uncomfortable when you ask where the losses are has just told you something more valuable than any slide. Read accordingly.
