feat(benchmarks): unified multi-backend benchmark suite (#5) by AsherNoble · Pull Request #34 · marqov-dev/marqov-sdk

AsherNoble · 2026-06-07T08:41:31Z

Summary

Closes #5.

Adds benchmarks/suite.py — a unified harness that runs a fixed set of reference
circuits (Bell, 3-qubit GHZ, deterministic depth-5 random) against any configured
executor and prints the comparison table documented in CONTRIBUTING.md §5.

- Zero credentials out of the box: run python benchmarks/suite.py --executor local --shots 1000 --seed 1234.
- Executor-agnostic core: run_suite() accepts any name → BaseExecutor mapping,
  so cloud backends can be benchmarked programmatically; the CLI wires only the
  zero-credential local executor.
- Exact §5 output: bordered markdown table; JSON outcomes preserved in
  count-descending order (no sort_keys, which would corrupt the ranking).
- Atomic, per-backend error handling: if any circuit raises, the whole backend
  is skipped (no partial rows leak), the backend and failing circuit are logged to
  stderr, and the run continues. Exits non-zero only when every backend fails, so CI
  can flag a fully broken run.
- Reproducible: --seed seeds both the random circuit and the local sampler;
  exact outcome counts are pinned in tests against the locked numpy / quantumflow versions.
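
The atomic skip-and-log behaviour described above can be sketched as follows. This is a minimal illustration, not the code in benchmarks/suite.py: run_suite_sketch, the Executor callable type, and the circuit names are hypothetical stand-ins, and the real BaseExecutor interface is not shown in this description. Treating an empty executor mapping as exit code 0 is also an assumption.

```python
import sys
from typing import Callable, Mapping

# Hypothetical stand-in: an "executor" is any callable that takes a circuit
# name and returns a counts dict. The real BaseExecutor API may differ.
Executor = Callable[[str], dict[str, int]]
Row = tuple[str, str, dict[str, int]]

CIRCUITS = ("bell", "ghz3", "random_d5")  # assumed labels for the three circuits


def run_suite_sketch(executors: Mapping[str, Executor]) -> tuple[list[Row], int]:
    """Return (rows, exit_code); a backend contributes all of its rows or none."""
    rows: list[Row] = []
    failed = 0
    for name, execute in executors.items():
        backend_rows: list[Row] = []  # buffered so a late failure leaks nothing
        try:
            for circuit in CIRCUITS:
                backend_rows.append((name, circuit, execute(circuit)))
        except Exception as exc:
            # Log the backend and the failing circuit, then move on.
            print(f"skipping {name}: {circuit} failed: {exc}", file=sys.stderr)
            failed += 1
            continue
        rows.extend(backend_rows)
    # Non-zero only when every configured backend failed (assumption: an
    # empty mapping counts as success).
    exit_code = 1 if executors and failed == len(executors) else 0
    return rows, exit_code


def good(circuit: str) -> dict[str, int]:
    return {"00": 500, "11": 500}


def flaky(circuit: str) -> dict[str, int]:
    if circuit == "random_d5":  # fails on the last circuit only
        raise RuntimeError("boom")
    return {"00": 1000}


rows, code = run_suite_sketch({"good": good, "flaky": flaky})
```

Buffering rows per backend and committing them only after the last circuit succeeds is what keeps a failure on a later circuit from leaving partial rows in the table.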

Review methodology

Before submission this branch was subjected to an adversarial multi-agent review. A set
of independent reviewer agents each examined the implementation from a distinct,
non-overlapping angle:

- Acceptance-criteria conformance: every requirement in issue #5 and
  CONTRIBUTING.md §5 verified at the terminal, including a byte-for-byte check of the
  rendered table against the documented format.
- Adversarial input / edge-case fuzzing: malformed CLI arguments, degenerate
  executors (empty counts, NaN/inf timings, tied outcomes), and determinism verified
  across independent processes.
- Code-quality and house-style audit: consistency with the SDK's conventions,
  ruff and mypy --strict conformance, and idiom review against the surrounding code.
- Mutation testing of the test suite: production lines were deliberately broken to
  confirm a test fails. Six of eight mutations were caught by the pre-existing tests;
  the two survivors revealed documented-but-undefended contracts.

The follow-up commit closes those two gaps:

- The "all backends fail → non-zero exit" CLI contract (previously, hardcoding a 0
  exit passed every test).
- The shots column records result.shots (what the backend actually ran) rather
  than the requested value, so a shot-capped hardware run is never silently misreported.

It also adds a CONTRIBUTING.md §5 scope note (shot-fidelity and queue-overhead
comparisons are out of scope; device queue status is exposed separately via
BaseExecutor.get_status()) and a comment explaining why _build_executor does not
route through ExecutorFactory.
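
For illustration, the two contracts closed by the follow-up commit could be pinned by tests along these lines. FakeResult, make_row, and exit_code are hypothetical names for this sketch, not the helpers in suite.py; only the existence of a result.shots attribute comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Hypothetical result object; only .counts and .shots are assumed."""

    counts: dict[str, int]
    shots: int


def make_row(
    backend: str, circuit: str, requested_shots: int, result: FakeResult
) -> dict[str, object]:
    # requested_shots is deliberately ignored for the shots column: the
    # table reports what the backend actually ran.
    return {"backend": backend, "circuit": circuit, "shots": result.shots}


def exit_code(total_backends: int, failed_backends: int) -> int:
    # Non-zero only when every backend failed.
    return 1 if total_backends > 0 and failed_backends == total_backends else 0


def test_shots_column_reports_executed_shots() -> None:
    # A backend that caps a 1000-shot request at 100 shots.
    capped = FakeResult(counts={"00": 60, "11": 40}, shots=100)
    row = make_row("capped-hw", "bell", requested_shots=1000, result=capped)
    assert row["shots"] == 100


def test_all_backends_failing_is_non_zero() -> None:
    assert exit_code(total_backends=2, failed_backends=2) != 0
    assert exit_code(total_backends=2, failed_backends=1) == 0
```

A mutation that makes exit_code return 0 unconditionally, or that writes requested_shots into the row, fails one of these tests. That is the property the surviving mutations showed was missing.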

Type of change

Other (describe): benchmark tooling + documentation

Testing

I ran pytest tests/ -v and tests pass — 340 passed, 13 skipped
Benchmark suite: 22 tests pass; verified end-to-end against the real
LocalExecutor (no credentials). ruff and mypy --strict clean.

Test details: parametrised top_outcomes ranking/tie-break cases, byte-for-byte §5
format match, atomic skip-and-log (including failure on a later circuit, proving no
partial rows leak), seeded reproducibility across independent runs, a pinned exact-count
regression guard, and the two contract tests above. Each newly added test was
mutation-verified to fail when its target line is broken.
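
As a small illustration of the ranking and JSON-ordering point, the sketch below ranks outcomes by count and shows how sort_keys would reorder them. top_outcomes_sketch is a hypothetical helper, and breaking ties by ascending bitstring is an assumption; the tie-break rule actually used by top_outcomes is not stated in this description.

```python
import json


def top_outcomes_sketch(counts: dict[str, int], k: int = 3) -> dict[str, int]:
    """Keep the k most frequent outcomes, highest count first."""
    # Assumed tie-break: equal counts are ordered by bitstring, ascending.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])  # dicts preserve insertion order


counts = {"000": 480, "111": 505, "010": 15, "101": 15}
top = top_outcomes_sketch(counts)

ranked_json = json.dumps(top)
# '{"111": 505, "000": 480, "010": 15}'  (ranking preserved)

resorted_json = json.dumps(top, sort_keys=True)
# '{"000": 480, "010": 15, "111": 505}'  (ranking lost)
```

Because json.dumps emits keys in insertion order by default, serialising the ranked dict without sort_keys keeps the count-descending order intact.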

Checklist

No hardcoded credentials or API keys
Handles the canonical gate set from CONTRIBUTING.md §1 — the random circuit is
restricted to H, X, Y, Z, S, T, CNOT, asserted by a test.
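
A seeded generator restricted to that gate set might look like the sketch below. It is a simplified illustration under stated assumptions: random_circuit_sketch is a hypothetical name, the 3-qubit width and the 30% CNOT probability are invented for the example, and "depth" is approximated as one gate per layer rather than the construction used in suite.py.

```python
import random

SINGLE_QUBIT_GATES = ("H", "X", "Y", "Z", "S", "T")
CANONICAL_GATES = frozenset(SINGLE_QUBIT_GATES) | {"CNOT"}

Op = tuple[str, tuple[int, ...]]


def random_circuit_sketch(seed: int, n_qubits: int = 3, depth: int = 5) -> list[Op]:
    """Return `depth` gate applications drawn only from the canonical set."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    ops: list[Op] = []
    for _ in range(depth):
        if n_qubits >= 2 and rng.random() < 0.3:
            # Two distinct qubits for the control and the target.
            control, target = rng.sample(range(n_qubits), 2)
            ops.append(("CNOT", (control, target)))
        else:
            gate = rng.choice(SINGLE_QUBIT_GATES)
            ops.append((gate, (rng.randrange(n_qubits),)))
    return ops


circuit_a = random_circuit_sketch(seed=1234)
circuit_b = random_circuit_sketch(seed=1234)
```

Using a private random.Random(seed) instance means two calls with the same seed produce the same circuit regardless of what else has touched the global RNG, which is the property a reproducibility test would rely on.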

🤖 Reviewed and hardened with Claude Code

Implements issue marqov-dev#5. benchmarks/suite.py runs Bell, 3-qubit GHZ, and a deterministic depth-5 random circuit against any executor and prints a comparison table in the CONTRIBUTING.md §5 markdown format. Works out of the box with LocalExecutor, no credentials required.

- run_suite() is executor-agnostic; the CLI wires up the zero-credential local executor and supports --shots/--seed for fully reproducible runs.
- On executor error the whole backend is skipped atomically (no partial rows), the backend + failing circuit are logged to stderr, and the suite continues; exit is non-zero only when every backend fails.
- top_3_outcomes are ordered by count descending (no JSON key re-sorting).
- format_table renders the exact §5 markdown table and is empty-input safe.
- Tests: parametrised top_outcomes, real-LocalExecutor end-to-end with pinned physics invariants + an exact-value regression guard, atomic skip-and-log, and CLI error paths. ruff + ruff format + mypy --strict clean.

Closes marqov-dev#5

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

These changes follow an adversarial multi-agent review of the benchmark suite. Independent reviewer agents each attacked the implementation from a distinct angle: acceptance-criteria conformance against issue marqov-dev#5, edge-case and adversarial-input fuzzing, a code-quality and house-style audit, and mutation testing of the existing test suite (deliberately breaking production lines to confirm a test fails).

Six of eight mutations were already caught by the existing tests; the two survivors exposed documented-but-undefended contracts, closed here:

- Add a test for the documented "all backends fail -> non-zero exit" CLI contract. It was undefended: returning 0 unconditionally passed every test.
- Add a test that the shots column records result.shots (what the backend actually ran), not the requested value, so a shot-capped hardware run is never silently misreported.

Also:

- CONTRIBUTING.md section 5: note shot-fidelity/queue-overhead are out of scope for this harness; device queue status is available separately via BaseExecutor.get_status().
- suite.py: comment on _build_executor explaining why it bypasses ExecutorFactory (richer backends need provider config a zero-arg CLI can't supply).

Full repo suite: 340 passed, 13 skipped. ruff and mypy --strict clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

AsherNoble and others added 2 commits June 6, 2026 23:20

AsherNoble wants to merge 2 commits into marqov-dev:main from AsherNoble:feat/benchmark-suite
