feat(benchmarks): unified multi-backend benchmark suite (#5) #34
Open
AsherNoble wants to merge 2 commits into
Implements issue marqov-dev#5.

benchmarks/suite.py runs Bell, 3-qubit GHZ, and a deterministic depth-5 random circuit against any executor and prints a comparison table in the CONTRIBUTING.md §5 markdown format. Works out of the box with LocalExecutor; no credentials required.

- run_suite() is executor-agnostic; the CLI wires up the zero-credential local executor and supports --shots/--seed for fully reproducible runs.
- On executor error the whole backend is skipped atomically (no partial rows), the backend + failing circuit are logged to stderr, and the suite continues; exit is non-zero only when every backend fails.
- top_3_outcomes are ordered by count descending (no JSON key re-sorting).
- format_table renders the exact §5 markdown table and is empty-input safe.
- Tests: parametrised top_outcomes, real-LocalExecutor end-to-end with pinned physics invariants + an exact-value regression guard, atomic skip-and-log, and CLI error paths.

ruff + ruff format + mypy --strict clean.

Closes marqov-dev#5

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
These changes follow an adversarial multi-agent review of the benchmark suite. Independent reviewer agents each attacked the implementation from a distinct angle: acceptance-criteria conformance against issue marqov-dev#5, edge-case and adversarial-input fuzzing, a code-quality and house-style audit, and mutation testing of the existing test suite (deliberately breaking production lines to confirm a test fails).

Six of eight mutations were already caught by the existing tests; the two survivors exposed documented-but-undefended contracts, closed here:

- Add a test for the documented "all backends fail -> non-zero exit" CLI contract. It was undefended: returning 0 unconditionally passed every test.
- Add a test that the shots column records result.shots (what the backend actually ran), not the requested value, so a shot-capped hardware run is never silently misreported.

Also:

- CONTRIBUTING.md section 5: note shot-fidelity/queue-overhead are out of scope for this harness; device queue status is available separately via BaseExecutor.get_status().
- suite.py: comment on _build_executor explaining why it bypasses ExecutorFactory (richer backends need provider config a zero-arg CLI can't supply).

Full repo suite: 340 passed, 13 skipped. ruff and mypy --strict clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Closes #5.
Adds `benchmarks/suite.py`, a unified harness that runs a fixed set of reference circuits (Bell, 3-qubit GHZ, deterministic depth-5 random) against any configured executor and prints the comparison table documented in CONTRIBUTING.md §5.

- Runs with no credentials: `python benchmarks/suite.py --executor local --shots 1000 --seed 1234`.
- `run_suite()` accepts any `name → BaseExecutor` mapping, so cloud backends can be benchmarked programmatically; the CLI wires only the zero-credential `local` executor.
- `top_3_outcomes` are kept in count-descending order (no `sort_keys`, which would corrupt the ranking).
- On executor error the whole backend is skipped (no partial rows leak), the backend and failing circuit are logged to stderr, and the run continues. Exits non-zero only when every backend fails, so CI can flag a fully broken run.
- `--seed` seeds both the random circuit and the local sampler; exact outcome counts are pinned in tests against the locked numpy / quantumflow versions.
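The skip-and-continue behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in `benchmarks/suite.py`: the `Result` dataclass, the callable `Executor` shape, the row keys, and the `exit_code` helper are all hypothetical stand-ins for the real `BaseExecutor` interface, and the handling of an empty executor mapping is an assumption.

```python
from __future__ import annotations

import sys
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in for whatever an executor returns."""

    counts: dict[str, int]
    shots: int  # shots the backend actually ran; may differ from the request


# Hypothetical executor shape: a callable taking (circuit_name, requested_shots).
Executor = Callable[[str, int], Result]


def run_suite(
    executors: Mapping[str, Executor],
    circuits: Sequence[str],
    shots: int,
) -> tuple[list[dict[str, object]], list[str]]:
    """Return (rows, failed_backends); a failing backend contributes no rows."""
    rows: list[dict[str, object]] = []
    failed: list[str] = []
    for backend, execute in executors.items():
        pending: list[dict[str, object]] = []  # staged until the backend finishes
        current = "<none>"
        try:
            for current in circuits:
                result = execute(current, shots)
                # Record result.shots, not the requested value.
                pending.append(
                    {"backend": backend, "circuit": current, "shots": result.shots}
                )
        except Exception as exc:  # broad on purpose: any executor error skips the backend
            print(f"skipping {backend}: {current} failed: {exc}", file=sys.stderr)
            failed.append(backend)
            continue  # staged rows are dropped, so no partial rows leak
        rows.extend(pending)
    return rows, failed


def exit_code(executors: Mapping[str, Executor], failed: Sequence[str]) -> int:
    """Non-zero only when every backend failed (assumption: empty input exits 0)."""
    return 1 if executors and len(failed) == len(executors) else 0
```

Staging rows per backend and committing them only after the last circuit succeeds is one simple way to get the "no partial rows" property; a failure on a later circuit discards the earlier rows for that backend as well.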
Review methodology
Before submission this branch was subjected to an adversarial multi-agent review. A set
of independent reviewer agents each examined the implementation from a distinct,
non-overlapping angle:
- Acceptance-criteria conformance: issue #5 and CONTRIBUTING.md §5 verified at the terminal, including a byte-for-byte check of the rendered table against the documented format.
- Edge-case and adversarial-input fuzzing: executors returning degenerate results (empty counts, NaN/inf timings, tied outcomes), and determinism verified across independent processes.
- Code-quality and house-style audit: `ruff` and `mypy --strict` conformance, and idiom review against the surrounding code.
- Mutation testing: production lines deliberately broken to confirm a test fails. Six of eight mutations were caught by the pre-existing tests; the two survivors revealed documented-but-undefended contracts.
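To make the fuzzed ranking cases concrete (empty counts, tied outcomes), here is a minimal sketch of a count-descending ranking helper. The name `top_outcomes` appears in the test list, but its signature and tie-break rule are not shown in this PR description; the version below assumes ties are broken by bitstring in ascending order, which may differ from the real helper.

```python
import json


def top_outcomes(counts: dict[str, int], n: int = 3) -> list[tuple[str, int]]:
    """Return up to n (bitstring, count) pairs, highest count first.

    Assumption: ties are broken by bitstring in ascending order so the
    ranking is deterministic; the real helper may use a different rule.
    """
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


counts = {"11": 480, "00": 510, "01": 6, "10": 4}
ranked = dict(top_outcomes(counts))

# Plain serialization keeps the ranking, since dicts preserve insertion order.
print(json.dumps(ranked))                  # {"00": 510, "11": 480, "01": 6}
# sort_keys=True reorders by bitstring and loses the count ordering.
print(json.dumps(ranked, sort_keys=True))  # {"00": 510, "01": 6, "11": 480}
```

Empty input returns an empty list, and the explicit secondary sort key means tied counts cannot reorder between runs.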
The follow-up commit closes those two gaps:
- A test for the documented "all backends fail → non-zero exit" CLI contract (previously, an unconditional `0` exit passed every test).
- A test that the `shots` column records `result.shots` (what the backend actually ran) rather than the requested value, so a shot-capped hardware run is never silently misreported.

It also adds a CONTRIBUTING.md §5 scope note (shot-fidelity and queue-overhead comparisons are out of scope; device queue status is exposed separately via `BaseExecutor.get_status()`) and a comment explaining why `_build_executor` does not route through `ExecutorFactory`.

Type of change
Testing
- Ran `pytest tests/ -v` and tests pass: 340 passed, 13 skipped.
- Tested with `LocalExecutor` (no credentials).
- `ruff` and `mypy --strict` clean.

Test details: parametrised `top_outcomes` ranking/tie-break cases, byte-for-byte §5 format match, atomic skip-and-log (including failure on a later circuit, proving no partial rows leak), seeded reproducibility across independent runs, a pinned exact-count regression guard, and the two contract tests above. Each newly added test was mutation-verified to fail when its target line is broken.
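As an illustration of the seeded-reproducibility and gate-set checks, the sketch below builds a random gate list from a private seeded RNG. It is not the generator in `benchmarks/suite.py`: the `random_circuit` name, the `(name, qubits)` tuple representation, the 30% CNOT probability, and treating "depth" as a gate count are all assumptions made for the example.

```python
import random

SINGLE_QUBIT_GATES = ("H", "X", "Y", "Z", "S", "T")
ALLOWED_GATES = frozenset(SINGLE_QUBIT_GATES) | {"CNOT"}


def random_circuit(
    n_qubits: int, depth: int, seed: int
) -> list[tuple[str, tuple[int, ...]]]:
    """Build a list of (gate_name, qubits) from a private, seeded RNG."""
    rng = random.Random(seed)  # local RNG: no dependence on global random state
    ops: list[tuple[str, tuple[int, ...]]] = []
    for _ in range(depth):
        if n_qubits >= 2 and rng.random() < 0.3:
            # Two distinct qubits for the two-qubit gate.
            control, target = rng.sample(range(n_qubits), 2)
            ops.append(("CNOT", (control, target)))
        else:
            ops.append((rng.choice(SINGLE_QUBIT_GATES), (rng.randrange(n_qubits),)))
    return ops


# Same seed, same circuit; every gate comes from the allowed set.
first = random_circuit(n_qubits=3, depth=5, seed=1234)
second = random_circuit(n_qubits=3, depth=5, seed=1234)
assert first == second
assert all(name in ALLOWED_GATES for name, _ in first)
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions keeps the circuit independent of any other code that touches the global RNG, which is what makes a cross-process reproducibility test meaningful.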
Checklist
- CONTRIBUTING.md §1: the random circuit is restricted to `H, X, Y, Z, S, T, CNOT`, asserted by a test.

🤖 Reviewed and hardened with Claude Code