feat(benchmarks): unified multi-backend benchmark suite (#5) #34

Open
AsherNoble wants to merge 2 commits into marqov-dev:main from AsherNoble:feat/benchmark-suite


Summary

Closes #5.

Adds benchmarks/suite.py — a unified harness that runs a fixed set of reference
circuits (Bell, 3-qubit GHZ, deterministic depth-5 random) against any configured
executor and prints the comparison table documented in CONTRIBUTING.md §5.

  • Zero credentials out of the box: python benchmarks/suite.py --executor local --shots 1000 --seed 1234.
  • Executor-agnostic core: run_suite() accepts any name → BaseExecutor mapping,
    so cloud backends can be benchmarked programmatically; the CLI wires only the
    zero-credential local executor.
  • Exact §5 output: bordered markdown table; JSON outcomes preserved in
    count-descending order (no sort_keys, which would corrupt the ranking).
  • Atomic, per-backend error handling: if any circuit raises, the whole backend
    is skipped (no partial rows leak), the backend and failing circuit are logged to
    stderr, and the run continues. Exits non-zero only when every backend fails, so CI
    can flag a fully broken run.
  • Reproducible: --seed seeds both the random circuit and the local sampler;
    exact outcome counts are pinned in tests against the locked numpy / quantumflow versions.
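
The atomic skip-and-log behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in benchmarks/suite.py: run_suite_sketch, the CIRCUITS tuple, and the callable-based Executor type are hypothetical stand-ins, since the real run_suite() and BaseExecutor signatures are not shown in this description.

```python
import sys
from typing import Callable, Mapping

# Hypothetical stand-in: an "executor" is a callable that takes a circuit name
# and returns outcome counts. The real BaseExecutor interface may differ.
Executor = Callable[[str], dict[str, int]]
Row = tuple[str, str, dict[str, int]]

CIRCUITS = ("bell", "ghz3", "random_d5")  # illustrative circuit names


def run_suite_sketch(executors: Mapping[str, Executor]) -> tuple[list[Row], list[str]]:
    """Run every circuit on every backend; drop a backend entirely if any circuit raises."""
    rows: list[Row] = []
    failed: list[str] = []
    for backend, execute in executors.items():
        pending: list[Row] = []  # staged rows, committed only if all circuits succeed
        try:
            for circuit in CIRCUITS:
                pending.append((backend, circuit, execute(circuit)))
        except Exception as exc:
            # Log the backend and the failing circuit, then move on to the next backend.
            print(f"skipping {backend!r}: {circuit!r} failed: {exc}", file=sys.stderr)
            failed.append(backend)
            continue
        rows.extend(pending)
    return rows, failed


def good(circuit: str) -> dict[str, int]:
    return {"00": 500, "11": 500}


def flaky(circuit: str) -> dict[str, int]:
    # Fails on the last circuit, after two rows have already been staged.
    if circuit == "random_d5":
        raise RuntimeError("device offline")
    return {"00": 1000}


rows, failed = run_suite_sketch({"good": good, "flaky": flaky})
```

Staging rows per backend and committing them only after the last circuit succeeds is what keeps a late failure from leaving partial rows in the table.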

Review methodology

Before submission this branch was subjected to an adversarial multi-agent review. A set
of independent reviewer agents each examined the implementation from a distinct,
non-overlapping angle:

  1. Acceptance-criteria conformance: every requirement in issue #5
    ("feat(benchmarks): unified multi-backend benchmark harness") and
    CONTRIBUTING.md §5 verified at the terminal, including a byte-for-byte check of the
    rendered table against the documented format.
  2. Adversarial input / edge-case fuzzing: malformed CLI arguments, degenerate
    executors (empty counts, NaN/inf timings, tied outcomes), and determinism verified
    across independent processes.
  3. Code-quality and house-style audit: consistency with the SDK's conventions,
    ruff and mypy --strict conformance, and idiom review against the surrounding code.
  4. Mutation testing of the test suite: production lines were deliberately broken to
    confirm that a test fails. Six of eight mutations were caught by the pre-existing tests;
    the two survivors revealed documented-but-undefended contracts.

The follow-up commit closes those two gaps:

  • The "all backends fail → non-zero exit" CLI contract (previously, hardcoding a 0
    exit passed every test).
  • The shots column records result.shots — what the backend actually ran — rather
    than the requested value, so a shot-capped hardware run is never silently misreported.

It also adds a CONTRIBUTING.md §5 scope note (shot-fidelity and queue-overhead
comparisons are out of scope; device queue status is exposed separately via
BaseExecutor.get_status()) and a comment explaining why _build_executor does not
route through ExecutorFactory.
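
The two contracts closed by the follow-up commit can be expressed as small checks along these lines. This is a hedged sketch: FakeResult, shots_cell, and exit_code are hypothetical names invented for illustration, and the zero-backends case is an assumption that the description does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResult:
    """Hypothetical result object; only the shots attribute is taken from the PR text."""
    shots: int  # what the backend actually ran
    counts: dict[str, int] = field(default_factory=dict)


def shots_cell(requested_shots: int, result: FakeResult) -> int:
    # Report what was executed, not what was asked for, so a shot-capped
    # backend is not misreported as having run the full request.
    return result.shots


def exit_code(attempted: list[str], failed: list[str]) -> int:
    # Non-zero only when every attempted backend failed.
    # Assumption: an empty run (no backends attempted) exits 0.
    return 1 if attempted and len(failed) == len(attempted) else 0


def test_shots_column_uses_result_shots() -> None:
    capped = FakeResult(shots=100)  # backend capped a 1000-shot request at 100
    assert shots_cell(1000, capped) == 100


def test_all_backends_fail_exits_non_zero() -> None:
    assert exit_code(["local", "cloud"], ["local", "cloud"]) == 1
    assert exit_code(["local", "cloud"], ["cloud"]) == 0


test_shots_column_uses_result_shots()
test_all_backends_fail_exits_non_zero()
```

A mutation that hardcodes a 0 exit, or that writes the requested value into the shots column, would make one of these assertions fail, which is the property the mutation-testing pass was checking for.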

Type of change

  • Other (describe): benchmark tooling + documentation

Testing

  • I ran pytest tests/ -v and tests pass — 340 passed, 13 skipped
  • Benchmark suite: 22 tests pass; verified end-to-end against the real
    LocalExecutor (no credentials). ruff and mypy --strict clean.

Test details: parametrised top_outcomes ranking/tie-break cases, byte-for-byte §5
format match, atomic skip-and-log (including failure on a later circuit, proving no
partial rows leak), seeded reproducibility across independent runs, a pinned exact-count
regression guard, and the two contract tests above. Each newly added test was
mutation-verified to fail when its target line is broken.
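
To illustrate why the ranking must not pass through sort_keys, here is a minimal sketch of a count-descending top-3 selection. The function name top_outcomes_sketch and the tie-break rule (ascending bitstring) are assumptions; the actual top_outcomes implementation and its tie-break are not shown in this description.

```python
import json


def top_outcomes_sketch(counts: dict[str, int], n: int = 3) -> dict[str, int]:
    """Return the n most frequent outcomes, highest count first.

    Ties are broken by ascending bitstring (an assumed rule, for determinism).
    """
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n])  # dicts preserve insertion order, so the ranking survives


counts = {"00": 480, "11": 510, "01": 5, "10": 5}
top = top_outcomes_sketch(counts)

ranked_json = json.dumps(top)                     # keeps count-descending order
resorted_json = json.dumps(top, sort_keys=True)   # reorders by key, losing the ranking

print(ranked_json)    # {"11": 510, "00": 480, "01": 5}
print(resorted_json)  # {"00": 480, "01": 5, "11": 510}
```

With sort_keys=True the most frequent outcome "11" moves from first to last, so a reader of the JSON cell could no longer tell which outcome ranked highest from position alone.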

Checklist

  • No hardcoded credentials or API keys
  • Handles the canonical gate set from CONTRIBUTING.md §1 — the random circuit is
    restricted to H, X, Y, Z, S, T, CNOT, asserted by a test.
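
A seeded generator restricted to that gate set, together with the kind of assertion mentioned above, might look like the following sketch. Everything here is illustrative: random_circuit_sketch, the (gate, qubits) tuple representation, the 3-qubit default, the 0.3 CNOT probability, and the use of the standard-library random module are assumptions, and "depth" is simplified to mean the number of operations. The real suite may build circuits differently.

```python
import random

SINGLE_QUBIT_GATES = ("H", "X", "Y", "Z", "S", "T")
CANONICAL_GATES = frozenset(SINGLE_QUBIT_GATES) | {"CNOT"}

Op = tuple[str, tuple[int, ...]]  # (gate name, qubit indices); hypothetical representation


def random_circuit_sketch(seed: int, n_qubits: int = 3, depth: int = 5) -> list[Op]:
    """Build a reproducible list of gate operations from a seed."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    ops: list[Op] = []
    for _ in range(depth):
        if n_qubits >= 2 and rng.random() < 0.3:
            control, target = rng.sample(range(n_qubits), 2)  # two distinct qubits
            ops.append(("CNOT", (control, target)))
        else:
            ops.append((rng.choice(SINGLE_QUBIT_GATES), (rng.randrange(n_qubits),)))
    return ops


first = random_circuit_sketch(seed=1234)
second = random_circuit_sketch(seed=1234)

# Same seed gives the same circuit, and only canonical gates appear.
assert first == second
assert all(gate in CANONICAL_GATES for gate, _ in first)
```

Using a private random.Random instance rather than the module-level functions keeps the circuit independent of any other code that touches the global RNG, which matters for the cross-process determinism checks described earlier.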

🤖 Reviewed and hardened with Claude Code

AsherNoble and others added 2 commits June 6, 2026 23:20
Implements issue marqov-dev#5. benchmarks/suite.py runs Bell, 3-qubit GHZ, and a
deterministic depth-5 random circuit against any executor and prints a
comparison table in the CONTRIBUTING.md §5 markdown format. Works out of the
box with LocalExecutor — no credentials required.

- run_suite() is executor-agnostic; the CLI wires up the zero-credential local
  executor and supports --shots/--seed for fully reproducible runs.
- On executor error the whole backend is skipped atomically (no partial rows),
  the backend + failing circuit are logged to stderr, and the suite continues;
  exit is non-zero only when every backend fails.
- top_3_outcomes are ordered by count descending (no JSON key re-sorting).
- format_table renders the exact §5 markdown table and is empty-input safe.
- Tests: parametrised top_outcomes, real-LocalExecutor end-to-end with pinned
  physics invariants + an exact-value regression guard, atomic skip-and-log,
  and CLI error paths. ruff + ruff format + mypy --strict clean.

Closes marqov-dev#5

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

These changes follow an adversarial multi-agent review of the benchmark
suite. Independent reviewer agents each attacked the implementation from a
distinct angle: acceptance-criteria conformance against issue marqov-dev#5, edge-case
and adversarial-input fuzzing, a code-quality and house-style audit, and
mutation testing of the existing test suite (deliberately breaking production
lines to confirm a test fails). Six of eight mutations were already caught by
the existing tests; the two survivors exposed documented-but-undefended
contracts, closed here:

- Add a test for the documented "all backends fail -> non-zero exit" CLI
  contract. It was undefended: returning 0 unconditionally passed every test.
- Add a test that the shots column records result.shots (what the backend
  actually ran), not the requested value, so a shot-capped hardware run is
  never silently misreported.

Also:
- CONTRIBUTING.md section 5: note shot-fidelity/queue-overhead are out of
  scope for this harness; device queue status is available separately via
  BaseExecutor.get_status().
- suite.py: comment on _build_executor explaining why it bypasses
  ExecutorFactory (richer backends need provider config a zero-arg CLI can't
  supply).

Full repo suite: 340 passed, 13 skipped. ruff and mypy --strict clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>