Benchmarks and studies

Index of documentation for tasks, benchmark cards, the official pack, studies, and reproduction.

Tasks and cards

Document	Description
Benchmarks	Harness, tasks (A–H), metrics.
Benchmark card	Scope, tasks, baselines.
Coordination benchmark card	Coordination scale and risk (Tasks G and H).
Evaluation checklist	Baseline status, when to regenerate, full command sequence.
Scale and operational limits	Scale configs and limits.
Throughput comparison	Throughput-focused comparison (throughput_sla, scripted baseline).
Prime Intellect Inference	Environment variables, CLI smoke test, top-6 sweep, cross-provider.
GCP Prime runner	Compute Engine VM: install, background runs, fetch results.
OpenHands SWE-bench with Prime	Minimal OpenHands SWE-bench runbook with Prime preflight checks.
Benchmark results pipeline	From coordination sweeps to presentation bundles.
Hospital lab key metrics	Metrics that matter for hospital labs; SOTA leaderboard (main vs full), method-class comparison, run metadata, artifact paths, and coordination graphs in the UI bundle.
Uncertainty quantification	Epistemic vs aleatoric; metric mapping.
Generalization and limits	Tested scope, known limits, and comparison with other benchmarks.

Official pack and studies

Document	Description
Official benchmark pack	v0.1/v0.2 and run commands.
Hospital lab full pipeline	Full-pipeline script and orchestration.
Hospital lab full pipeline results	Example results report (regenerate runs as needed).
Studies and plots	Study runner, make-plots.
Coordination studies	Coordination study runner and Pareto.
LLM Coordination Protocol	LLM coordination protocol.

Reproducibility and paper

Document	Description
Determinism contract	Deterministic pipeline guarantee, RNG, canonical write, cross-version limits.
Reproduce	Minimal results and figures.
Paper claims	Paper claims regression and snapshot.
Paper provenance	Figures, tarball, commands.