This section covers tasks, benchmark cards, the official benchmark pack, studies, and reproduction.

| Document | Description |
|---|---|
| Benchmarks | Harness, tasks (A–H), metrics. |
| Benchmark card | Scope, tasks, baselines. |
| Coordination benchmark card | Coordination scale and risk (Tasks G and H). |
| Evaluation checklist | Baseline status, when to regenerate, full command sequence. |
| Scale and operational limits | Scale configs and limits. |
| Throughput comparison | Throughput-focused comparison (throughput_sla, scripted baseline). |
| Prime Intellect Inference | Environment variables, CLI smoke test, top-6 sweep, cross-provider. |
| GCP Prime runner | Compute Engine VM: install, background runs, fetch results. |
| OpenHands SWE-bench with Prime | Minimal OpenHands SWE-bench runbook with Prime preflight checks. |
| Benchmark results pipeline | From coordination sweeps to presentation bundles. |
| Hospital lab key metrics | Metrics that matter for hospital labs; SOTA leaderboard (main vs full), method-class comparison, run metadata, artifact paths, and coordination graphs in the UI bundle. |
| Uncertainty quantification | Epistemic vs aleatoric; metric mapping. |
| Generalization and limits | Tested scope, known limits, and comparison with other benchmarks. |

| Document | Description |
|---|---|
| Official benchmark pack | v0.1/v0.2 and run commands. |
| Hospital lab full pipeline | Full-pipeline script and orchestration. |
| Hospital lab full pipeline results | Example results report (regenerate runs as needed). |
| Studies and plots | Study runner, make-plots. |
| Coordination studies | Coordination study runner and Pareto. |
| LLM Coordination Protocol | LLM coordination protocol. |

| Document | Description |
|---|---|
| Determinism contract | Deterministic pipeline guarantee, RNG, canonical write, cross-version limits. |
| Reproduce | Minimal results and figures. |
| Paper claims | Paper claims regression and snapshot. |
| Paper provenance | Figures, tarball, commands. |