Benchmarks and studies

Tasks, benchmark cards, official pack, studies, and reproduction.

Tasks and cards

| Document | Description |
| --- | --- |
| Benchmarks | Harness, tasks (A–H), metrics. |
| Benchmark card | Scope, tasks, baselines. |
| Coordination benchmark card | Coord scale/risk (Task G/H). |
| Evaluation checklist | Baseline status, when to regenerate, full command sequence. |
| Scale and operational limits | Scale configs and limits. |
| Throughput comparison | Throughput-focused comparison (throughput_sla, scripted baseline). |
| Prime Intellect Inference | Env vars, CLI smoke, top-6 sweep, cross-provider. |
| GCP Prime runner | Compute Engine VM: install, background runs, fetch results. |
| OpenHands SWE-bench with Prime | Minimal OpenHands SWE-bench runbook with Prime preflight checks. |
| Benchmark results pipeline | From coordination sweeps to presentation bundles. |
| Hospital lab key metrics | Metrics that matter for hospital labs; SOTA leaderboard (main vs full), method-class comparison, run metadata, artifact paths, and coordination graphs in the UI bundle. |
| Uncertainty quantification | Epistemic vs aleatoric; metric mapping. |
| Generalization and limits | Tested scope, known limits, and comparison with other benchmarks. |

Official pack and studies

| Document | Description |
| --- | --- |
| Official benchmark pack | v0.1/v0.2 and run commands. |
| Hospital lab full pipeline | Full-pipeline script and orchestration. |
| Hospital lab full pipeline results | Example results report (regenerate runs as needed). |
| Studies and plots | Study runner, make-plots. |
| Coordination studies | Coordination study runner and Pareto. |
| LLM Coordination Protocol | LLM coordination protocol. |

Reproducibility and paper

| Document | Description |
| --- | --- |
| Determinism contract | Deterministic pipeline guarantee, RNG, canonical write, cross-version limits. |
| Reproduce | Minimal results and figures. |
| Paper claims | Paper claims regression and snapshot. |
| Paper provenance | Figures, tarball, commands. |