Scope
US2 MVP — EvalRunner upgrades: bounded parallelism, num_runs with variance diagnostic, task-result cache (EvaluationDataStore), cooperative cancellation, initial_session_file.
Priority: P1 (MVP)
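The configuration surface implied by the scope can be sketched as a small builder. This is only an illustration: `RunnerOptions` and its fields are hypothetical names, the default of 1 for `num_runs` is an assumption, and the real `EvalRunner` may expose these options differently. The part taken from this issue is that parallelism defaults to 1 and that a value of 0 is rejected by panicking.

```rust
/// Hypothetical option bag; the real EvalRunner builder may look different.
#[derive(Debug, Clone, PartialEq)]
pub struct RunnerOptions {
    pub parallelism: usize,
    pub num_runs: usize,
}

impl Default for RunnerOptions {
    fn default() -> Self {
        // parallelism = 1 matches the sequential runner.
        // num_runs = 1 is an assumption, not stated in the issue.
        Self { parallelism: 1, num_runs: 1 }
    }
}

impl RunnerOptions {
    /// Sets the maximum number of cases evaluated concurrently.
    /// Panics when `n == 0`, since a zero-width pool could never make progress.
    pub fn with_parallelism(mut self, n: usize) -> Self {
        assert!(n > 0, "parallelism must be at least 1");
        self.parallelism = n;
        self
    }

    /// Sets how many times the judge is dispatched per case.
    pub fn with_num_runs(mut self, n: usize) -> Self {
        self.num_runs = n;
        self
    }
}
```

With this shape, `RunnerOptions::default().with_parallelism(4).with_num_runs(3)` describes the configuration used by the end-to-end test in this issue.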
Acceptance
- parallelism=1 is behaviorally identical to the existing sequential runner.
- Constructing a runner with parallelism=0 is rejected.
- A cache hit reuses the cached Invocation across all num_runs iterations, so any variance across runs is judge-side only.
- Cancellation mid-run produces partial results with a cancellation indicator; no hangs.
- initial_session_file is loaded before each case.
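The cancellation criterion can be illustrated with a small synchronous sketch. It is not the runner's implementation: `CancelFlag`, `RunReport`, and `run_cases` are hypothetical names, and an `AtomicBool` stands in for the async token that the real runner would poll with `tokio::select!`. The sketch checks the flag only between cases, whereas the issue also requires in-flight agent and judge calls to honor the token.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical cancellation flag; a std-only stand-in for an async token.
#[derive(Clone, Default)]
pub struct CancelFlag(Arc<AtomicBool>);

impl CancelFlag {
    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Results gathered before the run finished or was cancelled.
#[derive(Debug, PartialEq)]
pub struct RunReport {
    pub completed: Vec<String>,
    /// The "cancellation indicator" from the acceptance criteria.
    pub cancelled: bool,
}

/// Evaluates cases in order and checks the flag before starting each one.
/// On cancellation it returns whatever has completed so far instead of blocking.
pub fn run_cases<F>(case_ids: &[&str], flag: &CancelFlag, mut eval_case: F) -> RunReport
where
    F: FnMut(&str) -> String,
{
    let mut completed = Vec::new();
    for &id in case_ids {
        if flag.is_cancelled() {
            return RunReport { completed, cancelled: true };
        }
        completed.push(eval_case(id));
    }
    RunReport { completed, cancelled: false }
}
```

Returning a report with a flag, rather than an error, keeps the partial results usable by the caller; whether the real runner models this as a field or as an enum variant is not specified here.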
References
- Spec FR-036, FR-037, FR-038, FR-039, FR-040
- Success criteria SC-002, SC-003
- Research R-009, R-013, R-020, R-023
Depends on
#747 (foundational — needs CaseFingerprint).
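To make the dependency concrete, here is a rough sketch of how a CaseFingerprint could feed the cache's disk layout `<root>/<eval_set_id>/<case_id>/<fingerprint_hex>.json`. The field list comes from this issue; the field types, the `to_hex` and `cache_path` names, and the use of `DefaultHasher` are assumptions for illustration. `DefaultHasher` is not guaranteed stable across Rust releases, so a persistent cache would need a fixed, versioned hash function instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

/// Fields named in the issue; the concrete types are assumptions.
#[derive(Debug, Clone, Hash)]
pub struct CaseFingerprint {
    pub case_id: String,
    pub system_prompt: String,
    pub user_messages: Vec<String>,
    pub initial_session: Option<String>,
    pub tool_set_hash: String,
    pub agent_model: String,
}

impl CaseFingerprint {
    /// Hashes every field in declaration order and renders 16 hex digits.
    /// Illustration only: DefaultHasher output may change between Rust versions.
    pub fn to_hex(&self) -> String {
        let mut hasher = DefaultHasher::new();
        self.hash(&mut hasher);
        format!("{:016x}", hasher.finish())
    }
}

/// Builds <root>/<eval_set_id>/<case_id>/<fingerprint_hex>.json.
pub fn cache_path(root: &Path, eval_set_id: &str, fp: &CaseFingerprint) -> PathBuf {
    root.join(eval_set_id)
        .join(&fp.case_id)
        .join(format!("{}.json", fp.to_hex()))
}
```

Because every listed field participates in the hash, changing the agent model or the system prompt yields a different file name, so stale results are never read back for a modified case.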
Tasks
Parallelism + num_runs
- eval/tests/runner_parallelism_test.rs
- with_parallelism(n) via tokio::sync::Semaphore; panic on n == 0; default 1
- eval/tests/runner_num_runs_test.rs: cached invocation shared across all N runs (Q2 clarification)
- with_num_runs(n); loop judge dispatch N times per case against the same Invocation; compute std_dev into RunnerMetricSample
Cache
- eval/tests/cache_test.rs
- EvaluationDataStore trait + StoreError
- LocalFileTaskResultStore with disk layout <root>/<eval_set_id>/<case_id>/<fingerprint_hex>.json
- CaseFingerprint canonicalization (case_id, system_prompt, user_messages, initial_session, tool_set_hash, agent_model) → CacheKey; wire into EvalRunner::run_set
- with_cache(store) on EvalRunner
Cancellation
- eval/tests/runner_cancel_test.rs: partial result on cancel, in-flight agent + judge honor token
- with_cancellation(tok): tokio::select! at every await point
Initial session
- eval/tests/runner_initial_session_test.rs
- with_initial_session_file(path): JSON format per research §R-023 matching spec-034 SessionState
Integration
- eval/tests/us2_end_to_end_test.rs: 20-case / parallelism 4 / num_runs 3 / cache hit → second-run wall-clock ≤ 20% of first with agent invocation count = 0 (SC-002, SC-003)
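The num_runs behavior (one shared invocation, N judge dispatches, a std_dev diagnostic) can be sketched as follows. The `Invocation` and `ScoreSummary` types and both function names are hypothetical stand-ins, and the choice of population standard deviation (dividing by N) is an assumption; the issue does not say whether std_dev in RunnerMetricSample is the population or the sample form.

```rust
/// Hypothetical stand-in for the cached agent output of one case.
#[derive(Debug, Clone, PartialEq)]
pub struct Invocation {
    pub transcript: String,
}

/// Mean and spread of the judge scores for one case.
#[derive(Debug, Clone, PartialEq)]
pub struct ScoreSummary {
    pub mean: f64,
    pub std_dev: f64,
}

/// Population standard deviation (divides by N); returns None for no scores.
pub fn summarize(scores: &[f64]) -> Option<ScoreSummary> {
    if scores.is_empty() {
        return None;
    }
    let n = scores.len() as f64;
    let mean = scores.iter().sum::<f64>() / n;
    let variance = scores.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / n;
    Some(ScoreSummary { mean, std_dev: variance.sqrt() })
}

/// Dispatches the judge `num_runs` times against one shared invocation.
/// The agent is never called here, so any spread in scores is judge-side.
pub fn judge_n_times<J>(invocation: &Invocation, num_runs: usize, mut judge: J) -> Option<ScoreSummary>
where
    J: FnMut(&Invocation, usize) -> f64,
{
    let scores: Vec<f64> = (0..num_runs).map(|run| judge(invocation, run)).collect();
    summarize(&scores)
}
```

For example, judge scores of 2, 4, 4, 4, 5, 5, 7, 9 over eight runs give a mean of 5 and a population standard deviation of 2.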