[043][Phase 4][US2] Runner upgrades: parallelism, num_runs, cache, cancellation, initial_session

## Scope

**US2 MVP** — `EvalRunner` upgrades: bounded parallelism, `num_runs` with variance diagnostic, task-result cache (`EvaluationDataStore`), cooperative cancellation, `initial_session_file`.

## Priority: P1 (MVP)

## Tasks

**Parallelism + num_runs**

- [ ] T090 [P] [US2] Tests in `eval/tests/runner_parallelism_test.rs`
- [ ] T091 [US2] `with_parallelism(n)` via `tokio::sync::Semaphore`; panic on `n == 0`; default 1
- [ ] T092 [P] [US2] Tests in `eval/tests/runner_num_runs_test.rs` — cached invocation shared across all N runs (Q2 clarification)
- [ ] T093 [US2] `with_num_runs(n)`; loop judge dispatch N times per case against same `Invocation`; compute `std_dev` into `RunnerMetricSample`

**Cache**

- [ ] T094 [P] [US2] Tests in `eval/tests/cache_test.rs`
- [ ] T095 [US2] `EvaluationDataStore` trait + `StoreError`
- [ ] T096 [US2] `LocalFileTaskResultStore` with disk layout `<root>/<eval_set_id>/<case_id>/<fingerprint_hex>.json`
- [ ] T097 [US2] `CaseFingerprint` canonicalization (case_id, system_prompt, user_messages, initial_session, tool_set_hash, agent_model) → `CacheKey`; wire into `EvalRunner::run_set`
- [ ] T098 [US2] `with_cache(store)` on `EvalRunner`

**Cancellation**

- [ ] T099 [P] [US2] Tests in `eval/tests/runner_cancel_test.rs` — partial result on cancel, in-flight agent + judge honor token
- [ ] T100 [US2] `with_cancellation(tok)` — `tokio::select!` at every await point

**Initial session**

- [ ] T101 [P] [US2] Tests in `eval/tests/runner_initial_session_test.rs`
- [ ] T102 [US2] `with_initial_session_file(path)` — JSON format per research §R-023 matching spec-034 `SessionState`

**Integration**

- [ ] T103 [US2] `eval/tests/us2_end_to_end_test.rs` — 20-case / parallelism 4 / num_runs 3 / cache hit → second-run wall-clock ≤ 20% of first with agent invocation count = 0 (SC-002, SC-003)

## Acceptance

- `parallelism=1` is behaviorally identical to the existing sequential runner.
- `parallelism=0` construction is rejected.
- Cache hit reuses the cached `Invocation` across all `num_runs` iterations; judge-side variance only.
- Cancellation mid-run produces partial results with a cancellation indicator; no hangs.
- `initial_session_file` is loaded before each case.

## References

- Spec FR-036, FR-037, FR-038, FR-039, FR-040
- Success criteria SC-002, SC-003
- Research R-009, R-013, R-020, R-023

## Depends on

#747 (foundational — needs `CaseFingerprint`).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[043][Phase 4][US2] Runner upgrades: parallelism, num_runs, cache, cancellation, initial_session #753

Scope

Priority: P1 (MVP)

Tasks

Acceptance

References

Depends on

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[043][Phase 4][US2] Runner upgrades: parallelism, num_runs, cache, cancellation, initial_session #753

Description

Scope

Priority: P1 (MVP)

Tasks

Acceptance

References

Depends on

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions