[043][Phase 4][US2] Runner upgrades: parallelism, num_runs, cache, cancellation, initial_session #753

@jwesleye

Description

Scope

US2 (MVP) EvalRunner upgrades: bounded parallelism, num_runs with a variance diagnostic, a task-result cache (EvaluationDataStore), cooperative cancellation, and initial_session_file.

Priority: P1 (MVP)

Tasks

Parallelism + num_runs

  • T090 [P] [US2] Tests in eval/tests/runner_parallelism_test.rs
  • T091 [US2] with_parallelism(n) via tokio::sync::Semaphore; panic on n == 0; default 1
  • T092 [P] [US2] Tests in eval/tests/runner_num_runs_test.rs — cached invocation shared across all N runs (Q2 clarification)
  • T093 [US2] with_num_runs(n); loop judge dispatch N times per case against same Invocation; compute std_dev into RunnerMetricSample
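A minimal sketch of the builder validation in T091 and the spread computation in T093. `RunnerConfig`, its fields, and the free function `std_dev` are hypothetical names used for illustration only; the issue does not say whether std_dev is the population or sample form, nor what the num_runs default is, so the population form and a default of 1 are assumed here.

```rust
/// Hypothetical stand-in for the runner's builder state.
/// Field names are assumptions, not the real EvalRunner layout.
#[derive(Debug, Clone, PartialEq)]
pub struct RunnerConfig {
    pub parallelism: usize,
    pub num_runs: usize,
}

impl Default for RunnerConfig {
    fn default() -> Self {
        // T091 states a default parallelism of 1; num_runs = 1 is assumed.
        Self { parallelism: 1, num_runs: 1 }
    }
}

impl RunnerConfig {
    /// Panics on n == 0, as T091 requires.
    pub fn with_parallelism(mut self, n: usize) -> Self {
        assert!(n > 0, "parallelism must be at least 1");
        self.parallelism = n;
        self
    }

    /// Stores the judge repeat count. The issue does not specify
    /// validation for num_runs, so none is performed here.
    pub fn with_num_runs(mut self, n: usize) -> Self {
        self.num_runs = n;
        self
    }
}

/// Population standard deviation of the judge scores collected over the
/// N runs of one case. Returns 0.0 for an empty slice.
pub fn std_dev(scores: &[f64]) -> f64 {
    if scores.is_empty() {
        return 0.0;
    }
    let n = scores.len() as f64;
    let mean = scores.iter().sum::<f64>() / n;
    let variance = scores.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / n;
    variance.sqrt()
}
```

Because the same Invocation is judged N times, a non-zero value here reflects judge-side variance only.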

Cache

  • T094 [P] [US2] Tests in eval/tests/cache_test.rs
  • T095 [US2] EvaluationDataStore trait + StoreError
  • T096 [US2] LocalFileTaskResultStore with disk layout <root>/<eval_set_id>/<case_id>/<fingerprint_hex>.json
  • T097 [US2] CaseFingerprint canonicalization (case_id, system_prompt, user_messages, initial_session, tool_set_hash, agent_model) → CacheKey; wire into EvalRunner::run_set
  • T098 [US2] with_cache(store) on EvalRunner
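A rough sketch of how the fingerprint fields in T097 could be canonicalized and mapped onto the disk layout in T096. `FingerprintInput`, `fingerprint_hex`, and `cache_path` are hypothetical names. `DefaultHasher` is used only to keep the example std-only; its output is not guaranteed stable across Rust releases, so a real on-disk cache would likely want a fixed digest instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::path::{Path, PathBuf};

/// Hypothetical view of the inputs listed in T097.
#[derive(Debug, Clone, Copy)]
pub struct FingerprintInput<'a> {
    pub case_id: &'a str,
    pub system_prompt: &'a str,
    pub user_messages: &'a [&'a str],
    pub initial_session: Option<&'a str>,
    pub tool_set_hash: &'a str,
    pub agent_model: &'a str,
}

/// Length-prefix each field so that ("ab", "c") and ("a", "bc")
/// cannot produce the same byte stream.
fn write_field(h: &mut DefaultHasher, bytes: &[u8]) {
    h.write_u64(bytes.len() as u64);
    h.write(bytes);
}

impl FingerprintInput<'_> {
    /// Hashes the fields in a fixed order and returns 16 hex digits.
    pub fn fingerprint_hex(&self) -> String {
        let mut h = DefaultHasher::new();
        write_field(&mut h, self.case_id.as_bytes());
        write_field(&mut h, self.system_prompt.as_bytes());
        h.write_u64(self.user_messages.len() as u64);
        for m in self.user_messages {
            write_field(&mut h, m.as_bytes());
        }
        // Tag byte distinguishes "no session" from "empty session".
        match self.initial_session {
            Some(s) => {
                h.write_u8(1);
                write_field(&mut h, s.as_bytes());
            }
            None => h.write_u8(0),
        }
        write_field(&mut h, self.tool_set_hash.as_bytes());
        write_field(&mut h, self.agent_model.as_bytes());
        format!("{:016x}", h.finish())
    }
}

/// Builds <root>/<eval_set_id>/<case_id>/<fingerprint_hex>.json.
pub fn cache_path(root: &Path, eval_set_id: &str, case_id: &str, fingerprint_hex: &str) -> PathBuf {
    root.join(eval_set_id)
        .join(case_id)
        .join(format!("{fingerprint_hex}.json"))
}
```

Changing any one field, for example agent_model, yields a different file name under the same case directory, so stale entries are never read back for a changed configuration.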

Cancellation

  • T099 [P] [US2] Tests in eval/tests/runner_cancel_test.rs — partial result on cancel, in-flight agent + judge honor token
  • T100 [US2] with_cancellation(tok) via tokio::select! at every await point
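To illustrate the partial-result contract only, here is a std-only sketch that checks a shared flag between cases. It is not the tokio::select! design named in T100: an `AtomicBool` stands in for the cancellation token, and `PartialRun` and `run_until_cancelled` are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical result shape: scores gathered so far plus a
/// cancellation indicator.
#[derive(Debug, PartialEq)]
pub struct PartialRun {
    pub scores: Vec<f64>,
    pub cancelled: bool,
}

/// Evaluates cases in order and stops before the next case once the
/// flag is set. Everything completed up to that point is returned.
pub fn run_until_cancelled<F>(case_ids: &[&str], cancel: &AtomicBool, mut eval: F) -> PartialRun
where
    F: FnMut(&str) -> f64,
{
    let mut scores = Vec::with_capacity(case_ids.len());
    for &id in case_ids {
        if cancel.load(Ordering::SeqCst) {
            return PartialRun { scores, cancelled: true };
        }
        scores.push(eval(id));
    }
    PartialRun { scores, cancelled: false }
}
```

The async version would additionally interrupt in-flight agent and judge calls, which a between-case check like this cannot do.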

Initial session

  • T101 [P] [US2] Tests in eval/tests/runner_initial_session_test.rs
  • T102 [US2] with_initial_session_file(path) — JSON format per research §R-023 matching spec-034 SessionState

Integration

  • T103 [US2] eval/tests/us2_end_to_end_test.rs — 20-case / parallelism 4 / num_runs 3 / cache hit → second-run wall-clock ≤ 20% of first with agent invocation count = 0 (SC-002, SC-003)
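The numbers in T103 can be worked through with a small helper. This is a sketch under two assumptions not stated in the issue: one judge dispatch per case per run, and exactly one agent invocation per case on a cold cache. `CallCounts`, `expected_calls`, and `within_cache_budget` are hypothetical names.

```rust
use std::time::Duration;

/// Hypothetical counters an end-to-end test might assert on.
#[derive(Debug, PartialEq)]
pub struct CallCounts {
    pub agent_invocations: usize,
    pub judge_dispatches: usize,
}

/// Expected call counts for one run_set pass. On a cache hit the agent
/// is not invoked, but judges still run num_runs times per case.
pub fn expected_calls(cases: usize, num_runs: usize, cache_hit: bool) -> CallCounts {
    CallCounts {
        agent_invocations: if cache_hit { 0 } else { cases },
        judge_dispatches: cases * num_runs,
    }
}

/// True when the second run took at most 20% of the first.
/// Written as second * 5 <= first to avoid floating-point rounding.
pub fn within_cache_budget(first: Duration, second: Duration) -> bool {
    second * 5 <= first
}
```

For 20 cases and num_runs 3 this gives 20 agent invocations and 60 judge dispatches on the first run, then 0 and 60 on the cached run.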

Acceptance

  • parallelism=1 is behaviorally identical to the existing sequential runner.
  • parallelism=0 construction is rejected.
  • Cache hit reuses the cached Invocation across all num_runs iterations; judge-side variance only.
  • Cancellation mid-run produces partial results with a cancellation indicator; no hangs.
  • initial_session_file is loaded before each case.
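The first two acceptance points can be sketched without an async runtime. The version below swaps the tokio::sync::Semaphore named in T091 for a fixed pool of scoped std threads that pull case indices from a shared counter; `run_bounded` is a hypothetical name, and results are written back by index so the output order matches the input order at any parallelism.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Evaluates every case with at most `parallelism` workers and returns
/// results in case order. Panics when parallelism is 0.
pub fn run_bounded<T, R, F>(cases: &[T], parallelism: usize, eval: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    assert!(parallelism > 0, "parallelism must be at least 1");
    // Next case index to claim; each worker takes one index at a time.
    let next = AtomicUsize::new(0);
    // One slot per case so completion order does not affect output order.
    let slots: Mutex<Vec<Option<R>>> = Mutex::new((0..cases.len()).map(|_| None).collect());

    thread::scope(|s| {
        for _ in 0..parallelism.min(cases.len()) {
            s.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= cases.len() {
                    break;
                }
                let result = eval(&cases[i]);
                slots.lock().unwrap()[i] = Some(result);
            });
        }
    });

    slots
        .into_inner()
        .unwrap()
        .into_iter()
        .map(|r| r.expect("every case is evaluated exactly once"))
        .collect()
}
```

With parallelism 1 a single worker claims indices 0, 1, 2, ... in order, which is the sequential behavior the first acceptance point asks for.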

References

  • Spec FR-036, FR-037, FR-038, FR-039, FR-040
  • Success criteria SC-002, SC-003
  • Research R-009, R-013, R-020, R-023

Depends on

#747 (foundational — needs CaseFingerprint).

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request), eval, performance (Performance improvement or regression), spec (Spec-driven implementation task)
