Priority 5: Cache Judge Responses (5-10% speedup on reruns)
Problem
When rerunning benchmarks for regression testing or model comparisons, we re-grade identical transcripts.
Solution
Cache judge responses keyed by (task_id, transcript_hash).
Implementation
```python
import hashlib
import json
from pathlib import Path

JUDGE_CACHE_DIR = Path(".judge_cache")

def cache_key(task_id: str, transcript: list) -> str:
    # Serialize deterministically (sort_keys) so identical transcripts
    # always produce the same hash regardless of dict ordering.
    transcript_str = json.dumps(transcript, sort_keys=True)
    hash_hex = hashlib.sha256(transcript_str.encode()).hexdigest()[:16]
    return f"{task_id}:{hash_hex}"

def get_cached_grade(task_id: str, transcript: list) -> dict | None:
    # Return the cached judge grade if we've graded this exact
    # transcript before; otherwise fall through to a real judge call.
    key = cache_key(task_id, transcript)
    cache_path = JUDGE_CACHE_DIR / f"{key}.json"
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    return None
```
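The read path only helps once the cache is populated, so a matching write path is needed after each fresh judge call. A minimal sketch; the function name `save_cached_grade` and the lazy `mkdir` are assumptions, not part of the original design:

```python
import hashlib
import json
from pathlib import Path

JUDGE_CACHE_DIR = Path(".judge_cache")

def cache_key(task_id: str, transcript: list) -> str:
    transcript_str = json.dumps(transcript, sort_keys=True)
    hash_hex = hashlib.sha256(transcript_str.encode()).hexdigest()[:16]
    return f"{task_id}:{hash_hex}"

def save_cached_grade(task_id: str, transcript: list, grade: dict) -> None:
    # Hypothetical write path: create the cache directory lazily and
    # store the judge's grade as JSON under the same key the reader uses.
    JUDGE_CACHE_DIR.mkdir(exist_ok=True)
    cache_path = JUDGE_CACHE_DIR / f"{cache_key(task_id, transcript)}.json"
    cache_path.write_text(json.dumps(grade))
```

Calling `save_cached_grade` immediately after grading keeps the cache warm for the next rerun without any separate population step.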
Expected Impact
- Minimal on a first run (every lookup is a cache miss)
- Significant on reruns (regression testing, model comparisons)
- Reduces judge API costs over time
Caveats
- Only cache when transcripts are deterministic; otherwise identical inputs never recur and the cache adds overhead
- May need an opt-in flag (--use-judge-cache)
- Cache must be invalidated when the rubric changes, or stale grades will be served
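One way to handle the rubric-change caveat is to fold a rubric version (or a hash of the rubric text) into the cache key itself, so a rubric change produces new keys and old entries are simply never looked up again. A sketch, assuming a `rubric_version` string is available at call sites; this parameter is not in the original design:

```python
import hashlib
import json

def cache_key(task_id: str, transcript: list, rubric_version: str) -> str:
    # Including the rubric version means a rubric change bypasses old
    # cached grades automatically, with no explicit purge step.
    transcript_str = json.dumps(transcript, sort_keys=True)
    hash_hex = hashlib.sha256(transcript_str.encode()).hexdigest()[:16]
    return f"{task_id}:{rubric_version}:{hash_hex}"
```

Stale files still occupy disk until cleaned up, but they can no longer be served as grades.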