Cache judge responses for reruns #353

@ScuttleBot

Description

Priority 5: Cache Judge Responses (5-10% speedup on reruns)

Problem

When rerunning benchmarks for regression testing or model comparisons, we re-grade identical transcripts, paying judge latency and API costs for grades we have already computed.

Solution

Cache judge responses keyed by (task_id, transcript_hash).

Implementation

import hashlib
import json
from pathlib import Path

JUDGE_CACHE_DIR = Path(".judge_cache")

def cache_key(task_id: str, transcript: list) -> str:
    # Hash the canonical JSON form so dict key order in the transcript
    # doesn't change the key.
    transcript_str = json.dumps(transcript, sort_keys=True)
    hash_hex = hashlib.sha256(transcript_str.encode()).hexdigest()[:16]
    return f"{task_id}:{hash_hex}"

def get_cached_grade(task_id: str, transcript: list) -> dict | None:
    key = cache_key(task_id, transcript)
    cache_path = JUDGE_CACHE_DIR / f"{key}.json"
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    return None
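The sketch above only covers the read path. A matching write path might look like the following; `set_cached_grade` is my name for the hypothetical helper, not something already in the codebase:

```python
import hashlib
import json
from pathlib import Path

JUDGE_CACHE_DIR = Path(".judge_cache")

def cache_key(task_id: str, transcript: list) -> str:
    transcript_str = json.dumps(transcript, sort_keys=True)
    hash_hex = hashlib.sha256(transcript_str.encode()).hexdigest()[:16]
    return f"{task_id}:{hash_hex}"

def set_cached_grade(task_id: str, transcript: list, grade: dict) -> None:
    # Create the cache directory lazily on first write.
    JUDGE_CACHE_DIR.mkdir(exist_ok=True)
    cache_path = JUDGE_CACHE_DIR / f"{cache_key(task_id, transcript)}.json"
    cache_path.write_text(json.dumps(grade))
```

The grading loop would call `get_cached_grade` first and fall through to the judge plus `set_cached_grade` on a miss.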

Expected Impact

  • Minimal on first runs (every transcript is a cache miss)
  • Significant for reruns (regression testing)
  • Reduces judge API costs over time

Caveats

  • Only cache when judging is deterministic (e.g. temperature 0); otherwise cached grades hide run-to-run variance
  • May need an opt-in flag (--use-judge-cache)
  • Cache must be invalidated when the grading rubric changes
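One way to handle the rubric-invalidation caveat is to fold a hash of the rubric text into the key, so editing the rubric naturally misses all old entries. This is a sketch; the `rubric` parameter is an assumption on top of the issue's proposal:

```python
import hashlib
import json

def cache_key(task_id: str, transcript: list, rubric: str) -> str:
    # A rubric change produces entirely new keys, so stale grades are
    # never served; orphaned entries can be garbage-collected later.
    transcript_hash = hashlib.sha256(
        json.dumps(transcript, sort_keys=True).encode()
    ).hexdigest()[:16]
    rubric_hash = hashlib.sha256(rubric.encode()).hexdigest()[:8]
    return f"{task_id}:{transcript_hash}:{rubric_hash}"
```

The alternative, storing a rubric version inside each cache file and checking it on read, works too but leaves stale files behind that still have to be skipped.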
