Cache judge responses for reruns #353

@ScuttleBot

Description

Priority 5: Cache Judge Responses (5-10% speedup on reruns)

Problem

When rerunning benchmarks for regression testing or model comparisons, we re-grade identical transcripts, paying judge latency and API costs for grades we have already computed.

Solution

Cache judge responses keyed by (task_id, transcript_hash).

Implementation

import hashlib
import json
from pathlib import Path

JUDGE_CACHE_DIR = Path(".judge_cache")

def cache_key(task_id: str, transcript: list) -> str:
    # Hash the canonical JSON form so dict key order in the transcript
    # doesn't change the key.
    transcript_str = json.dumps(transcript, sort_keys=True)
    hash_hex = hashlib.sha256(transcript_str.encode()).hexdigest()[:16]
    return f"{task_id}:{hash_hex}"

def get_cached_grade(task_id: str, transcript: list) -> dict | None:
    key = cache_key(task_id, transcript)
    cache_path = JUDGE_CACHE_DIR / f"{key}.json"
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    return None
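The sketch above only covers the read path. A matching write path might look like the following; `set_cached_grade` is my name for the hypothetical helper, not something already in the codebase:

```python
import hashlib
import json
from pathlib import Path

JUDGE_CACHE_DIR = Path(".judge_cache")

def cache_key(task_id: str, transcript: list) -> str:
    transcript_str = json.dumps(transcript, sort_keys=True)
    hash_hex = hashlib.sha256(transcript_str.encode()).hexdigest()[:16]
    return f"{task_id}:{hash_hex}"

def set_cached_grade(task_id: str, transcript: list, grade: dict) -> None:
    # Create the cache directory lazily on first write.
    JUDGE_CACHE_DIR.mkdir(exist_ok=True)
    cache_path = JUDGE_CACHE_DIR / f"{cache_key(task_id, transcript)}.json"
    cache_path.write_text(json.dumps(grade))
```

The grading loop would call `get_cached_grade` first and fall through to the judge plus `set_cached_grade` on a miss.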

Expected Impact

  • Minimal on first runs (every transcript is a cache miss)
  • Significant for reruns (regression testing)
  • Reduces judge API costs over time

Caveats

  • Only cache when judging is deterministic (e.g. temperature 0); otherwise cached grades hide run-to-run variance
  • May need an opt-in flag (--use-judge-cache)
  • Cache must be invalidated when the grading rubric changes
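One way to handle the rubric-invalidation caveat is to fold a hash of the rubric text into the key, so editing the rubric naturally misses all old entries. This is a sketch; the `rubric` parameter is an assumption on top of the issue's proposal:

```python
import hashlib
import json

def cache_key(task_id: str, transcript: list, rubric: str) -> str:
    # A rubric change produces entirely new keys, so stale grades are
    # never served; orphaned entries can be garbage-collected later.
    transcript_hash = hashlib.sha256(
        json.dumps(transcript, sort_keys=True).encode()
    ).hexdigest()[:16]
    rubric_hash = hashlib.sha256(rubric.encode()).hexdigest()[:8]
    return f"{task_id}:{transcript_hash}:{rubric_hash}"
```

The alternative, storing a rubric version inside each cache file and checking it on read, works too but leaves stale files behind that still have to be skipped.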
