feat: add --core tasks and judge result caching#359

Merged
ScuttleBot merged 1 commit into main from feat/core-tasks-and-caching
May 5, 2026

@ScuttleBot
Contributor

Summary

Two features that speed up benchmark iteration:

1. Core Tasks (--core flag)

Runs ~21 representative tasks (~17% of full suite) for quick signal:

  • At least one task per category
  • Favors automated grading for speed
  • Includes mix of difficulty levels

benchmark.py --model anthropic/claude-sonnet-4 --core
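
The selection rules above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Task`, `select_tasks`, and `uncovered_categories` are hypothetical names, and the core list is passed in as plain task IDs because the layout of the `core` section in tasks/manifest.yaml is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical minimal task record; the real task objects likely carry more fields.
    task_id: str
    category: str


def select_tasks(all_tasks, core_ids, core_only):
    """Return the tasks to run, preserving the order of the full suite."""
    if not core_only:
        return list(all_tasks)
    wanted = set(core_ids)
    known = {t.task_id for t in all_tasks}
    missing = wanted - known
    if missing:
        # Fail loudly if the core list drifts out of sync with the suite.
        raise ValueError(f"core list names unknown tasks: {sorted(missing)}")
    return [t for t in all_tasks if t.task_id in wanted]


def uncovered_categories(all_tasks, selected):
    """Categories present in the full suite that have no selected task."""
    return {t.category for t in all_tasks} - {t.category for t in selected}
```

A check like `uncovered_categories` could run in CI to confirm that the "at least one task per category" rule still holds after the suite changes.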

2. Judge Result Caching

Caches LLM judge responses by hash of (task_id, transcript, rubric, model):

  • Skips re-grading if inputs unchanged
  • Persistent cache in results/.judge_cache/
  • Logs hit rate at end of run

# Normal run with caching (default)
benchmark.py --model ...

# Disable caching
benchmark.py --model ... --no-judge-cache

# Clear cache before run
benchmark.py --model ... --clear-judge-cache
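
A content-addressed cache of this kind might look like the sketch below. It is an assumption-laden illustration rather than the implementation in scripts/lib_grading.py: the class name `JudgeCache`, the `grade_fn` callback, the choice of SHA-256, and the one-JSON-file-per-key layout are all hypothetical. Only the key inputs (task_id, transcript, rubric, model) and the hit-rate reporting come from the description above.

```python
import hashlib
import json
from pathlib import Path


class JudgeCache:
    """Hypothetical on-disk cache: one JSON file per (task_id, transcript, rubric, model)."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(task_id, transcript, rubric, model):
        # Serialize as a JSON list so field boundaries stay unambiguous
        # (plain concatenation could make different inputs collide).
        payload = json.dumps([task_id, transcript, rubric, model], ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_grade(self, task_id, transcript, rubric, model, grade_fn):
        """Return the cached result, or call grade_fn() and store what it returns."""
        path = self.cache_dir / f"{self.key(task_id, transcript, rubric, model)}.json"
        if path.exists():
            self.hits += 1
            return json.loads(path.read_text(encoding="utf-8"))
        self.misses += 1
        result = grade_fn()  # assumed to return a JSON-serializable value
        path.write_text(json.dumps(result), encoding="utf-8")
        return result

    def clear(self):
        """Delete every cached entry."""
        for entry in self.cache_dir.glob("*.json"):
            entry.unlink()

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Because the transcript is part of the key, any change in the agent's output produces a miss and a fresh judge call; only byte-identical inputs reuse an earlier grade.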

Benefits

| Scenario | Before | After |
|---|---|---|
| Full run, first time | ~2h | ~2h |
| Full run, same model | ~2h | ~30m (cache hits) |
| Quick iteration | ~2h | ~20m (--core) |
| Dev testing | ~2h | ~5m (--core + cache) |

Closes #211 (partially - batch judging in separate PR)

@kilo-code-bot
Contributor

kilo-code-bot Bot commented Apr 24, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Solid additions across both reviews. The original judge cache and --core flag are well-structured, and the new Axiom observability layer properly opts in via env var (silently no-ops without AXIOM_API_TOKEN). The workspace snapshot fix correctly addresses the race condition between parallel grading and workspace teardown. The lib_agent.py session retry bump (15→90 attempts) and .trajectory file filter are clean targeted fixes.
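
The env-var opt-in pattern mentioned for the observability layer can be illustrated with a small sketch. It does not call Axiom and is not the code in scripts/lib_axiom.py: `NullSink`, `BufferingSink`, and `make_sink` are hypothetical names, and the buffering sink only stands in for whatever client the real module uses. The one detail taken from the review is that logging silently no-ops when AXIOM_API_TOKEN is unset.

```python
import os


class NullSink:
    """Drops every event; used when observability is not configured."""

    def emit(self, event):
        return None


class BufferingSink:
    """Hypothetical stand-in for a real client; it only collects events in memory."""

    def __init__(self, token):
        self.token = token
        self.events = []

    def emit(self, event):
        self.events.append(event)


def make_sink(env=None):
    """Return a working sink only if AXIOM_API_TOKEN is set and non-empty."""
    env = os.environ if env is None else env
    token = env.get("AXIOM_API_TOKEN")
    return BufferingSink(token) if token else NullSink()
```

Callers can then invoke `sink.emit(...)` unconditionally, which keeps the benchmark code free of "is logging enabled" checks.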

Files Reviewed (8 files)
  • scripts/benchmark.py - Axiom integration, workspace snapshot for parallel grading, score vars
  • scripts/lib_axiom.py - new Axiom logging module
  • scripts/lib_grading.py - judge cache + judge prompt clarification
  • scripts/lib_agent.py - transcript retry bump, trajectory file filter, model validation fix
  • scripts/lib_tasks.py - core_tasks field
  • tasks/manifest.yaml - core task list
  • BENCHMARK_VERSION - version bump
  • README.md - whitespace

Reviewed by claude-4.6-sonnet-20260217 · 167,850 tokens

Core tasks (~21 representative tasks for quick runs):
- New 'core' section in manifest.yaml
- At least one task per category, favoring automated grading
- Use --core flag to run only these tasks
- ~17% of full suite for fast iteration

Judge caching:
- Cache judge results by hash of (task_id, transcript, rubric, model)
- Persistent cache in results/.judge_cache/
- Skip re-grading if inputs unchanged
- --no-judge-cache to disable, --clear-judge-cache to reset
- Logs hit rate at end of run

This should significantly speed up repeated benchmark runs
during development.
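
The flags listed in this commit message could be wired up with `argparse` roughly as below. This is a sketch under assumptions: the flag names come from the message above, but the `dest` names, help strings, and the `build_parser` helper are hypothetical, and the real scripts/benchmark.py accepts options not shown here.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="benchmark.py")
    parser.add_argument("--model", required=True, help="model identifier to benchmark")
    parser.add_argument(
        "--core",
        action="store_true",
        help="run only the core task subset",
    )
    # Caching is on by default; --no-judge-cache flips judge_cache to False.
    parser.add_argument(
        "--no-judge-cache",
        dest="judge_cache",
        action="store_false",
        help="disable judge result caching",
    )
    parser.add_argument(
        "--clear-judge-cache",
        action="store_true",
        help="delete cached judge results before the run",
    )
    return parser
```

With this layout, `build_parser().parse_args(["--model", "m", "--core"])` yields `core=True`, `judge_cache=True`, and `clear_judge_cache=False`.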

olearycrew force-pushed the feat/core-tasks-and-caching branch from ec5d1f7 to 676bcf4 on May 5, 2026 02:10
ScuttleBot merged commit 91379d5 into main on May 5, 2026
2 checks passed
ScuttleBot deleted the feat/core-tasks-and-caching branch on May 5, 2026 02:10
Development

Successfully merging this pull request may close these issues.

Task: log_apache_error_summary

2 participants