feat: add --core tasks and judge result caching by ScuttleBot · Pull Request #359 · pinchbench/skill

ScuttleBot · 2026-04-24T14:53:29Z

Summary

Two speedup features for faster benchmark iteration:

1. Core Tasks (`--core` flag)

Runs ~21 representative tasks (~17% of full suite) for quick signal:

- At least one task per category
- Favors automated grading for speed
- Includes a mix of difficulty levels

benchmark.py --model anthropic/claude-sonnet-4 --core
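
The selection behind `--core` can be pictured with the sketch below. It is illustrative only: the manifest layout (a `tasks` mapping from task id to category plus a `core` list), the task names, and the helpers `select_tasks` and `uncovered_categories` are assumptions, not the actual contents of tasks/manifest.yaml or scripts/lib_tasks.py.

```python
from typing import Dict, List

# Hypothetical in-memory manifest; the real tasks/manifest.yaml may differ.
MANIFEST = {
    "tasks": {
        "calendar_basic": "productivity",
        "calendar_conflict": "productivity",
        "web_lookup": "research",
        "file_rename": "filesystem",
    },
    "core": ["calendar_basic", "web_lookup", "file_rename"],
}


def select_tasks(manifest: Dict, core_only: bool) -> List[str]:
    """Return all task ids, or only those listed under 'core'."""
    all_ids = list(manifest["tasks"])
    if not core_only:
        return all_ids
    core = set(manifest.get("core", []))
    unknown = core - set(all_ids)
    if unknown:
        raise ValueError(f"core list names unknown tasks: {sorted(unknown)}")
    # Preserve manifest order so repeated runs see the same sequence.
    return [task_id for task_id in all_ids if task_id in core]


def uncovered_categories(manifest: Dict) -> List[str]:
    """Categories with no core task (expected to be empty per the rule above)."""
    covered = {manifest["tasks"][t] for t in manifest.get("core", [])}
    return sorted(set(manifest["tasks"].values()) - covered)
```

A check like `uncovered_categories` is one way to enforce the "at least one task per category" rule whenever the core list is edited.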

2. Judge Result Caching

Caches LLM judge responses by hash of (task_id, transcript, rubric, model):

- Skips re-grading if inputs are unchanged
- Persistent cache in results/.judge_cache/
- Logs hit rate at end of run

# Normal run with caching (default)
benchmark.py --model ... 

# Disable caching
benchmark.py --model ... --no-judge-cache

# Clear cache before run
benchmark.py --model ... --clear-judge-cache
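
As a rough illustration of the caching scheme described above, the sketch below keys each judge result by a SHA-256 hash of (task_id, transcript, rubric, model) and stores it as a JSON file. The `JudgeCache` class, its method names, and the one-file-per-entry layout are assumptions made for this example; the actual code in scripts/lib_grading.py may be organized differently.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict


class JudgeCache:
    """File-backed cache keyed by a hash of the judge inputs (illustrative)."""

    def __init__(self, cache_dir: Path, enabled: bool = True) -> None:
        self.cache_dir = cache_dir
        self.enabled = enabled  # False would correspond to --no-judge-cache
        self.hits = 0
        self.misses = 0
        if enabled:
            cache_dir.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def key(task_id: str, transcript: str, rubric: str, model: str) -> str:
        # JSON-encode the fields so their boundaries are unambiguous.
        payload = json.dumps([task_id, transcript, rubric, model])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_grade(self, task_id: str, transcript: str, rubric: str,
                     model: str, grade: Callable[[], Dict]) -> Dict:
        """Return a cached result, or call grade() and store its result."""
        if not self.enabled:
            return grade()
        name = self.key(task_id, transcript, rubric, model) + ".json"
        path = self.cache_dir / name
        if path.exists():
            self.hits += 1
            return json.loads(path.read_text(encoding="utf-8"))
        self.misses += 1
        result = grade()
        path.write_text(json.dumps(result), encoding="utf-8")
        return result

    def clear(self) -> None:
        """Delete all entries (what --clear-judge-cache might do)."""
        for entry in self.cache_dir.glob("*.json"):
            entry.unlink()

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Because the transcript and rubric are part of the key, any change to either produces a different hash and forces a fresh judge call, which is what makes skipping re-grading safe when inputs are unchanged.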

Benefits

| Scenario | Before | After |
| --- | --- | --- |
| Full run, first time | ~2h | ~2h |
| Full run, same model | ~2h | ~30m (cache hits) |
| Quick iteration | ~2h | ~20m (`--core`) |
| Dev testing | ~2h | ~5m (`--core` + cache) |

Closes #211 (partially - batch judging in separate PR)

kilo-code-bot · 2026-04-24T14:54:55Z

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Solid additions across both reviews. The original judge cache and --core flag are well-structured, and the new Axiom observability layer properly opts in via env var (silently no-ops without AXIOM_API_TOKEN). The workspace snapshot fix correctly addresses the race condition between parallel grading and workspace teardown. The lib_agent.py session retry bump (15→90 attempts) and .trajectory file filter are clean targeted fixes.
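
To illustrate the kind of race the workspace snapshot addresses, here is a minimal sketch: the workspace is copied synchronously before grading is queued, so tearing down the live workspace cannot affect a grader running in parallel. The function names (`snapshot_workspace`, `grade_workspace`, `run_and_grade`) and the file-counting stand-in grader are hypothetical and are not taken from scripts/benchmark.py.

```python
import shutil
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor
from pathlib import Path


def snapshot_workspace(workspace: Path) -> Path:
    """Copy the workspace so grading no longer depends on the original."""
    snapshot = Path(tempfile.mkdtemp(prefix="grading_snapshot_"))
    shutil.copytree(workspace, snapshot, dirs_exist_ok=True)
    return snapshot


def grade_workspace(snapshot: Path) -> int:
    """Stand-in grader: counts files, then removes its own snapshot."""
    try:
        return sum(1 for p in snapshot.rglob("*") if p.is_file())
    finally:
        shutil.rmtree(snapshot, ignore_errors=True)


def run_and_grade(workspace: Path, pool: ThreadPoolExecutor) -> Future:
    # The snapshot is taken here, before the grading task is queued...
    future = pool.submit(grade_workspace, snapshot_workspace(workspace))
    # ...so removing the live workspace immediately afterwards is safe.
    shutil.rmtree(workspace)
    return future
```

Without the snapshot, the grader thread could start reading files after `shutil.rmtree(workspace)` had already run.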

Files Reviewed (8 files)

- scripts/benchmark.py - Axiom integration, workspace snapshot for parallel grading, score vars
- scripts/lib_axiom.py - new Axiom logging module
- scripts/lib_grading.py - judge cache + judge prompt clarification
- scripts/lib_agent.py - transcript retry bump, trajectory file filter, model validation fix
- scripts/lib_tasks.py - core_tasks field
- tasks/manifest.yaml - core task list
- BENCHMARK_VERSION - version bump
- README.md - whitespace

Reviewed by claude-4.6-sonnet-20260217 · 167,850 tokens

Core tasks (~21 representative tasks for quick runs):
- New 'core' section in manifest.yaml
- At least one task per category, favoring automated grading
- Use --core flag to run only these tasks
- ~17% of full suite for fast iteration

Judge caching:
- Cache judge results by hash of (task_id, transcript, rubric, model)
- Persistent cache in results/.judge_cache/
- Skip re-grading if inputs unchanged
- --no-judge-cache to disable, --clear-judge-cache to reset
- Logs hit rate at end of run

This should significantly speed up repeated benchmark runs during development.

olearycrew force-pushed the feat/core-tasks-and-caching branch from ec5d1f7 to 676bcf4 Compare May 5, 2026 02:10

ScuttleBot merged commit 91379d5 into main May 5, 2026
2 checks passed

ScuttleBot deleted the feat/core-tasks-and-caching branch May 5, 2026 02:10

This was referenced May 5, 2026

Cache judge responses for reruns #353

Closed

feat: add judge response caching for faster reruns (#214) #355

Closed

ScuttleBot merged 1 commit into main from feat/core-tasks-and-caching
