A benchmark for whether AI agents can repair tabular preprocessing plans — not just write them from scratch. Each of 20 tasks is grounded in a real Kaggle competition and gives an agent the competition context, a dataset profile, a fixed bank of candidate preprocessing actions, and a scenario describing which actions are already active (some useful, some harmful). The agent must propose which actions to add and which prior actions to remove, and is scored deterministically against ground truth distilled from top-scoring public Kaggle solutions.
Six systems are evaluated across all 20 tasks and 4 scenario types — a human reference, a deterministic rule-based baseline, a one-shot LLM call, a generic tool-using agent, Claude Code, and a purpose-built two-pass agent — for 1,700 scored runs in total.
KaggleBench began as the cumulative team project for UIUC's CS 498 AI Agents in the Wild graduate course (Spring 2026, four-person team); my work spanned task authoring, agent implementations, the evaluation/aggregation pipeline, and the benchmark campaign. This repository is my maintained copy.
| System | Success rate | Add F1 | Remove recall | Task completion | Mean tokens/run | Mean cost/run |
|---|---|---|---|---|---|---|
| Human reference | 100.0% | 0.772 | 1.000 | 0.886 | — | — |
One-shot LLM (single_llm) |
96.25% | 0.676 | 0.878 | 0.777 | 2,034 | $0.057 |
Generic tool agent (generic_agent) |
95.00% | 0.642 | 0.889 | 0.766 | 15,855 | $0.143 |
Claude Code (claude_code) |
92.50% | 0.614 | 0.899 | 0.756 | 5,719 | $0.310 |
Two-pass agent (proposed_agent) |
91.25% | 0.676 | 0.822 | 0.749 | 20,389 | $0.196 |
Rule-based heuristics (rule_based) |
56.25% | 0.223 | 0.633 | 0.428 | — | $0.000 |
Task completion = 0.5 · Add F1 + 0.5 · Remove recall, averaged over 80 task–scenario groups per
system (success threshold 0.5). Source: agent_summary.csv,
regenerable with make aggregate.
Proposing good preprocessing is nearly solved; selectively undoing bad preprocessing is not. On from-scratch scenarios (TC1) every LLM-based method lands within 0.02 of the human reference (0.87–0.88 vs 0.89). On fault-injected and mixed-history scenarios (TC3/TC4) every method drops 0.15–0.25: remove recall falls from a uniform 1.00 on clean histories to 0.62–0.81 once harmful actions are mixed into otherwise-reasonable context. Diagnosing which prior step is hurting you is the hard, discriminating part of the benchmark.
Context engineering beat agentic scaffolding at this granularity. The one-shot LLM — a single API call over a carefully structured context (dataset profiles, action-bank summaries, scenario state) — is the best non-human system overall while using 3–10× fewer tokens and 2.5–5× less money than the agentic systems. Once the relevant evidence is precomputed into the prompt, additional tool-use autonomy mostly added cost, not accuracy.
The purpose-built agent won where its design targeted, and lost where it didn't. The two-pass add/remove agent posts the best partial-good score of any system (0.86 on TC2) and ties the best add F1, but its rollback pass is the weakest of the LLM methods (0.82 remove recall), which costs it the overall ranking. The failure mode is over-removal: pulling legitimate prior actions along with harmful ones.
task bundle ──► context builder ──► agent ──► prediction validation ──► eval ──► aggregate
(task.json, (dataset profiles, (one of (normalize/dedupe (per-run (benchmark
action bank, action summaries, six add/remove ids) scores) reports,
testcase) scenario state) systems) figures)
- Task bundles (data/tasks/) — 20 competitions spanning regression, binary/multiclass classification, and time-series forecasting. Each bundle holds the competition metadata, a candidate action bank (1,268 actions across the benchmark, median 61 per task: the gold actions plus distractors from the same action vocabulary), four scenario testcases, and a human-baseline annotation with provenance.
- Scenarios —
tc1_from_scratch(empty history),tc2_partial_good(some correct actions active),tc3_fault_injected(harmful actions active),tc4_mixed_history(both). Scenarios share the task's gold actions, so scores isolate the repair skill from task knowledge. - Agents (agent/) — shared infrastructure (bundle loading, profiling, prompt scaffolds, output validation) under six runners: rule_based, single_llm, generic_agent, claude_code, proposed_agent, plus committed human baselines. The two-pass agent separates an add-oriented recovery pass from a remove-oriented rollback pass with a final reconciliation step.
- Evaluation (eval/) — eval.py scores a single output JSON against a testcase; aggregate.py rolls runs up into per-task and benchmark-wide reports; validate_artifacts.py schema-checks every committed artifact. Scoring is pure JSON-against-JSON: deterministic, model-free, and cheap.
Design rationale, metric definitions, and the task-authoring methodology are written up in docs/DESIGN.md.
Requirements: uv. Everything below runs offline against committed artifacts — no API keys, no Kaggle account.
make setup # uv sync --dev && pre-commit install
make test # agent + evaluator unit test suites
make demo # score a committed human-baseline run end to end
make validate # schema-check all committed benchmark artifactsmake demo scores the Zillow human baseline against its from-scratch testcase and prints the
full metric breakdown (add F1 0.667, remove recall 1.000, task completion 0.833 — the same row
you'll find in the Zillow report).
Agent runs need an Anthropic API key and local copies of the competition data:
make data TASK=spaceship-titanic # kaggle CLI download (accept rules on kaggle.com first)Raw competition data is licensed by Kaggle per competition and is never committed or
redistributed — the repo ships download tooling and derived metadata only. Regenerating all
1,680 agent runs costs roughly $280 in API spend at the prices we measured (the one-shot LLM
sweep alone is about $23); per-run telemetry (tokens, tool calls, estimated cost) is recorded in
each output artifact. Reports and figures regenerate from local outputs with make aggregate
(the target refuses to run if raw outputs are absent, so a fresh clone can't clobber the
committed results).
make test runs ~170 unit tests over the deterministic core: bundle loading, action-bank
filtering and visibility policy, dataset profilers, context building, prediction validation,
output-artifact writing, each runner's orchestration (with the LLM client faked), and the
evaluator's scoring math. CI runs lint (ruff), type checking (pyright), both test suites, and
the artifact validators on every push and pull request.
agent/ Agent implementations, shared infra, prompts, profilers, tests
agentic_core/ Shared agentic loop, tool runtime, tracing
llm/ Anthropic client, auth, pricing/telemetry
<runner>/ rule_based, single_llm, generic_agent, claude_code, proposed_agent
data/schema/ JSON schemas + canonical action vocabulary
data/tasks/<slug>/ Task bundle: task.json, candidate_actions.json, testcases/, human_baseline/
data/collector/ Notebook-collection tooling used during task authoring
eval/ eval.py, aggregate.py, validators, generated results
results/benchmarks/ Per-task aggregate reports (primary + all scopes)
results/presentation/ Benchmark-wide CSVs, SVG figures, HTML dashboard
What is and isn't committed (raw data, run outputs, traces) is spelled out in ARTIFACT_POLICY.md.
- Ground truth is distilled, not executed. Gold actions come from human review of top-scoring public Kaggle notebooks, mapped onto a canonical action vocabulary. That makes scoring deterministic and reproducible, but inherits curation judgment: a defensible action missing from the bank scores as a false positive. The action-bank visibility policy (agent/action_bank.py) at least guarantees every system sees the same constrained menu.
- Action-level scoring, not outcome-level. The benchmark never trains models, so it can't credit an unconventional pipeline that happens to work. That's a deliberate trade: it makes 1,700 runs affordable and the metric stable, at the cost of treating the distilled workflow as canonical.
- The proposed agent's loss is informative, and we report it. Its two-pass design helped exactly where designed (best TC2 score, tied-best add F1) and its rollback reconciliation over-removes. We kept the result rather than tuning until the headline flattered the custom system.
- Human reference covers TC1 only (annotators built from-scratch baselines; n/a cells in the heatmap). It bounds add-side performance but not repair scenarios.
- Stochastic agents ran 5 tries per task–scenario; reported numbers are group means. Token and cost figures come from per-run telemetry and exclude the human and rule-based rows, which consume no API.
- Evaluation design is the product. The constrained action bank — fixed menu, add/remove edits, deterministic scoring — is what made comparing six heterogeneous systems possible at all. Open-ended "improve this pipeline" formulations had no defensible ground truth.
- Spend your tokens on context, then on autonomy. Precomputing dataset profiles and action summaries into a structured prompt beat giving agents tools to discover the same facts, on both cost and accuracy.
- Removal is a different skill from addition. Every system, including the strongest commercial agent, finds proposing good actions far easier than identifying which existing actions are harmful. Benchmarks that only test from-scratch generation overstate agent ability on real, messy workflows.
- Schema-validate everything. 1,700 runs produced across multiple machines and six systems stayed mergeable because every artifact validates against JSON schemas in CI before it lands.
MIT — see LICENSE.