sandbox

Three open-weight models — Kimi K2.6, DeepSeek V4 Pro, MiniMax M2.7 — driven through opencode against the same Python spec, then blind-judged by each other. Hidden tests, structured scoring, peer-vs-expert delta when an expert tier is configured. opencode-only stack: every implementer and every judge is a model accessible through opencode, with a slug you control via bench/config.json.

Two things in one repo:

  1. sandbox.py — a single-file Python tool that wraps Podman (or Docker) to run commands inside ephemeral, network-isolated, resource-capped containers. The thing the models implement.

  2. bench/ — the framework that runs them through opencode, captures transcripts/diffs/test-output, and aggregates judgments into a single review.

The recursive joke: the first task wired up in the framework is implementing sandbox.py itself. The repo's flagship deliverable is also its first benchmark.
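
To make the sandbox.py contract concrete, here is a hedged sketch of the kind of container invocation it wraps; the exact flags, image, and CLI surface are up to each implementation:

# illustrative only: the spec's constraints mapped onto podman flags
# --rm             ephemeral: container removed on exit
# --network none   network-isolated
# --memory/--cpus  resource-capped (placeholder limits)
# the read-only mount and the image choice are assumptions, not part of the spec
podman run --rm --network none --memory 256m --cpus 1 \
  --volume "$PWD:/work:ro" --workdir /work \
  python:3.12-alpine python -c "print('hello from the sandbox')"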

Round 1 — superseded methodology

The round-1 artifacts (model run dirs and aggregated review) have been removed from the working tree. Round 1 used Claude Code and codex CLI for the expert tier — a different methodology than the opencode-only framework now in this repo. Re-running under the new setup produces numbers that aren't comparable with round 1 but will be comparable across future rounds, so the old artifacts were dropped to avoid mixed-methodology confusion.

The round-1 narrative writeup at blog/sandbox-2026-04-30.md is preserved as historical record. The original implementer slugs were:

Label       opencode slug
kimi        opencode-go/kimi-k2.6
deepseek    opencode-go/deepseek-v4-pro
minimax     opencode-go/minimax-m2.7

A fresh round under the new framework is pending. Until it lands, python3 demo.py reports "no review yet" — that's expected.

How bench/ works

You drop a spec in bench/tasks/<task>/SPEC.md, write a PROMPT.md the implementer model sees, and a hidden test suite the model never sees. Each model runs through opencode (or any harness) and produces a sandbox.py.

Then every implementation gets graded by a judge panel:

  • Expert tier (opt-in) — any opencode model not in your implementer set. Add labels to expert_judges in bench/config.json and put their slugs in the slugs map. Enables the peer-vs-expert delta.
  • Peer tier — every implementer becomes a judge of the other implementations. Blinded labels, no self-judging. Always on.

Judges return structured JSON against a fixed scoring schema, which is parsed directly into a single review document with: scoreboard split by tier, per-judge ranking, peer-vs-expert delta (self-bias check), inter-judge agreement (range, stdev), and objective hidden-test results alongside the medians.
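
The schema itself lives with each task's judge rubric; purely as an illustration of the shape (every field name here is hypothetical), a single judge's output might look like:

{
  "judge": "kimi",
  "tier": "peer",
  "scores": { "impl-A": 7.5, "impl-B": 6.0 },
  "ranking": ["impl-A", "impl-B"],
  "notes": "impl-A handles the resource caps more cleanly"
}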

Try it

See what the latest round produced without setting anything up:

python3 demo.py

Prints the scoreboard from bench/reviews/sandbox-<date>.md if a round has been captured. Currently reports "no review yet" — round 1 was under a different methodology and was dropped; a fresh round under the opencode-only framework is pending. The narrative writeup of round 1 is still at blog/sandbox-2026-04-30.md.

What a round will produce: a single review file with scoreboard, peer-vs-expert delta (when an expert tier is configured), inter-judge agreement, and per-impl detail.

What you'll spend: ~$0.15 in API calls (3 implementer runs + 3 peer judge runs through opencode; expert judges optional) and ~30–45 minutes of supervised choreography. Stdlib + pytest only on the local side; no requirements file.

Forking for your own three-way comparison

  1. Edit bench/config.json — replace the three implementer labels and their opencode slugs. Optionally add expert judges (any opencode model not in your implementer set) to enable the peer-vs-expert delta:

    {
      "implementers": ["yourmodel-a", "yourmodel-b", "yourmodel-c"],
      "expert_judges": [],
      "harness": "opencode",
      "slugs": {
        "yourmodel-a": "opencode-go/<provider-model-id>",
        "yourmodel-b": "opencode-go/<provider-model-id>",
        "yourmodel-c": "opencode-go/<provider-model-id>"
      }
    }

    Labels become <label>/ dirs at the repo root. Slugs are what opencode run passes via --model. To add an expert: list its label under expert_judges and put its slug in slugs.
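
    Roughly what the auto-drive path hands to opencode for one label (the real invocation lives in bench/scripts/; jq here is a convenience and an assumption):

    # hypothetical: resolve a label's slug from config, pass it via --model
    SLUG=$(jq -r '.slugs["yourmodel-a"]' bench/config.json)
    opencode run --model "$SLUG"   # plus whatever prompt/args start-run.sh supplies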

  2. Implementation phase — one model at a time:

    # auto-drive (recommended): opencode runs, then capture chains automatically
    bench/scripts/start-run.sh --auto sandbox kimi
    
    # or manual: open opencode in the worktree, set the model, paste PROMPT.md
    bench/scripts/start-run.sh sandbox kimi
    bench/scripts/capture-run.sh sandbox kimi

    capture-run.sh finds the opencode session for that worktree, exports it as JSON, populates meta.json (cost, tokens, model slug), runs the hidden tests, and archives everything under kimi/sandbox-<date>/. Repeat for the other two implementers.
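
    For orientation, the populated meta.json carries roughly this shape (the field names below are guesses, not the actual schema; only cost, tokens, and model slug are promised above):

    {
      "model": "opencode-go/kimi-k2.6",
      "cost_usd": 0.04,
      "tokens": { "input": 52000, "output": 9000 }
    }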

  3. Judgment phase — after all three implementations are captured:

    # auto-drive: every judge with a slug in config runs through opencode
    bench/scripts/start_judgments.py --auto sandbox
    
    # or manual:
    bench/scripts/start_judgments.py sandbox
    # → packets at bench/judgments/sandbox-<date>/<judge>/; drive each
    #   judge in its harness, write JSON to output/

    Any judge label without a slug in bench/config.json is left for manual driving — useful if you want to keep one tier on a different harness without breaking the rest of the auto-drive flow.

    bench/scripts/aggregate_judges.py sandbox
    # → bench/reviews/sandbox-<date>.md, the artifact you came for

One-shot end-to-end: bench/scripts/run-all.sh sandbox chains every implementer in config.json (sequential, --auto), then judgments, then aggregation. Per-model failures are non-fatal; a summary prints at the end. If your default branch is not main or master, set BASE_BRANCH=<branch> in the environment.
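
For example, a full round on a repo whose default branch is develop:

BASE_BRANCH=develop bench/scripts/run-all.sh sandbox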

  4. New task (later rounds): bench/scripts/new-task.sh <name> scaffolds the directory layout. Write SPEC.md, PROMPT.md, hidden tests/, and the judge rubric. The framework is task-agnostic; nothing about the scripts assumes Python.

Container runtime falls back from Podman to Docker. opencode-specific auto-capture activates only when opencode is on PATH; without it the flow degrades to a hand-saved transcript.md and partially populated meta.json.
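
A minimal sketch of that fallback, assuming nothing more than looking both binaries up on PATH (the real detection lives in the scripts and each sandbox.py):

# illustrative only
RUNTIME=$(command -v podman || command -v docker) || {
  echo "need podman or docker on PATH" >&2
  exit 1
}
"$RUNTIME" run --rm --network none alpine echo ok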

--auto runs opencode run --dangerously-skip-permissions. The model has full host filesystem access during the session, so any prompt or file in bench/tasks/<task>/ can be acted on. Trust the task content before using --auto. Recursive irony noted: this project builds a sandbox; auto-drive runs unsandboxed agents to do it. A future ROADMAP item is to run opencode itself inside sandbox.py once stable.

Status

  • Round 1 (2026-04-30) — superseded. Original artifacts dropped from the working tree because the methodology (separate Claude Code / codex CLI for the expert tier) doesn't match the opencode-only framework now in this repo. Narrative writeup preserved at blog/sandbox-2026-04-30.md.
  • Round 1.5 (pending) — same three implementer families (kimi, deepseek, minimax), same spec, but rerun fresh under the new opencode-only stack so the numbers are comparable across future rounds.
  • Round 2 (planned) — new spec extension (likely streaming output or persistent sessions). The competition is longitudinal: same contestants, evolving tool, round-over-round drift visible in each <model>/sandbox-<date>/ archive. See ROADMAP.md.

Each model gets a top-level dir at the repo root: deepseek/, kimi/, minimax/. Inside:

  • <model>/sandbox.py — the current implementation (latest captured round, copied up by capture-run.sh after each round).
  • <model>/<task>-<date>/ — round archive: sandbox.py snapshot, diff.patch, transcript.md, test-output.txt, meta.json.

Once round 2 lands, ls kimi/ shows kimi's history alongside the current code, and diff kimi/sandbox-2026-04-30/sandbox.py kimi/sandbox.py is the drift in one command.

Limitations

This is a small, opinionated framework — not a definitive leaderboard. Read results accordingly:

  • n=1 task, n=1 run per cell. One ~100-LOC Python tool, implemented once per model. Standard ML benchmarks repeat 3+ times to bound sampling noise. Don't generalize to "model X is better at code" from one cell.
  • Harness-locked. Implementations and (going forward) judges all driven via opencode. A different harness — Cursor, Aider, Claude Code — would surface different behaviour. Rankings are harness-conditioned. Round 1 used Claude Code and codex CLI for the expert tier, hence the claude/ and codex/ packet directories under that round's judgments.
  • Contained, not real. A frozen single-file spec is a clean signal but a narrow one. Real code work is multi-file, ambiguously specified, and full of legacy. Don't extrapolate to production-codebase agent loops.
  • Pretraining overlap. Argparse + subprocess + Podman flags are in every model's training data. This is a follow-a-clear-contract bench, not a novel-reasoning bench.
  • Judge panel is a parameter. A different set of five judges would likely produce a different consensus. Inter-judge variance is reported in the review so you can see it.
  • Self-reported costs. Token / cost numbers are hand-pulled from each provider's dashboard, not auto-captured. Cost-per-passing-test is directional.

The framework is built to be re-run cheaply when new models drop. If you want statistical rigour, run it yourself with n≥3 and your own harness.

Repo layout

PLAN.md                       # high-level plan for sandbox.py
CLAUDE.md                     # working notes for AI agents in this repo
LICENSE
README.md
deepseek/                     # per-model dir at repo root
├── sandbox.py                # current implementation (latest round)
└── sandbox-<date>/           # round archive: sandbox.py + diff + transcript + test output + meta
kimi/
├── sandbox.py
└── sandbox-<date>/
minimax/
├── sandbox.py
└── sandbox-<date>/
blog/                         # public-facing curated writeups (one per round)
└── sandbox-<date>.md         # human-readable narrative; linked from personal blog
bench/                        # framework only
├── README.md                 # framework docs
├── tasks/<task>/             # frozen spec + hidden tests + judge rubric
├── judgments/                # per-judge packets (gitignored — review.md is the signal)
├── reviews/                  # aggregated multi-judge reviews (auto-generated)
└── scripts/                  # start-run, capture-run, start_judgments, aggregate_judges

Why this exists

Vibes-based model picking is the default. Someone writes a thread, you try a model, you have an opinion. New version drops, you start over. There's no comparable artifact next time someone asks "is Kimi K2 actually any good for code?"

The smallest framework that gives a frozen contract, hidden tests, and multi-judge grading — including peers — is what's here.

Adding your own task

bench/scripts/new-task.sh <name> scaffolds bench/tasks/<name>/ with the five files a task needs:

  • PROMPT.md — what the implementer sees. Keep test details out.
  • SPEC.md — the frozen contract. Changing it invalidates prior runs.
  • tests/ — hidden tests. Copied into the worktree only at capture.
  • JUDGE_PROMPT.md + JUDGE_RUBRIC.md — what judges see and fill in.
  • rubric.md — long-form scoring sheet for human reviewers.
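
For instance (the task name here is hypothetical), scaffolding a second task and checking the layout:

bench/scripts/new-task.sh streaming-sandbox
ls bench/tasks/streaming-sandbox/
# PROMPT.md  SPEC.md  tests/  JUDGE_PROMPT.md  JUDGE_RUBRIC.md  rubric.md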

The framework is task-agnostic. The first task is Python; nothing about the scripts assumes Python.

License

MIT — see LICENSE.
