docs(experiments): archive 9 rounds of A/B agent experiments #13
Merged
Adds the empirical validation work that drove PRs #6 through #12 into the repo as a reproducibility archive under `experiments/`.

Contents:

- `comparison-report.md` (1199 lines, rolling Round 1 → Round 9 analysis). Documents the three Aegis ROI mechanisms surfaced from data:
  1. Rule-hit → fix (brownfield Plan B: 0/3 → 3/3 across 3 models)
  2. Structural guardrail (cycle / public_symbol_removed: 0/14 hits, dead weight on clean architectures)
  3. Anti-paralysis ritual (weak models complete tasks they would otherwise abandon)
- 4 starting-code fixtures (Python brownfield, Go brownfield, Java brownfield, Python multi-module).
- 11 prompt files. Each task has paired `-a.txt` (no Aegis) and `-b.txt` (with Aegis MCP + REQUIRED-workflow ritual instruction).
- 52 round directories: per-model deliverables + `run.log` for codex-driven rounds. Naming: `<model>-<task>-<a|b>`.
- `aegis_validate.py`: Python wrapper around aegis-mcp stdio JSON-RPC, used by agents to run validation after each write.
- 3 eval scripts that diff each round's deliverables against the planted SEC bugs.

Excluded via `.gitignore` and rsync filter (would have been ~970MB of bloat): venv / .venv / __pycache__ / .pytest_cache / .toolchain (Go toolchain copies codex downloads) / compiled binaries / git metadata of nested repos. Final archive: 17MB.

Direct lineage from this archive into Aegis code:

| Round | Surfaced | Fixed in |
|---|---|---|
| Round 8 codex | SEC010 FP on `secrets.choice` | PR #9 |
| Round 9 Go / Java | SEC009 multi-language coverage = 0 | PR #12 |
| Round 9 Java | SEC010 inner-block `break` hid production case | PR #11 |

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Adds the empirical validation work that drove PRs #6 through #12 into
the repo as a reproducibility archive under `experiments/`. The
archive lives alongside the source code so changes to SEC rules
can reference the experiment runs that motivated them.
Size: 17MB after `.gitignore` + rsync filter pruned ~970MB of
build artefacts (venv / `__pycache__` / Go toolchain copies / etc.).
Contents
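- `comparison-report.md` (1199 lines, rolling Round 1 → Round 9 analysis). Documents the three Aegis ROI mechanisms surfaced from data:
  1. Rule-hit → fix (brownfield Plan B: 0/3 → 3/3 across 3 models)
  2. Structural guardrail (cycle / `public_symbol_removed`: 0/14 hits, dead weight on clean architectures)
  3. Anti-paralysis ritual (weak models complete tasks they would otherwise abandon; Round 8 Plan C surfaced this)
- 4 starting-code fixtures: Python brownfield, Go brownfield, Java brownfield, Python multi-module.
- 11 prompt files. Each task has paired `-a.txt` (no Aegis) and `-b.txt` (with Aegis MCP + REQUIRED-workflow ritual).
- 52 round directories: per-model deliverables + `run.log` for codex-driven rounds. Naming: `<model>-<task>-<a|b>`.
- `aegis_validate.py`: Python wrapper around aegis-mcp stdio JSON-RPC, used by agents to run validation after each write.
- 3 eval scripts that diff each round's deliverables against the planted SEC bugs.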
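For readers unfamiliar with the stdio JSON-RPC transport that `aegis_validate.py` wraps, here is a minimal sketch of the message framing only. It is not the archived script. The tool name `validate` and the `path` argument used in the usage note are hypothetical placeholders, since this PR does not list the tools aegis-mcp exposes. The `tools/call` method and newline-delimited framing follow the general MCP stdio convention.

```python
import json
from itertools import count

# JSON-RPC request ids must be unique within one connection.
_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize one JSON-RPC 2.0 request as a single newline-terminated line.

    tool_name and arguments are caller-supplied; the names used in the
    usage note are hypothetical, not the real aegis-mcp tool surface.
    """
    request = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",  # generic MCP tool-invocation method
        "params": {"name": tool_name, "arguments": arguments},
    }
    # json.dumps escapes embedded newlines, so the message stays on one line.
    return json.dumps(request, separators=(",", ":")) + "\n"


def parse_response(line: str) -> dict:
    """Decode one response line and return its result, raising on errors."""
    message = json.loads(line)
    if message.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 message")
    if "error" in message:
        err = message["error"]
        raise RuntimeError(f"JSON-RPC error {err.get('code')}: {err.get('message')}")
    return message["result"]
```

A wrapper would write `build_tool_call("validate", {"path": "app.py"})` to the server's stdin and pass each stdout line to `parse_response`. An MCP server also expects an `initialize` handshake before any tool call, which this sketch omits.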
Why merge into the main repo (not a separate one)
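Direct lineage. Every recent SEC PR ties back to a specific
experiment finding:

| Round | Surfaced | Fixed in |
|---|---|---|
| Round 8 codex | SEC010 FP on `secrets.choice` | PR #9 |
| Round 9 Go / Java | SEC009 multi-language coverage = 0 | PR #12 |
| Round 9 Java | SEC010 inner-block `break` hid production case | PR #11 |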
Keeping these in the same git tree means a regression in SEC010
two months from now can be diff'd against the Round 8 evidence
without leaving the repo.
Excluded (via .gitignore + rsync filter)
`venv/`, `.venv/`, `__pycache__/`, `.pytest_cache/`, `.toolchain/`
(Go toolchain copies codex downloads per run, ~30MB each), `.exe`,
`.out`, `a.out`, compiled Go binaries, nested `.git/` directories
codex created inside some round dirs, `go.sum`.
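
The exclusion list above can be approximated in a few lines of Python, for example to audit an existing copy of the archive for leaked artefacts. This is an illustrative sketch, not the actual `.gitignore` or rsync filter. The helper name `is_excluded` is hypothetical, and extensionless compiled Go binaries cannot be detected by name alone, so they are not covered here.

```python
from pathlib import PurePosixPath

# Directory names taken from the exclusion list above.
EXCLUDED_DIRS = {"venv", ".venv", "__pycache__", ".pytest_cache", ".toolchain", ".git"}
# Exact file names and file suffixes taken from the same list.
EXCLUDED_NAMES = {"a.out", "go.sum"}
EXCLUDED_SUFFIXES = {".exe", ".out"}


def is_excluded(relative_path: str) -> bool:
    """Return True if a path (relative to experiments/) matches the list.

    Hypothetical helper: it mirrors the prose list, not the real filter
    rules, and does not catch extensionless compiled binaries.
    """
    path = PurePosixPath(relative_path)
    # Any path component naming an excluded directory excludes the whole path.
    if any(part in EXCLUDED_DIRS for part in path.parts):
        return True
    return path.name in EXCLUDED_NAMES or path.suffix in EXCLUDED_SUFFIXES
```

Running `is_excluded` over every path in the tree and reporting the hits gives a quick check that nothing from the list slipped through.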
Test plan
archive)
`run.log` files (no compiled binaries leaked through)
🤖 Generated with Claude Code