docs(experiments): archive 9 rounds of A/B agent experiments #13

Merged
wei9072 merged 1 commit into main from feat/experiments-archive on May 7, 2026

@wei9072 commented May 7, 2026

Summary

Adds the empirical validation work that drove PRs #6 through #12 into
the repo as a reproducibility archive under `experiments/`. The
archive lives alongside the source code so changes to SEC rules
can reference the experiment runs that motivated them.

Size: 17MB after `.gitignore` + rsync filter pruned ~970MB of
build artefacts (venv / pycache / Go toolchain copies / etc.).

Contents

  • `comparison-report.md` (1199 lines) — rolling Round 1 →
    Round 9 analysis. Documents the three Aegis ROI mechanisms
    surfaced from data:
    1. Rule-hit → fix (brownfield Plan B: 0/3 → 3/3 across 3 models)
    2. Structural guardrail (cycle / public_symbol_removed — 0/14
      hits, dead weight on clean architectures)
    3. Anti-paralysis ritual (weak models complete tasks they would
      otherwise abandon — Round 8 Plan C surfaced this)
  • 4 `starting-*` fixtures: Python brownfield, Go brownfield,
    Java brownfield, Python multi-module.
  • 11 prompts — paired `-a.txt` (no Aegis) / `-b.txt` (with Aegis
    MCP + REQUIRED-workflow ritual).
  • 52 round directories — per-model deliverables + `run.log` for
    codex-driven rounds. Naming: `<model>-<task>-<a|b>`.
  • `aegis_validate.py` — Python wrapper around aegis-mcp stdio
    JSON-RPC, used by agents to run validation after each write.
  • 3 eval scripts that diff each round's deliverables against the
    planted SEC bugs.
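
For readers unfamiliar with the stdio JSON-RPC transport that `aegis_validate.py` wraps, the sketch below shows how a single request line might be framed and how a reply might be unpacked. It is not the archived script. The JSON-RPC 2.0 envelope and the MCP `tools/call` method come from the public specifications, but the tool name `validate_file` and its `path` argument are hypothetical placeholders, since this description does not list the tools aegis-mcp exposes.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Frame one MCP `tools/call` request as a single newline-terminated JSON line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # Compact separators keep the message on one line, as the stdio transport expects.
    return json.dumps(message, separators=(",", ":")) + "\n"


def parse_response(line):
    """Return the `result` member of a JSON-RPC reply, or raise if it carries an `error`."""
    reply = json.loads(line)
    if "error" in reply:
        raise RuntimeError(reply["error"].get("message", "unknown JSON-RPC error"))
    return reply["result"]


# Hypothetical tool name and argument, for illustration only.
request_line = build_tool_call(1, "validate_file", {"path": "app/auth.py"})
```

A real wrapper would write `request_line` to the server's stdin and read one line back from its stdout. That plumbing is omitted here because it depends on how aegis-mcp is launched.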

Why merge into the main repo (not a separate one)

Direct lineage. Every recent SEC PR ties back to a specific
experiment finding:

| Round | Discovered | Fixed in |
|---|---|---|
| Round 8 codex | SEC010 false-positive on `secrets.choice` | PR #9 |
| Round 9 Go / Java | SEC009 multi-language coverage = 0 | PR #12 |
| Round 9 Java | SEC010 inner-block `break` hid production case | PR #11 |

Keeping these in the same git tree means a regression in SEC010
two months from now can be diff'd against the Round 8 evidence
without leaving the repo.

Excluded (via .gitignore + rsync filter)

`venv/`, `.venv/`, `__pycache__/`, `.pytest_cache/`, `.toolchain/`
(Go toolchain copies codex downloads per run, ~30MB each), `*.exe`,
`*.out`, `a.out`, compiled Go binaries, nested `.git/` directories
codex created inside some round dirs, `go.sum`.
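
The exclusion list above can be read as a path predicate. The sketch below is one possible reading and not the actual `.gitignore` or rsync filter, neither of which is reproduced in this description. The glob forms `*.exe` and `*.out` are an assumption about the intended patterns, `is_excluded` is a hypothetical helper name, and extensionless compiled Go binaries are deliberately left out because they cannot be recognised by name alone.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Directory names from the exclusion list; any path passing through one is dropped.
EXCLUDED_DIRS = {"venv", ".venv", "__pycache__", ".pytest_cache", ".toolchain", ".git"}
# File-name patterns from the exclusion list (glob forms are assumed).
EXCLUDED_FILES = ["*.exe", "*.out", "go.sum"]


def is_excluded(relative_path):
    """Return True if a POSIX-style relative path falls under the exclusion list."""
    parts = PurePosixPath(relative_path).parts
    if not parts:
        return False
    if any(part in EXCLUDED_DIRS for part in parts):
        return True
    return any(fnmatchcase(parts[-1], pattern) for pattern in EXCLUDED_FILES)
```

Matching on path components rather than substrings keeps a file such as `venv_notes.md` from being dropped by accident.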

Test plan

  • `du -sh experiments` returns 17M (acceptable for a research
    archive)
  • `find experiments -size +500k -type f` returns only
    `run.log` files (no compiled binaries leaked through)
  • No nested `.git/` directories
  • `git ls-files experiments | wc -l` = 368
  • CI green on push
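
The two file-level checks in this test plan (large files other than `run.log`, and nested `.git/` directories) can also be expressed as a short script, which may be useful where GNU `find` is unavailable. This is a minimal sketch and not one of the archived eval scripts. `audit_archive` is a hypothetical helper, and the 500 KiB threshold mirrors `find -size +500k`.

```python
import os


def audit_archive(root, size_limit=500 * 1024):
    """Return (oversized_files, nested_git_dirs) found under `root`.

    A file counts as oversized when it is larger than `size_limit` bytes
    and is not named `run.log`.
    """
    oversized, nested_git = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            nested_git.append(os.path.join(dirpath, ".git"))
            dirnames.remove(".git")  # do not descend into git metadata
        for name in filenames:
            path = os.path.join(dirpath, name)
            # lstat avoids following (possibly broken) symlinks.
            if os.lstat(path).st_size > size_limit and name != "run.log":
                oversized.append(path)
    return oversized, nested_git
```

Both returned lists being empty corresponds to the second and third items of the test plan passing.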

🤖 Generated with Claude Code

Adds the empirical validation work that drove PRs #6 through #12 into
the repo as a reproducibility archive under `experiments/`.

Contents:

- `comparison-report.md` (1199 lines, rolling Round 1 → Round 9
  analysis). Documents the three Aegis ROI mechanisms surfaced
  from data:
    1. Rule-hit → fix (brownfield Plan B: 0/3 → 3/3 across 3 models)
    2. Structural guardrail (cycle / public_symbol_removed —
       0/14 hits, dead weight on clean architectures)
    3. Anti-paralysis ritual (weak models complete tasks they
       would otherwise abandon)
- 4 starting-code fixtures (Python brownfield, Go brownfield,
  Java brownfield, Python multi-module).
- 11 prompt files. Each task has paired `-a.txt` (no Aegis) and
  `-b.txt` (with Aegis MCP + REQUIRED-workflow ritual instruction).
- 52 round directories: per-model deliverables + `run.log` for
  codex-driven rounds. Naming: `<model>-<task>-<a|b>`.
- `aegis_validate.py` — Python wrapper around aegis-mcp stdio
  JSON-RPC, used by agents to run validation after each write.
- 3 eval scripts that diff each round's deliverables against the
  planted SEC bugs.
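
Because every round directory ends in `-a` (no Aegis) or `-b` (with Aegis), the A/B pairs can be recovered from the directory names alone. The sketch below groups names by everything before the final hyphen, so it never has to guess where the model name ends and the task name begins. The helper names and the example directory names are hypothetical and are not taken from the archive.

```python
from collections import defaultdict


def group_arms(dir_names):
    """Map each `<model>-<task>` prefix to the set of arms ('a', 'b') present."""
    arms = defaultdict(set)
    for name in dir_names:
        prefix, sep, arm = name.rpartition("-")
        if sep and prefix and arm in ("a", "b"):
            arms[prefix].add(arm)
    return dict(arms)


def unpaired(dir_names):
    """Return prefixes that lack either the control (a) or the Aegis (b) arm."""
    return sorted(p for p, seen in group_arms(dir_names).items() if seen != {"a", "b"})


# Hypothetical directory names, for illustration only.
example = ["modelx-brownfield-a", "modelx-brownfield-b", "modely-brownfield-a"]
missing = unpaired(example)
```

Here `missing` lists the prefixes whose comparison cannot be made because one arm is absent.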

Excluded via `.gitignore` and rsync filter (would have been ~970MB
of bloat): venv / .venv / __pycache__ / .pytest_cache / .toolchain
(Go toolchain copies codex downloads) / compiled binaries / git
metadata of nested repos. Final archive: 17MB.

Direct lineage from this archive into Aegis code:

| Round | Surfaced | Fixed in |
|---|---|---|
| Round 8 codex | SEC010 FP on `secrets.choice` | PR #9 |
| Round 9 Go / Java | SEC009 multi-language coverage = 0 | PR #12 |
| Round 9 Java | SEC010 inner-block `break` hid production case | PR #11 |

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@wei9072 wei9072 merged commit 93052de into main May 7, 2026
1 check passed
@wei9072 wei9072 deleted the feat/experiments-archive branch May 7, 2026 07:52