An autonomous agent architecture that learns as an organization —
accumulating and transferring experience across runs, models, and task types.
The next leap in autonomous AI isn't a bigger model — it's a smarter organization.
- Co-evolving evaluation — No human-written criteria, no manual checkpoints. Evaluation standards are discovered by an independent agent and co-evolve with execution — fully autonomous from start to stop.
- Method isolation — The Evaluator and Planner cannot see each other's code. Physical workspace separation enforces audit independence.
- Knowledge evolution — After each run, agents extract transferable lessons. The next team inherits accumulated organizational wisdom.
- Cross-model transfer — A weaker model with a stronger model's knowledge converges 1.8x faster at 45% lower cost, arriving at the same answer three times independently.
- Dual-agent, composable — The Evaluator–Planner pair is the minimal building block. Multiple pairs can run in parallel across different tasks, feeding into a shared knowledge base.
You ask an AI agent to do an open-ended task. It works for a while, declares victory, reports 100% complete. It found 15% of what exists.
We call this denominator blindness — the agent's numerator may be accurate, but it never discovered the denominator. Every current agent framework lets the agent grade its own work, and none of them catch this.
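As a rough illustration of that gap, the sketch below computes coverage against two denominators: the one the agent happened to find and the one that actually exists. The item counts are invented for illustration, chosen only so that the actual figure lands near the 15.9% reported in the results table. `coverage` is a hypothetical helper, not part of Forage.

```python
def coverage(found: int, denominator: int) -> float:
    """Return found/denominator as a percentage."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 100.0 * found / denominator


# Hypothetical numbers: the agent collected 50 items and believes that is everything.
found = 50
believed_total = 50   # the denominator the agent discovered on its own
true_total = 315      # the denominator an independent evaluator might discover

self_reported = coverage(found, believed_total)
actual = coverage(found, true_total)

print(f"self-reported: {self_reported:.1f}%")  # self-reported: 100.0%
print(f"actual:        {actual:.1f}%")         # actual:        15.9%
```

The numerator is identical in both calls. Only the denominator differs, which is why self-grading cannot surface the problem.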
Forage doesn't make individual agents stronger. It designs institutions — audit separation, contract protocols, organizational memory — that make ordinary agents reliable.
One explores (the Planner), one maps (the Evaluator). They can't see each other's code — like an auditor who can't read the books they're auditing. The Evaluator doesn't check against a pre-written rubric. It discovers what "complete" means by independently exploring the problem space. Both evolve together.
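One way to picture the co-evolution is a toy loop in which the Evaluator's estimate of "complete" keeps growing while the Planner collects, and the run stops only once that estimate has stabilized and coverage is high. This is a simplified sketch under assumed names and stopping rules (`Evaluator.explore`, `run`, `target` are all hypothetical), not the logic in Forage's own loop.

```python
from dataclasses import dataclass, field


@dataclass
class Evaluator:
    """Maps the problem space independently; its denominator grows over rounds."""
    universe: list[str]                      # stand-in for what exploration can reveal
    known: set[str] = field(default_factory=set)

    def explore(self, batch: int) -> bool:
        """Discover up to `batch` new items; return True if the denominator changed."""
        remaining = [x for x in self.universe if x not in self.known]
        self.known.update(remaining[:batch])
        return bool(remaining[:batch])

    def score(self, collected: set[str]) -> float:
        """Coverage of the Planner's output against the current denominator."""
        return len(collected & self.known) / len(self.known) if self.known else 0.0


def run(universe: list[str], max_rounds: int = 10, target: float = 0.95):
    evaluator = Evaluator(universe)
    collected: set[str] = set()
    todo = list(reversed(universe))          # the Planner explores in its own order
    for round_no in range(1, max_rounds + 1):
        changed = evaluator.explore(batch=4)     # Evaluator refines "complete"
        collected.update(todo[:3])               # Planner collects three more items
        todo = todo[3:]
        cov = evaluator.score(collected)
        # Stop only when the denominator is stable AND coverage meets the target.
        if not changed and cov >= target:
            return round_no, cov
    return max_rounds, evaluator.score(collected)


rounds, final_cov = run([f"item{i}" for i in range(10)])
print(rounds, final_cov)  # 4 1.0
```

Requiring a stable denominator before stopping is the point of the sketch: high coverage against a denominator that is still moving says little about completeness.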
After each run, both agents independently write down what they learned. The next team reads the notebook before heading out. Over six runs, the organization accumulates 54 knowledge entries — which sources are reliable, what pitfalls exist, how the domain is structured.
A weaker model, given a stronger model's accumulated knowledge, doesn't need to rediscover what the stronger model already knew.
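A minimal sketch of such a notebook, assuming a single JSON file as the store: each agent appends its lessons after a run, and the next team receives them as a plain-text briefing. The file name, entry fields, and function names are assumptions made for illustration, and the lessons are invented. Forage's actual knowledge format may differ.

```python
import json
import tempfile
from pathlib import Path


def append_lessons(kb_path: Path, run_id: str, role: str, lessons: list[str]) -> None:
    """Append one agent's post-mortem lessons to the shared knowledge file."""
    entries = json.loads(kb_path.read_text(encoding="utf-8")) if kb_path.exists() else []
    entries.extend({"run": run_id, "role": role, "lesson": text} for text in lessons)
    kb_path.write_text(json.dumps(entries, indent=2), encoding="utf-8")


def load_briefing(kb_path: Path) -> str:
    """Render accumulated lessons as a plain-text briefing for the next team."""
    if not kb_path.exists():
        return ""
    entries = json.loads(kb_path.read_text(encoding="utf-8"))
    return "\n".join(f"- [{e['run']}/{e['role']}] {e['lesson']}" for e in entries)


# Demo with made-up lessons; a temporary directory stands in for the knowledge base.
kb = Path(tempfile.mkdtemp()) / "knowledge.json"
append_lessons(kb, "run1", "evaluator", ["Cross-check counts against a second source."])
append_lessons(kb, "run1", "planner", ["Paginated listings hide items past page one."])

briefing = load_briefing(kb)
print(briefing)
```

Because the briefing is plain text, it can be handed to a different model than the one that wrote it, which is what cross-model transfer relies on.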
| | Without Forage | Forage V1 | Forage V2 |
|---|---|---|---|
| Self-reported coverage | 100% | — | — |
| Actual coverage | 15.9% | 98.8% | 99.7% |
| Knows when it's done | No | Yes | Yes |
| Learns across runs | No | No | Yes |
| Metric | Sonnet (cold start) | Sonnet (with Opus knowledge) | Improvement |
|---|---|---|---|
| Coverage | 93.1% | 98.6% | +5.5pp |
| Rounds to converge | 7.0 | 4.5 | 1.8x faster |
| Cost per run | $9.40 | $5.13 | 45% cheaper |
| Denominator agreement | Scattered (320–411) | Converged (266) | 3 runs, same answer |
| Task | Domain | Tool | What it tests |
|---|---|---|---|
| NVIDIA Desktop GPUs | Web scraping | Browser | Data collection at scale (265–411 candidates) |
| UniProt T2D Proteins | API queries | REST API | Tool generalization (28–30 candidates) |
| Q10 Mathematical Proof | Reasoning | Code execution | Non-collection task type |
| Q6 Mathematical Proof | Hard reasoning | Code execution | Capability boundary |
              ┌─────────────────────────────────────────────┐
Within a Run  │                                             │
              │  ┌───────────┐   shared   ┌───────────┐     │
              │  │ Evaluator │◄──────────►│  Planner  │     │
              │  │ (eval.py) │ artifacts  │(action.py)│     │
              │  └───────────┘            └───────────┘     │
              │        ✗ no mutual code visibility ✗        │
              └─────────────────────────────────────────────┘
                                     │
                                post-mortem
                                     ▼
              ┌─────────────────────────────────────────────┐
Across Runs   │               Knowledge Base                │
              │  ┌───────┐  ┌───────┐       ┌───────┐       │
              │  │ Run 1 │  │ Run 2 │  ...  │ Run N │       │
              │  └───────┘  └───────┘       └───────┘       │
              │                      │                      │
              │        transfer ───► Sonnet (seeded)        │
              └─────────────────────────────────────────────┘
Method isolation is the core invariant. The Evaluator writes eval.py (how to measure); the Planner writes action.py (how to execute). Neither can read the other's script. They coordinate through a public eval_contract.md — like an auditor's terms of engagement. V2 enforces this through physical workspace separation — each agent runs in its own directory with no access to the other's files.
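The directory layout can be sketched as follows, assuming the `eval_ws/` and `plan_ws/` names from the repository tree and a contract file placed beside them (that location is an assumption). `resolve_in_workspace` and `WorkspaceViolation` are hypothetical names. The guard is an application-level path check for illustration only, not a sandbox and not necessarily how Forage enforces separation.

```python
import tempfile
from pathlib import Path


class WorkspaceViolation(Exception):
    """Raised when an agent asks for a file outside its own workspace."""


def resolve_in_workspace(workspace: Path, requested: str, contract: Path) -> Path:
    """Resolve `requested` relative to `workspace`, allowing only files inside
    that workspace or the shared contract file."""
    target = (workspace / requested).resolve()
    if target == contract.resolve() or target.is_relative_to(workspace.resolve()):
        return target
    raise WorkspaceViolation(f"{requested!r} is outside {workspace.name}/")


# Build a throwaway layout: two private workspaces plus one public contract.
root = Path(tempfile.mkdtemp()).resolve()
eval_ws, plan_ws = root / "eval_ws", root / "plan_ws"
eval_ws.mkdir()
plan_ws.mkdir()
contract = root / "eval_contract.md"
contract.write_text("# Public contract\n", encoding="utf-8")
(eval_ws / "eval.py").write_text("# how to measure\n", encoding="utf-8")

# The Planner may touch its own files and the public contract...
own_file = resolve_in_workspace(plan_ws, "action.py", contract)
shared = resolve_in_workspace(plan_ws, "../eval_contract.md", contract)

# ...but a request for the Evaluator's script is rejected.
try:
    resolve_in_workspace(plan_ws, "../eval_ws/eval.py", contract)
    blocked = False
except WorkspaceViolation:
    blocked = True

print(blocked)  # True
```

Resolving the path before checking it matters: a naive string-prefix check would let `../eval_ws/eval.py` through.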
Fog clears as the team explores. The explorer ventures out; the cartographer holds the camp.
Expedition management, team roster, knowledge assets.
V1 Expedition → Two agents establish credible judgment
V2 Organization → Experience accumulates and transfers ← you are here
V3 Basecamp → A camp manager allocates resources dynamically
V4 Highway → Verified routes crystallize into reusable pipelines
V1 solved the single-run problem: how do you know the agent actually finished? (code)
V2 solves the multi-run problem: how does the organization learn? (report)
V3 will add a camp manager that dynamically allocates resources — adjusting turn budgets, swapping models, and curating the knowledge base based on accumulated experience.
V4 will crystallize verified routes into reusable pipelines — today's expedition, tomorrow's highway.
This release contains the core experiment framework described in the V2 paper — the dual-agent loop, method isolation, knowledge evolution, and cross-model transfer. Under active development:
- Basecamp UI and run visualization
- Multi-provider agent integration
- Sandbox isolation and security hardening
- Camp manager (V3)
Built in spare time by a solo researcher — contributions and feedback are welcome.
# Prerequisites: Python 3.11+, Claude Code CLI (https://claude.ai/code)
pip install -e .
# Run a single task
forage run tasks/nvidia_gpu.yaml
# Run a 6-run learning curve experiment
forage experiment tasks/nvidia_gpu.yaml
forage/
├── agents/ # Evaluator, Planner, Executor implementations
│ ├── evaluator.py # Discovers what "complete" means
│ ├── planner.py # Decides how to execute
│ └── executor.py # Runs agent scripts (non-LLM)
├── core/
│ ├── loop.py # Multi-round Evaluator→Planner→Executor loop
│ ├── workspace.py # Physical isolation (eval_ws/ ↔ plan_ws/)
│ ├── knowledge.py # Post-mortem extraction & knowledge accumulation
│ ├── trajectory.py # Per-round state tracking
│ └── spec.py # Task spec loader (YAML)
└── experiments/
    ├── runner.py # Multi-run experiment orchestration
    ├── learning_curve.py # Cross-run learning analysis
    └── single_agent.py # Baseline: single agent without Forage
tasks/ # Task specifications (YAML)
tests/ # 72 tests
For the full methodology, experiments, and analysis, see the arXiv paper, the project page, or download the PDF.
@article{xie2026forage,
title={Forage: Knowledge Evolution and Cross-Model Transfer
in Autonomous Agent Organizations},
author={Xie, Huaqing},
journal={arXiv preprint arXiv:2604.19837},
year={2026},
url={https://arxiv.org/abs/2604.19837}
}
Huaqing Xie: xhq422986742@gmail.com · huaqing.xie@foxmail.com
MIT

