GitHub - Sariel2018/forage-v2: An autonomous agent architecture that learns as an organization — accumulating and transferring experience across runs, models, and task types.

Your basecamp for the unknown.

An autonomous agent architecture that learns as an organization —
accumulating and transferring experience across runs, models, and task types.

The next leap in autonomous AI isn't a bigger model — it's a smarter organization.

Highlights

Co-evolving evaluation — No human-written criteria, no manual checkpoints. Evaluation standards are discovered by an independent agent and co-evolve with execution — fully autonomous from start to stop.
Method isolation — The Evaluator and Planner cannot see each other's code. Physical workspace separation enforces audit independence.
Knowledge evolution — After each run, agents extract transferable lessons. The next team inherits accumulated organizational wisdom.
Cross-model transfer — A weaker model with a stronger model's knowledge converges 1.8x faster at 45% lower cost, arriving at the same answer three times independently.
Dual-agent, composable — The Evaluator–Planner pair is the minimal building block. Multiple pairs can run in parallel across different tasks, feeding into a shared knowledge base.

The problem

You ask an AI agent to do an open-ended task. It works for a while, declares victory, and reports 100% complete. In fact it found about 16% of what exists (15.9% in the baseline measured under Key results).

We call this denominator blindness — the agent's numerator may be accurate, but it never discovered the denominator. Every current agent framework lets the agent grade its own work, and none of them catch this.
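To make the distinction concrete, here is a minimal sketch of the two ways coverage can be computed. The item names and counts are hypothetical and chosen only for illustration; they are not the benchmark data. The `coverage` helper is likewise an assumption, not part of the Forage codebase.

```python
def coverage(found: set[str], universe: set[str]) -> float:
    """Fraction of `universe` that appears in `found`."""
    if not universe:
        raise ValueError("universe must not be empty")
    return len(found & universe) / len(universe)


# Hypothetical numbers, for illustration only.
true_universe = {f"item-{i}" for i in range(100)}  # what really exists
agent_found = {f"item-{i}" for i in range(15)}     # what the agent collected
agent_believed_universe = set(agent_found)         # the agent never looked further

# The agent grades itself against the only denominator it knows.
self_reported = coverage(agent_found, agent_believed_universe)
# An independent party grades the same numerator against the real denominator.
actual = coverage(agent_found, true_universe)

print(f"self-reported: {self_reported:.0%}, actual: {actual:.0%}")
```

The numerator is identical in both calls; only the denominator differs, which is why self-grading cannot reveal the gap.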

What Forage does differently

Forage doesn't make individual agents stronger. It designs institutions — audit separation, contract protocols, organizational memory — that make ordinary agents reliable.

Two agents, not one

One explores (the Planner), one maps (the Evaluator). Neither can see the other's code, much as an auditor stays independent of the people who keep the books. The Evaluator doesn't check against a pre-written rubric; it discovers what "complete" means by independently exploring the problem space. Both evolve together.
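The interplay described above can be sketched as a toy simulation. Everything here is an assumption made for illustration: the class names mirror the roles, but the batch sizes, the stopping rule (the Evaluator's estimate has stopped changing and coverage against it meets a target), and the `run` function are hypothetical and are not taken from `forage/core/loop.py`.

```python
from dataclasses import dataclass, field


@dataclass
class Evaluator:
    """Maps the problem space; its denominator estimate grows as it explores."""
    hidden_universe: list[str]
    known: set[str] = field(default_factory=set)

    def explore(self, batch: int) -> bool:
        """Discover up to `batch` new candidates; return True if the estimate changed."""
        unseen = [x for x in self.hidden_universe if x not in self.known]
        self.known.update(unseen[:batch])
        return bool(unseen[:batch])

    def score(self, collected: set[str]) -> float:
        """Coverage of the Planner's output against the current estimate."""
        return len(collected & self.known) / len(self.known) if self.known else 0.0


@dataclass
class Planner:
    """Collects items; it never sees how the Evaluator measures."""
    hidden_universe: list[str]
    collected: set[str] = field(default_factory=set)

    def act(self, batch: int) -> None:
        missing = [x for x in self.hidden_universe if x not in self.collected]
        self.collected.update(missing[:batch])


def run(universe: list[str], target: float = 0.95, max_rounds: int = 20) -> tuple[int, float]:
    evaluator, planner = Evaluator(universe), Planner(universe)
    cov = 0.0
    for round_no in range(1, max_rounds + 1):
        changed = evaluator.explore(batch=4)
        planner.act(batch=3)
        cov = evaluator.score(planner.collected)
        # Stop only when the denominator is stable AND coverage is high enough.
        if not changed and cov >= target:
            return round_no, cov
    return max_rounds, cov


rounds, final_cov = run([f"c{i}" for i in range(10)])
print(rounds, final_cov)
```

In this toy setup the loop cannot stop early on a high score alone, because a score measured against a denominator that is still growing says little about completeness.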

The organization remembers

After each run, both agents independently write down what they learned. The next team reads the notebook before heading out. Over six runs, the organization accumulates 54 knowledge entries — which sources are reliable, what pitfalls exist, how the domain is structured.
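One way to picture the notebook is as an append-only file of lessons that each run extends and the next run reads. The sketch below assumes a JSON Lines file and a three-field entry; the `KnowledgeEntry` schema, the file name, and the helper functions are hypothetical and do not describe the actual format used by `forage/core/knowledge.py`.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class KnowledgeEntry:
    run_id: int
    author: str  # "evaluator" or "planner"
    lesson: str


def append_entries(path: Path, entries: list[KnowledgeEntry]) -> None:
    """Post-mortem step: append this run's lessons without touching older ones."""
    with path.open("a", encoding="utf-8") as fh:
        for entry in entries:
            fh.write(json.dumps(asdict(entry)) + "\n")


def load_entries(path: Path) -> list[KnowledgeEntry]:
    """Pre-run step: read everything earlier teams wrote down."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [KnowledgeEntry(**json.loads(line)) for line in fh if line.strip()]


def briefing(entries: list[KnowledgeEntry]) -> str:
    """Render the inherited lessons as text a new agent could be seeded with."""
    return "\n".join(f"- [run {e.run_id}, {e.author}] {e.lesson}" for e in entries)


with tempfile.TemporaryDirectory() as tmp:
    kb = Path(tmp) / "knowledge.jsonl"
    append_entries(kb, [KnowledgeEntry(1, "evaluator", "hypothetical lesson A")])
    append_entries(kb, [KnowledgeEntry(2, "planner", "hypothetical lesson B")])
    inherited = load_entries(kb)

print(briefing(inherited))
```

Because the briefing is plain text, the same file could seed a different model than the one that wrote it, which is the idea behind cross-model transfer.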

Knowledge transfers across models

A weaker model, given a stronger model's accumulated knowledge, doesn't need to rediscover what the stronger model already knew.

Key results

Metric	Without Forage	Forage V1	Forage V2
Self-reported coverage	100%	—	—
Actual coverage	15.9%	98.8%	99.7%
Knows when it's done	No	Yes	Yes
Learns across runs	No	No	Yes

V2 knowledge transfer (NVIDIA GPU benchmark)

Metric	Sonnet (cold start)	Sonnet (with Opus knowledge)	Improvement
Coverage	93.1%	98.6%	+5.5pp
Rounds to converge	7.0	4.5	1.8x faster
Cost per run	$9.40	$5.13	45% cheaper
Denominator agreement	Scattered (320–411)	Converged (266)	3 runs, same answer

Verified across task types

Task	Domain	Tool	What it tests
NVIDIA Desktop GPUs	Web scraping	Browser	Data collection at scale (265–411 candidates)
UniProt T2D Proteins	API queries	REST API	Tool generalization (28–30 candidates)
Q10 Mathematical Proof	Reasoning	Code execution	Non-collection task type
Q6 Mathematical Proof	Hard reasoning	Code execution	Capability boundary

Architecture

                    ┌─────────────────────────────────────────────┐
  Within a Run      │                                             │
                    │  ┌───────────┐   shared    ┌───────────┐   │
                    │  │ Evaluator │◄──────────►│  Planner  │   │
                    │  │ (eval.py) │  artifacts  │(action.py)│   │
                    │  └───────────┘             └───────────┘   │
                    │        ✗ no mutual code visibility ✗        │
                    └─────────────────────────────────────────────┘
                                        │
                                   post-mortem
                                        ▼
                    ┌─────────────────────────────────────────────┐
  Across Runs       │            Knowledge Base                   │
                    │  ┌───────┐ ┌───────┐       ┌───────┐      │
                    │  │ Run 1 │ │ Run 2 │  ...  │ Run N │      │
                    │  └───────┘ └───────┘       └───────┘      │
                    │                    │                        │
                    │              transfer ───► Sonnet (seeded) │
                    └─────────────────────────────────────────────┘

Method isolation is the core invariant. The Evaluator writes eval.py (how to measure); the Planner writes action.py (how to execute). Neither can read the other's script. They coordinate through a public eval_contract.md — like an auditor's terms of engagement. V2 enforces this through physical workspace separation — each agent runs in its own directory with no access to the other's files.

The vision

Your basecamp awaits.

Fog clears as the team explores. The explorer ventures out; the cartographer holds the camp.

Expedition management, team roster, knowledge assets.

Roadmap

V1  Expedition     →  Two agents establish credible judgment
V2  Organization   →  Experience accumulates and transfers         ← you are here
V3  Basecamp       →  A camp manager allocates resources dynamically
V4  Highway        →  Verified routes crystallize into reusable pipelines

V1 solved the single-run problem: how do you know the agent actually finished? (code)

V2 solves the multi-run problem: how does the organization learn? (report)

V3 will add a camp manager that dynamically allocates resources — adjusting turn budgets, swapping models, and curating the knowledge base based on accumulated experience.

V4 will crystallize verified routes into reusable pipelines — today's expedition, tomorrow's highway.

Status

This release contains the core experiment framework described in the V2 paper — the dual-agent loop, method isolation, knowledge evolution, and cross-model transfer. Under active development:

Basecamp UI and run visualization
Multi-provider agent integration
Sandbox isolation and security hardening
Camp manager (V3)

Built in spare time by a solo researcher — contributions and feedback are welcome.

Quick start

# Prerequisites: Python 3.11+, Claude Code CLI (https://claude.ai/code)
pip install -e .

# Run a single task
forage run tasks/nvidia_gpu.yaml

# Run a 6-run learning curve experiment
forage experiment tasks/nvidia_gpu.yaml

Project structure

forage/
├── agents/              # Evaluator, Planner, Executor implementations
│   ├── evaluator.py     #   Discovers what "complete" means
│   ├── planner.py       #   Decides how to execute
│   └── executor.py      #   Runs agent scripts (non-LLM)
├── core/
│   ├── loop.py          #   Multi-round Evaluator→Planner→Executor loop
│   ├── workspace.py     #   Physical isolation (eval_ws/ ↔ plan_ws/)
│   ├── knowledge.py     #   Post-mortem extraction & knowledge accumulation
│   ├── trajectory.py    #   Per-round state tracking
│   └── spec.py          #   Task spec loader (YAML)
└── experiments/
    ├── runner.py         #   Multi-run experiment orchestration
    ├── learning_curve.py #   Cross-run learning analysis
    └── single_agent.py   #   Baseline: single agent without Forage
tasks/                    # Task specifications (YAML)
tests/                    # 72 tests

Technical Report

For the full methodology, experiments, and analysis, see the arXiv paper, the project page, or download the PDF.

Citation

@article{xie2026forage,
    title={Forage: Knowledge Evolution and Cross-Model Transfer
           in Autonomous Agent Organizations},
    author={Xie, Huaqing},
    journal={arXiv preprint arXiv:2604.19837},
    year={2026},
    url={https://arxiv.org/abs/2604.19837}
}

Contact

Huaqing Xie — xhq422986742@gmail.com · huaqing.xie@foxmail.com

License

MIT
