An autonomous research lab powered by AI agents. Drop your code, papers, and notes — a PI agent and PhD investigators run experiments, build a knowledge graph, and write a paper.
Built on Claude Code and the Claude Agent SDK. A PI agent orchestrates multiple PhD investigator agents that run experiments in parallel on your compute, record results in a knowledge graph, and produce a research paper with a reproducibility notebook when done.
You provide four things:
- Source material — code, papers, notes, data
- Skills — domain knowledge that guides research and writing (optional)
- Compute nodes — machines to run experiments on
- Research proposal — what to investigate, how to measure success
```
                  ┌───────────────────────────────────────┐
                  │  PI Agent (Principal Investigator)    │
  sources/        │                                       │
  skills/    ───► │  Reads everything → opens threads →   │
  proposal.md     │  dispatches PhDs → reviews findings   │
  compute_nodes/  │  → generates new ideas → repeat       │
                  │                                       │
                  │   ┌──────────┐      ┌──────────┐      │
                  │   │  phd_1   │      │  phd_2   │      │
                  │   │ Runs exp │      │ Runs exp │      │
                  │   │ on nodeA │      │ on nodeB │      │
                  │   └────┬─────┘      └────┬─────┘      │
                  │        └───────┬─────────┘            │
                  │                ▼                      │
                  │   PI reviews, opens next thread       │
                  │   When done: writes paper             │
                  └───────────────────────────────────────┘
                                      │
                                      ▼
                    results.jsonl + knowledge_graph.jsonl
                         + paper/ + reproduce.ipynb
```
A simple CIFAR-10 classification task to showcase the full lab pipeline and verify everything works. The goal: improve a baseline CNN from ~70% to >85% test accuracy. It's intentionally simple — the point is to see how the PI opens threads, dispatches investigators, builds the knowledge graph, and writes a summary.
```bash
git clone git@github.com:BY571/artificial-agent-lab.git
cd artificial-agent-lab
uv sync

# Run the benchmark (~20 min, local machine, GPU auto-detected)
bash benchmark/run_benchmark.sh

# Monitor in real-time (separate terminal)
uv run streamlit run dashboard.py
```
Note: The dashboard is in beta — functional but still being improved in both features and design. Contributions welcome!
No config needed — everything is pre-configured in benchmark/.
The lab takes a simple 3-layer CNN (~70% accuracy on CIFAR-10) and systematically improves it. Here's what a typical benchmark run produces:
| # | Experiment | Accuracy | Details |
|---|---|---|---|
| 1 | Baseline | 74.25% | SimpleCNN, 10 epochs, SGD lr=0.01 |
| 2 | + Augmentation | 75.04% | RandomCrop + HorizontalFlip |
| 3 | + BatchNorm | 76.76% | BatchNorm2d after each conv layer |
| 4 | + Scheduling | 84.39% | CosineAnnealing, 50 epochs, lr=0.1 |
| 5 | + Regularization | 85.69% | Weight decay 1e-4 |
| 6 | Seed validation | 85.92% | Confirms with seed=43 |
~20 minutes total — 6 experiments, accuracy up 11.7 percentage points (74.25% → 85.92%), a knowledge graph with insights per experiment, a findings report, and a summary document. The PI progressively stacks improvements: augmentation → batch normalization → learning rate scheduling → regularization → seed validation.
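For orientation, here is a minimal PyTorch sketch of the kind of stacked recipe the table describes. It is illustrative only: the lab writes its own training code, and the momentum value and layer widths here are assumptions.

```python
# Illustrative sketch of the stacked recipe from the table above; NOT the
# lab's generated code. Momentum and layer widths are assumptions.
import torch
import torch.nn as nn
from torchvision import transforms

# Exp 2: data augmentation
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Exp 3: BatchNorm2d after each conv layer of a simple 3-layer CNN
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(128 * 4 * 4, 10),
)

# Exps 4-5: lr=0.1 with cosine annealing over 50 epochs, weight decay 1e-4
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```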
1. Install
```bash
git clone git@github.com:BY571/artificial-agent-lab.git
cd artificial-agent-lab
uv sync  # or: pip install -r requirements.txt
```
Requires Claude Code and the Claude Agent SDK.
2. Define compute nodes
Each machine needs a file in compute_nodes/. A local.md is included by default.
For remote machines, copy the template:
```bash
cp compute_nodes/TEMPLATE.md compute_nodes/my-gpu.md
```
Fill in: connection (`ssh user@host`), hardware specs, run command, and utilization (50% = shared machine, 100% = dedicated). See compute_nodes/TEMPLATE.md for all fields.
The number of investigators auto-matches the number of nodes.
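As an illustration, a hypothetical compute_nodes/my-gpu.md might look like this (the field names are indicative; TEMPLATE.md is the authoritative reference):

```markdown
<!-- compute_nodes/my-gpu.md (hypothetical example) -->
- Connection: ssh sam@gpu-box.example.com
- Hardware: 1x RTX 3090 (24 GB VRAM), 64 GB RAM
- Run command: uv run python
- Utilization: 50% (shared machine; use 100% for a dedicated node)
- Constraints: no jobs longer than 2 hours
```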
3. Create a session
```bash
python scripts/init_session.py --name "my-research" --sources /path/to/my/code
```
This creates autoresearch/<date>_my-research/ with sources/, skills/, threads/, and an empty research_proposal.md.
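The session directory then looks like this (the date prefix shown is a placeholder):

```
autoresearch/2025-01-15_my-research/
├── sources/                 # your code, papers, data (step 4)
├── skills/                  # optional domain-knowledge .md files (step 4)
├── threads/                 # populated by the PI during research
└── research_proposal.md     # empty; filled in at step 5
```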
4. Add sources and skills
Sources — what to research (code, papers, data):
```bash
cp train.py model.py autoresearch/<session>/sources/
cp reference_paper.pdf autoresearch/<session>/sources/
```
Skills — how to research (domain knowledge, .md files injected into agent prompts):
```bash
cp pytorch-patterns.md autoresearch/<session>/skills/
cp plotting-guide.md autoresearch/<session>/skills/
```
Skills guide both research AND paper writing. Optional but recommended.
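A skill file is plain markdown. For instance, a hypothetical excerpt from pytorch-patterns.md:

```markdown
<!-- pytorch-patterns.md (hypothetical skill excerpt) -->
- Fix random seeds (torch, numpy, random) before every run.
- Prefer cosine LR schedules for short training budgets.
- Log per-epoch test accuracy so figures can be regenerated later.
```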
5. Write the research proposal
Edit autoresearch/<session>/research_proposal.md with: research question, hypothesis, success criteria, starting ideas, primary metric, hardware, and budget. See research_proposal_template.md for all fields.
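A sketch under those headings (illustrative only; research_proposal_template.md has the exact format):

```markdown
<!-- research_proposal.md (illustrative sketch) -->
Research question: Can a simple CNN exceed 85% test accuracy on CIFAR-10?
Hypothesis: Standard tricks (augmentation, BatchNorm, LR scheduling) suffice.
Success criteria: >85% test accuracy, confirmed with a second seed.
Starting ideas: data augmentation, cosine LR schedule, weight decay.
Primary metric: test accuracy
Hardware: local
Budget: 4h
```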
6. Launch, monitor, stop
```bash
# Launch
python -m orchestrator.run autoresearch/<session>/

# Monitor (separate terminal)
uv run streamlit run dashboard.py

# Stop (PI finishes current thread, then writes paper)
touch autoresearch/<session>/.stop_autoresearch

# When done, merge results back to main
git checkout main && git merge research/<session-name>
```
After a session completes, the lab produces a full research package — paper (or summary), reproducibility notebook, structured results, and a knowledge graph capturing every experiment and insight.
| Output | Description |
|---|---|
| `paper/paper.pdf` or `paper/summary.md` | Research paper or findings summary |
| `paper/reproduce.ipynb` | Notebook to reproduce key results |
| `results.jsonl` | Every experiment with metrics |
| `knowledge_graph.jsonl` | What was tried, why, what was learned |
| `research_log.md` | PI's strategic reasoning |
| `threads/*/findings.md` | Per-thread analysis and recommendations |
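Both `.jsonl` outputs are append-only, one JSON object per line, so they are easy to inspect programmatically. A minimal sketch (the field names below are assumptions, not a documented schema):

```python
import json

# Append-only JSONL: one JSON object per line, so partial runs still parse.
with open("results.jsonl") as f:
    results = [json.loads(line) for line in f]

# "experiment" and "metrics" are hypothetical field names; inspect the file
# for the actual schema.
for record in results:
    print(record.get("experiment"), record.get("metrics"))
```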
| | Sources (`sources/`) | Skills (`skills/`) |
|---|---|---|
| What | Code, papers, data, notes | Domain knowledge, best practices |
| Purpose | What to research | How to research (and write papers) |
| Read by | Investigators | PI + Investigators (injected into prompts) |
Each machine needs a file in `compute_nodes/` with: connection, hardware, utilization, run command, and constraints. Investigators are auto-assigned one per node — `hardware: local+gpu1+gpu2` creates 3 investigators running in parallel.
Every experiment produces a node: what was changed, why, outcome, insights, and ideas for follow-up. The PI and investigators read it before planning — no repeated failures, discoveries compound.
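A node might look roughly like this (a hypothetical record, shown as a Python dict; the real schema may differ):

```python
# Hypothetical knowledge-graph node; field names are illustrative, not the
# real schema. Each node is stored as one JSON line in knowledge_graph.jsonl.
node = {
    "id": "exp_004",
    "change": "CosineAnnealingLR over 50 epochs, lr=0.1",
    "why": "longer training with a decaying LR should improve convergence",
    "outcome": "84.39% test accuracy, +7.63 points over the previous best",
    "insights": ["the LR schedule mattered more than architecture tweaks"],
    "follow_up": ["add weight decay", "validate with a second seed"],
}
```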
| Budget | Behavior |
|---|---|
| `4h` | Research for 4 hours, then write paper. Soft warning at 75%. |
| `unlimited` | Run indefinitely until you stop it. |
| Rate Limit Policy | Behavior |
|---|---|
| `wait` (default) | Pause, poll every 5 min, resume when cleared. Budget clock pauses. |
| `stop` | End research, write paper. |
| Setting | What you get |
|---|---|
| `paper` | Full LaTeX paper + figures + reproduce notebook + N review rounds |
| `summary` | Concise markdown summary (2-4 pages, no LaTeX needed) |
- Source-driven — drop your materials, the lab reads them
- Skill-augmented — domain knowledge guides research and writing
- Hierarchical — PI thinks strategically, investigators execute
- Branch-per-session — each session isolated on its own git branch
- Knowledge compounds — the graph prevents repeated mistakes
- Append-only — results and knowledge graph survive crashes

