Build and ship autonomous LLM training research systems
Version: v0.7.3 | License: MIT | Python: 3.11+
An autonomous research stack for continuously improving LLM training through automated experimentation. Inspired by Karpathy's autoresearch, designed for single-GPU research labs.
# Install
pip install autoresearch-stack
# Or from source
git clone https://github.com/iknowkungfubar/autoresearch-stack.git
cd autoresearch-stack
pip install -e .
# Configure (at least one API key)
export ANTHROPIC_API_KEY=sk-ant-...
# or: export OPENAI_API_KEY=sk-...
# Run the data pipeline
autoresearch --prepare-only
# Run 10 autonomous experiments
autoresearch --experiments 10
# Run with custom config
autoresearch -c my_config.yaml -i training_data.txt --experiments 100
# Python module syntax also works
python -m autoresearch --help
# Test the training pipeline without PyTorch
python train_any_llm.py --demo
This runs a complete training loop using the numpy demo model, exercising the curriculum scheduler, loss tracking, and convergence detection. No GPU or PyTorch is needed.
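To give a feel for what a numpy-only loop with loss tracking and convergence detection can look like, here is a minimal sketch. It is not the code in train_any_llm.py: the function names, the toy linear model, and the window-based stopping rule are all assumptions chosen for illustration.

```python
import numpy as np


def has_converged(losses, window=5, tol=1e-4):
    """Hypothetical rule: stop when the mean loss of the last `window` steps
    improved by less than `tol` compared with the window before it."""
    if len(losses) < 2 * window:
        return False
    recent = np.mean(losses[-window:])
    previous = np.mean(losses[-2 * window:-window])
    return (previous - recent) < tol


def train_demo(steps=500, lr=0.1, seed=0):
    """Fit y = 3x + 1 with plain gradient descent and track the loss."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=256)
    y = 3.0 * x + 1.0
    w, b = 0.0, 0.0
    losses = []
    for _ in range(steps):
        err = (w * x + b) - y
        losses.append(float(np.mean(err ** 2)))
        # Gradients of the mean squared error with respect to w and b
        w -= lr * 2.0 * np.mean(err * x)
        b -= lr * 2.0 * np.mean(err)
        if has_converged(losses):
            break
    return w, b, losses


if __name__ == "__main__":
    w, b, losses = train_demo()
    print(f"stopped after {len(losses)} steps, w={w:.3f}, b={b:.3f}")
```

The stopping rule compares two adjacent windows rather than single steps, which makes it less sensitive to noise in real training curves.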
| Module | What it does |
|---|---|
| data_intelligence.py | Corpus cleaning, noise detection, text repair |
| synthetic_data.py | LLM-powered generation with Evol-Instruct |
| curriculum.py | Adaptive scheduling (linear, exponential, step, adaptive) |
| storage.py | SQLite experiment database with JSONL fallback |
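The four schedule names listed for curriculum.py can be illustrated with a small sketch that maps training progress to a difficulty level between 0 and 1. The function names, parameters, and exact formulas below are assumptions, not the module's actual API.

```python
import math


def curriculum_difficulty(step, total_steps, schedule="linear",
                          n_stages=4, rate=5.0):
    """Map training progress to a difficulty level in [0, 1] (hypothetical)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    if schedule == "linear":
        return progress
    if schedule == "exponential":
        # Slow start, fast finish; normalised so that progress=1 gives 1.0
        return (math.exp(rate * progress) - 1.0) / (math.exp(rate) - 1.0)
    if schedule == "step":
        # Difficulty rises in n_stages equal jumps
        return math.floor(progress * n_stages) / n_stages
    raise ValueError(f"unknown schedule: {schedule}")


def adaptive_update(difficulty, previous_loss, recent_loss, step_size=0.05):
    """Adaptive variant: raise difficulty when the loss improved,
    lower it otherwise, and clamp the result to [0, 1]."""
    if recent_loss < previous_loss:
        difficulty += step_size
    else:
        difficulty -= step_size
    return min(max(difficulty, 0.0), 1.0)
```

The adaptive variant is kept separate because it depends on loss feedback rather than on the step count alone.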
| Module | What it does |
|---|---|
| memory.py | Vector store with semantic search (ChromaDB optional) |
| prioritization.py | Bandit-based selection (UCB1, epsilon-greedy, Thompson) |
| hypothesis.py | LLM-driven hypothesis generation with rule-based fallback |
| feedback.py | Reward computation, failure classification (13 types) |
| multi_agent.py | Multi-agent architecture (research, hypothesis, execution, evaluation) |
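UCB1 is one of the bandit strategies listed for prioritization.py. The sketch below shows the textbook algorithm applied to choosing among experiment types. The class and method names are hypothetical and do not reflect the module's real interface.

```python
import math


class UCB1:
    """Textbook UCB1: pick the arm with the highest mean reward plus
    an exploration bonus of sqrt(2 * ln(t) / n_arm)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.totals = [0.0] * n_arms

    def select(self):
        # Play every arm once before using the confidence bound
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        t = sum(self.counts)
        scores = [
            self.totals[a] / self.counts[a]
            + math.sqrt(2.0 * math.log(t) / self.counts[a])
            for a in range(len(self.counts))
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward


if __name__ == "__main__":
    # Arms stand in for experiment categories with fixed illustrative rewards
    rewards = [0.1, 0.9, 0.5]
    bandit = UCB1(len(rewards))
    for _ in range(500):
        arm = bandit.select()
        bandit.update(arm, rewards[arm])
    print(bandit.counts)
```

Over time most pulls go to the arm with the best observed reward, while the weaker arms are still revisited occasionally.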
| Module | What it does |
|---|---|
| sandbox.py | Safe code execution with AST-based validation |
| checkpoint.py | State persistence and resume |
| monitor.py | Real-time status and progress bars |
| daemon.py | Background execution with health checks and auto-restart |
| distribute.py | Multi-node cluster management (Docker/K8s) |
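AST-based validation, as listed for sandbox.py, generally means parsing candidate code and rejecting it before execution if it contains disallowed constructs. The following sketch uses Python's standard ast module. The blocklists and the function name are assumptions for illustration, not the rules sandbox.py actually applies.

```python
import ast

# Hypothetical blocklists; a real policy would likely be stricter
BLOCKED_MODULES = {"os", "subprocess", "socket", "shutil"}
BLOCKED_CALLS = {"eval", "exec", "open", "compile", "__import__"}


def validate_source(source):
    """Return a list of violations; an empty list means no rule matched."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    problems.append(f"import of {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED_MODULES:
                problems.append(f"import from {node.module}")
        elif isinstance(node, ast.Call):
            # Only direct calls by name are caught, e.g. eval(...)
            if isinstance(node.func, ast.Name) and node.func.id in BLOCKED_CALLS:
                problems.append(f"call to {node.func.id}")
    return problems


if __name__ == "__main__":
    print(validate_source("import os\nprint(eval('1 + 1'))"))
```

A static check like this is a first filter only. It cannot catch every dynamic trick, so it is normally combined with process-level isolation.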
| Module | What it does |
|---|---|
| providers.py | 17+ LLM providers (Anthropic, OpenAI, OpenRouter, Ollama, vLLM, etc.) |
| orchestrators.py | 7 agent orchestrators (CrewAI, AutoGen, LangChain, etc.) |
| train_any_llm.py | Training abstraction (numpy demo + optional PyTorch) |
| Module | What it does |
|---|---|
| report.py | Markdown experiment reports with comparison |
| figures.py | Matplotlib visualizations with graceful fallback |
| stats.py | Summary statistics and convergence analysis |
| paper.py | Research paper generation (Markdown/LaTeX) |
| peer_review.py | Peer review simulation (5 reviewer profiles) |
All configuration lives in config.yaml. Environment variables override YAML values:
export ANTHROPIC_API_KEY=sk-ant-... # API key (never put in config file!)
export EXPERIMENT_BUDGET=1000 # Override max experiments
export LEARNING_RATE=0.0005 # Override model LR
export SYNTHETIC_USE_LLM=true # Enable LLM data generation
export MEMORY_ENABLED=true # Enable vector memory
Cloud: Anthropic (Claude), OpenAI (GPT-4/4o), OpenRouter, Google Vertex AI, Azure OpenAI, Mistral AI, Cohere, Zen AI
Local: Ollama, vLLM, LM Studio, llama.cpp, LiteLLM, KoboldCPP, LocalAI, Text Generation WebUI
Orchestrators: OpenCode, OpenCrew, AgentForge, CrewAI, AutoGen, LangChain, LlamaIndex
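As a side note on the configuration rule that environment variables override YAML values, the sketch below shows one way such precedence can be implemented. The config key names, the mapping table, and the function names are assumptions. The real loader may differ, and the dict here stands in for an already parsed config.yaml.

```python
import os


def parse_bool(text):
    """Interpret common truthy strings such as 'true' or '1'."""
    return text.strip().lower() in {"1", "true", "yes", "on"}


# Hypothetical mapping: environment variable -> (config key, parser)
ENV_OVERRIDES = {
    "EXPERIMENT_BUDGET": ("experiment_budget", int),
    "LEARNING_RATE": ("learning_rate", float),
    "SYNTHETIC_USE_LLM": ("synthetic_use_llm", parse_bool),
    "MEMORY_ENABLED": ("memory_enabled", parse_bool),
}


def apply_env_overrides(config, environ=None):
    """Return a copy of `config` where any set environment variable wins."""
    environ = os.environ if environ is None else environ
    merged = dict(config)
    for var, (key, parse) in ENV_OVERRIDES.items():
        if var in environ:
            merged[key] = parse(environ[var])
    return merged


if __name__ == "__main__":
    yaml_values = {"experiment_budget": 100, "learning_rate": 0.001}
    print(apply_env_overrides(yaml_values, {"EXPERIMENT_BUDGET": "1000"}))
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.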
val_bpb (validation bits per byte) — Lower is better. The single optimization target.
- val_bpb is the ONLY metric
- ONE change per experiment
- Revert on regression
- Single-GPU focused
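Bits per byte is conventionally the total negative log-likelihood of the validation text, converted from nats to bits and divided by the number of bytes in that text. The sketch below shows that conversion together with a keep-or-revert check that mirrors the rules above. The function names are hypothetical, and treating a tie as a revert is an assumption.

```python
import math


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood (in nats) over a validation
    text into bits per byte of that text."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


def keep_change(baseline_bpb, candidate_bpb):
    """Keep an experiment's single change only if val_bpb strictly improved;
    otherwise the change should be reverted."""
    return candidate_bpb < baseline_bpb


if __name__ == "__main__":
    # Illustrative numbers only, not measured results
    baseline = bits_per_byte(total_nll_nats=760_000.0, total_bytes=1_000_000)
    candidate = bits_per_byte(total_nll_nats=750_000.0, total_bytes=1_000_000)
    print(f"baseline={baseline:.4f} candidate={candidate:.4f} "
          f"keep={keep_change(baseline, candidate)}")
```

Normalising by bytes rather than tokens keeps the metric comparable across experiments that change the tokenizer.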
| Version | Status | Tests | Coverage | Type Safety |
|---------|--------|-------|----------|-------------|
| v0.7.3 | Current | 148 ✅ | 73% | 0 mypy errors |
| v0.7.2 | Shipped | 104 ✅ | 57% | 43 errors |
| v0.7.0 | Shipped | 53 ✅ | n/a | n/a |
# Run all tests
pytest tests/ -q
# With coverage
pytest tests/ -q --cov=./
# Run specific test file
pytest tests/test_providers.py -v
docker build -t autoresearch-stack .
docker run --rm -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY autoresearch-stack
# Multi-node cluster
docker compose up
- Karpathy autoresearch: val_bpb metric
- Ouroboros: Self-modifying systems
- AI Scientist: Paper generation
Contributions are welcome! Please read CONTRIBUTING.md for detailed guidelines on our development process, coding standards, PR workflow, and code of conduct.
MIT — see LICENSE for details.