Intelligent model selection and adaptive pruning for multi-model local or remote systems. Run larger workloads on smaller RAM by coordinating specialists under a coordinator, pruning and quantizing the KV cache adaptively, and improving specialists over time with Autoresearcher learning loops.
A self-improving hierarchy, not a single giant model:
- Coordinator handles planning, long context, cross-domain reasoning
- Specialists own narrow tasks (coding, math, vision, video) and stay small and fast
- Autoresearcher watches escalations, teaches specialists from coordinator corrections, and promotes updates only when regression-guarded
- Pruning + quantization are model-aware and pressure-aware, not global constants
The cycle: small model runs → overhang detected → coordinator answers → teaching trace → capability delta → regression-guarded promotion → leaner specialist next run.
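A minimal sketch of that cycle, with illustrative names (`Signals`, `overhang`, and `run_cycle` are not the kit's actual API):

```python
from dataclasses import dataclass

@dataclass
class Signals:
    """Per-session escalation signals (illustrative, not the kit's types)."""
    ambiguity: float      # 0..1
    cross_domain: bool
    high_impact: bool

def overhang(s: Signals, ambiguity_threshold: float = 0.7) -> bool:
    """Any single overhang condition is enough to escalate."""
    return s.ambiguity >= ambiguity_threshold or s.cross_domain or s.high_impact

def run_cycle(task, signals, specialist, coordinator, record_trace):
    draft = specialist(task)            # smallest model that can likely succeed
    if not overhang(signals):
        return draft                    # specialist owns the result
    final = coordinator(task)           # overhang detected: escalate
    record_trace(task, draft, final)    # teaching trace feeds the next promotion
    return final
```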
Install the package itself:

```bash
pip install .
```

For a home-computer setup that will run local MLX models and the benchmarking tools, install the optional extras:

```bash
pip install ".[mlx,bench,plots]"
```

Then ask the package what your machine can realistically do:

```bash
amk doctor
```

If you want a durable user-local workspace for registry evolution, Autoresearcher artifacts, and runtime evidence, initialize it once:

```bash
amk init
```

Once `amk init` and (optionally) the MLX venv are ready, a single command runs the full autoresearcher loop end-to-end (baseline measurement, adaptive KV-cache measurement, regression-guarded promotion, and a report with PNG charts):

```bash
amk auto --mlx-python /tmp/mlx-venv/bin/python
```

Outputs land in `<workspace>/artifacts/auto/<timestamp>/`:

- `report.md`: scorecard with before/after KV footprint and throughput
- `kv_savings.png`, `throughput.png`, `convergence.png`: charts
- `summary.json`: machine-readable rollup
Each `amk auto` call also teaches the registry: the observer compares the measured `kv_bytes_per_token` against the predicted value and applies an EMA-damped, regression-guarded update (lineage + rollback intact), so the next run's policy decisions are grounded in this machine's reality, not the manufacturer's spec sheet.
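As a rough sketch of what such an update could look like (the real observer lives in `autoresearcher_runtime.py`; `alpha` and the guard threshold below are assumptions, not the kit's values):

```python
def update_kv_estimate(predicted: float, measured: float,
                       alpha: float = 0.2, max_rel_step: float = 0.5) -> float:
    """EMA-damped, guarded update for a registry prior such as kv_bytes_per_token."""
    rel_gap = abs(measured - predicted) / predicted
    if rel_gap > max_rel_step:
        # Guard: an implausible measurement is rejected, so one bad run
        # cannot corrupt the registry (lineage and rollback stay intact).
        return predicted
    # Damped move toward the measured value.
    return (1 - alpha) * predicted + alpha * measured
```

With `predicted=1024.0` and `measured=896.0`, this returns `998.4`: the prior moves 20% of the way toward the measurement rather than jumping to it.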
See `docs/closing-the-gap.md` for the full story of how the predicted → measured gap was closed.
`amk doctor` reports:
- the bundled default registry
- the initialized user-local workspace, when present
- the detected RAM tier
- likely MLX Python runtimes
- local models discovered under `~/.mlx/models`
- suggested next commands
After `amk init`, AMK prefers the user-local registry and artifact workspace. Without initialization, the bundled registry is still used by default. You only need `--registry` when you want to point at a custom file.
```bash
# route a task
amk route --task coding --context-length 16000 --requires-tools

# simulate a foreground session
amk session --task coding \
  --prompt "Update pricing logic for the new packaging model" \
  --requires-business-context --ambiguity-level 0.85

# benchmark a specialist, producing a FrontierVector
amk benchmark --target qwen-coder-specialist
```

Legacy `scripts/*.py` entry points still work for backward compat; new work should use `python3 -m adaptive_model_kit.cli <subcommand>`.
- Prefer the smallest specialist that can likely succeed.
- Escalate to the coordinator when ambiguity, cross-domain, or high-impact conditions appear.
- Make pruning and quantization depend on model profile and runtime pressure, not a global constant (sketched after this list).
- Specialists own tasks strongly but never own truth absolutely — overhang always escalates.
- High-impact outputs pass through consensus before commit.
- Every new concept earns its place by closing a loop, not opening one.
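For the pressure-aware bullet above, here is what such a policy could look like; thresholds and field names are assumptions, not the values in `policies.py`:

```python
def kv_policy(kv_density: float, ram_free_frac: float) -> dict:
    """Map KV density (relative to a baseline model) and free-RAM fraction
    to a pruning ratio and KV-cache quantization width."""
    if ram_free_frac > 0.5:                        # no pressure: full fidelity
        return {"prune_ratio": 0.0, "kv_bits": 16}
    if ram_free_frac > 0.25:                       # moderate pressure
        return {"prune_ratio": 0.2, "kv_bits": 8}
    prune = 0.5 if kv_density > 2.0 else 0.35      # heavy pressure: dense KV prunes harder
    return {"prune_ratio": prune, "kv_bits": 4}
```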
```text
src/adaptive_model_kit/
├── models/                     # data types: routing, compression, lifecycle, evolution
├── routing.py                  # model selection with context-safe fallbacks
├── policies.py                 # runtime-aware pruning + quantization decisions
├── profiling.py                # static model priors (attention, modality, KV density)
├── pruning.py                  # token scoring and cold/hot decisions
├── tiered_cache.py             # hot/cold KV demotion and restore
├── lifecycle.py                # foreground session loop with escalation + consensus
├── consensus.py                # overhang detection, agent contracts, consensus policy
├── specialist_evolution.py     # teaching traces, capability deltas, frontier
├── evolution/                  # closed loop: benchmark → train → promote
│   ├── benchmark.py            # BenchmarkSuite → FrontierVector from real runs
│   ├── trainer.py              # v0 prompt-hint injection distillation
│   └── promote.py              # registry swap with lineage + rollback
├── artifacts.py                # JSON serialization for all artifact types
├── registry.py                 # JSON model registry loader
├── autoresearcher_runtime.py   # session persistence and update detection
├── mlx_executor.py             # real MLX backend executor
└── cli.py                      # unified amk entry point
```
**Agent Contract.** Every model carries `owns`, `can_assist`, `blind_spots`, `overhang_conditions`, `escalation_targets`, `consensus_required`, and `commit_mode`. Specialists complete work locally but know when to defer.
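The contract maps naturally onto a small data type. A sketch of the shape (the real definition lives under `src/adaptive_model_kit/models/`; the field types here are guesses):

```python
from dataclasses import dataclass

@dataclass
class AgentContract:
    owns: list[str]                  # tasks completed locally, no deferral
    can_assist: list[str]            # tasks supported but not owned
    blind_spots: list[str]           # known weaknesses
    overhang_conditions: list[str]   # signals that force escalation
    escalation_targets: list[str]    # who receives the escalation
    consensus_required: str          # none | reviewer | coordinator | arbitration
    commit_mode: str                 # how outputs are committed
```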
**Overhang.** The zone where a specialist is locally competent but not globally authoritative: business context, architecture sensitivity, multimodal interpretation, high ambiguity or impact. Detected per session, it triggers escalation.
**Consensus.** A first-class decision with four levels: `none`, `reviewer`, `coordinator`, `arbitration`.
**Frontier.** `FrontierVector` tracks seven dimensions: quality, calibration, escalation quality, and latency, token, memory, and context efficiency. `HeadroomReport` surfaces the gap to target and ranks the cheapest next interventions.
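A sketch of the vector and a naive headroom ranking (field names inferred from the seven dimensions above; the real `HeadroomReport` also weighs intervention cost, which this sketch omits):

```python
from dataclasses import dataclass, asdict

@dataclass
class FrontierVector:
    quality: float
    calibration: float
    escalation_quality: float
    latency_efficiency: float
    token_efficiency: float
    memory_efficiency: float
    context_efficiency: float

def headroom(current: FrontierVector, target: FrontierVector) -> list[tuple[str, float]]:
    """Rank dimensions by remaining gap, largest first."""
    cur, tgt = asdict(current), asdict(target)
    return sorted(((name, tgt[name] - cur[name]) for name in cur),
                  key=lambda g: g[1], reverse=True)
```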
**Closed Loop.** `benchmark` measures a specialist on a fixed suite, `trainer` applies a `DistillationBatch`, and `promote` updates the registry with lineage once `RegressionGuard` passes. Rollback swaps back to the prior entry.
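In outline, the promotion step could look like this (`guard.passes` and the entry attributes are assumed names, not the kit's API):

```python
def promote(registry: dict, name: str, candidate, guard) -> bool:
    """Regression-guarded registry swap that keeps a rollback path."""
    prior = registry[name]
    if not guard.passes(candidate, prior):
        return False                  # regression detected: keep the prior entry
    candidate.lineage = prior         # rollback: swap back to `prior` if needed
    registry[name] = candidate
    return True
```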
- `docs/closing-the-gap.md`: technical article on how the predicted → measured gap was closed end-to-end (AdaptiveKVCache → runtime observer → `amk auto`)
- `docs/validation.md`: how the three kit-level claims are tested (improves, fits, persists), plus the `amk validate` scorecard
- `docs/benchmarks.md`: KV footprint computed stage-by-stage from real `~/.mlx` configs
- `docs/real-run-evidence.md`: live MLX measurements, plus the "can a 35B dense run on 8 GiB?" feasibility answer
- `docs/testing.md`: repeatable operator playbook with all CLI scenarios
- `docs/autoresearcher-test-cases.md`: focused improvement-loop scenarios
- `examples/model-registry.json`: reference registry for the five included specialists
