BLUT — Brian Lam's Universal Trainer

Rust-native orchestrator for local ML training. A compile-time typed DAG (stages → recipes → plans) with a content-addressed cache, memory containment via systemd cgroups, and built-in observability. One binary drives two domains:

LamQuant — the neural EEG-codec pipeline (data-prep → encoder / oracle / SNN / decoder / joint-codec → PCCP promotion gate). This is the primary workload.
Generic LLM — SFT, DPO, distillation, and eval over local (llama.cpp) or HuggingFace Trainer backends.

API reference: API.md · cargo doc --no-deps --open.

Why

ML pipelines accrete ad-hoc shell glue: dump data, kick off a trainer, wait, convert a checkpoint, copy it to a serving box. Each step grows its own retry logic, logging, and cache; a crash mid-run replays everything. BLUT replaces the glue with a typed pipeline:

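// Each .then() is checked at compile time: a stage's Input must match the previous stage's Output.
let plan = Plan::new("finetune", json!({}))
    .start(MaterializeDataset, dataset_args())
    .then(SftTrain, sft_args())
    .then(ConvertGguf, q4_k_m())          // Q4_K_M: a llama.cpp quantization type
    .then(RegisterModel, register_args())
    .finish();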

Stages declare a typed Input → Output and the resources they hold (Gpu, Cpu, Network, Disk). Wrong wiring is a cargo build error, not a runtime panic.
Plans are typed DAGs; recipes compile typed args into plans (a catalog of saved workflows).
Cache content-addresses every stage output by (stage_name, schema, input_hash, args_hash); if a run crashes, re-running it picks up where it left off.
Containment runs each stage under a systemd-run --user transient unit with MemoryMax, so a DataLoader OOM is cgroup-killed in isolation instead of taking down the whole login session.
Observability streams per-epoch metrics to a Parquet/CSV log (pyarrow) and, optionally, wandb (offline by default) — plus a status.jsonl event stream per job for live + post-hoc inspection.
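The claim that wrong wiring is a build error rests on associated types. The self-contained sketch below shows the idea in isolation; it is not BLUT's API. The `Stage` trait, the linear `Plan<Out>` builder, and the toy stages `LoadRows` / `SumRows` are hypothetical names, and the real builder's `.start(stage, args)` / `.finish()` shape and DAG support are omitted.

```rust
/// A unit of work with a typed input and output (hypothetical, simplified).
trait Stage {
    type Input;
    type Output;
    fn name(&self) -> &'static str;
    fn run(&self, input: Self::Input) -> Self::Output;
}

/// A linear plan whose last stage produces `Out`; stages compose as boxed closures.
struct Plan<Out> {
    names: Vec<&'static str>,
    run: Box<dyn Fn() -> Out>,
}

impl Plan<()> {
    /// An empty plan that produces the unit value `()`.
    fn new() -> Self {
        Plan { names: Vec::new(), run: Box::new(|| ()) }
    }
}

impl<Out: 'static> Plan<Out> {
    /// Append a stage. The bound `Input = Out` is what turns miswiring into a compile error.
    fn then<S>(self, stage: S) -> Plan<S::Output>
    where
        S: Stage<Input = Out> + 'static,
        S::Output: 'static,
    {
        let mut names = self.names;
        names.push(stage.name());
        let prev = self.run;
        Plan { names, run: Box::new(move || stage.run(prev())) }
    }

    /// Run every stage in order and return the final output.
    fn execute(&self) -> Out {
        (self.run)()
    }
}

// Two toy stages for illustration only.
struct LoadRows;
impl Stage for LoadRows {
    type Input = ();
    type Output = Vec<u32>;
    fn name(&self) -> &'static str { "load_rows" }
    fn run(&self, _: ()) -> Vec<u32> { vec![1, 2, 3] }
}

struct SumRows;
impl Stage for SumRows {
    type Input = Vec<u32>;
    type Output = u64;
    fn name(&self) -> &'static str { "sum_rows" }
    fn run(&self, rows: Vec<u32>) -> u64 { rows.iter().map(|&r| u64::from(r)).sum() }
}
```

`Plan::new().then(LoadRows).then(SumRows)` compiles and yields 6, while `Plan::new().then(SumRows)` is rejected by `cargo build` because `SumRows` expects `Vec<u32>` and the plan currently produces `()`.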

Install

cargo install --path .          # binary → ~/.cargo/bin/blut

The LamQuant recipes drive a Python payload (config-dumb subprocess scripts). Install it into its Python environment once:

cd python
pip install -e '.[lamquant,observability]'

Generic-LLM GGUF recipes additionally need a llama.cpp checkout for the convert + quantize tools (default ~/llama.cpp; override with $BLUT_LLAMACPP_DIR).
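The lookup order described above ($BLUT_LLAMACPP_DIR first, then ~/llama.cpp) can be sketched as follows. The function names are hypothetical, and resolving `~` through $HOME and treating an empty override as unset are assumptions of this sketch.

```rust
use std::env;
use std::path::PathBuf;

/// Pure resolution logic: a non-empty override wins, otherwise fall back to $HOME/llama.cpp.
/// (Hypothetical helper; treating "" as unset is an assumption.)
fn llamacpp_dir_from(override_dir: Option<String>, home: Option<String>) -> Option<PathBuf> {
    match override_dir {
        Some(dir) if !dir.is_empty() => Some(PathBuf::from(dir)),
        _ => home.map(|h| PathBuf::from(h).join("llama.cpp")),
    }
}

/// Read the real environment and resolve the checkout location.
fn llamacpp_dir() -> Option<PathBuf> {
    llamacpp_dir_from(env::var("BLUT_LLAMACPP_DIR").ok(), env::var("HOME").ok())
}
```

Splitting the pure function from the environment read keeps the precedence rule testable without mutating process-wide environment variables.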

Usage

Bare blut opens the interactive TUI cockpit. The common operations are also subcommands (run blut --help for the full set — recipe, jobs, log, cancel, plan, cache, stage, data, auto, policy, tui):

blut recipe list                  # catalog (name · category · I/O kinds)
blut recipe run <name> --args '{…}'
blut jobs                         # active + completed jobs
blut log <id>                     # rendered status timeline
blut cancel <id>                  # SIGTERM the job's process group
blut cache prune                  # drop stale content-addressed outputs
blut plan resume <id>             # resume a crashed DAG at first-incomplete
blut plan inspect <name>          # show a plan's stage DAG
blut tui                          # the cockpit explicitly
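`blut cache prune` and `blut plan resume` both rest on the content-addressed key (stage_name, schema, input_hash, args_hash). A minimal sketch of the key and of a first-incomplete scan follows, assuming a linear plan. FNV-1a stands in for whatever hash BLUT actually uses, and feeding each stage's key forward as the next stage's `input_hash` is a simplification (a real cache would hash the output bytes). `StageSpec`, `cache_key`, and `first_incomplete` are hypothetical names.

```rust
use std::collections::HashSet;

const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// 64-bit FNV-1a: deterministic across runs and Rust versions (unlike DefaultHasher),
/// but not collision-resistant; a production cache would use a cryptographic hash.
fn fnv1a(bytes: &[u8], mut h: u64) -> u64 {
    for &b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

/// Key over (stage_name, schema, input_hash, args_hash). String fields are
/// length-prefixed so ("ab", "c") and ("a", "bc") hash different byte streams.
fn cache_key(stage_name: &str, schema: &str, input_hash: u64, args_hash: u64) -> u64 {
    let mut h = FNV_OFFSET;
    for field in [stage_name.as_bytes(), schema.as_bytes()] {
        h = fnv1a(&(field.len() as u64).to_le_bytes(), h);
        h = fnv1a(field, h);
    }
    h = fnv1a(&input_hash.to_le_bytes(), h);
    fnv1a(&args_hash.to_le_bytes(), h)
}

struct StageSpec {
    name: &'static str,
    schema: &'static str,
    args_hash: u64,
}

/// Index of the first stage whose output is not cached (where a resume would start);
/// None means every stage is already cached.
fn first_incomplete(stages: &[StageSpec], root_input: u64, cached: &HashSet<u64>) -> Option<usize> {
    let mut input = root_input;
    for (i, s) in stages.iter().enumerate() {
        let key = cache_key(s.name, s.schema, input, s.args_hash);
        if !cached.contains(&key) {
            return Some(i);
        }
        input = key; // simplification: upstream key doubles as downstream input hash
    }
    None
}
```

Because every key depends on its upstream key, changing one stage's args invalidates that stage and everything after it, while earlier outputs stay reusable.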

Run a codec training job under memory containment:

BLUT_CONTAINED=1 blut recipe run lamquant_joint_codec --args '{
    "lma_roots":   ["/mnt/4tb/data/Archive/lma/tuh"],
    "manifest":    "/mnt/4tb/data/Training/splits/v11.json",
    "epochs":      40,
    "logger":      "wandb"
}'
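With BLUT_CONTAINED=1, each stage runs inside a systemd-run --user transient unit capped by MemoryMax. The sketch below only builds such a command line; it does not reproduce BLUT's exact invocation. Choosing a `--scope` unit (which runs in the foreground), the 24G cap, and the `python -m lamquant.train` payload are all assumptions for illustration; `--user`, `--scope`, and `-p MemoryMax=` are standard systemd-run options.

```rust
use std::process::Command;

/// Wrap a stage command in a transient user unit with a hard memory cap, so an OOM
/// is handled by the kernel inside that cgroup instead of hitting the whole session.
/// (Hypothetical helper; flag choice is an assumption, not BLUT's documented behavior.)
fn contained(program: &str, args: &[&str], memory_max: &str) -> Command {
    let mut cmd = Command::new("systemd-run");
    cmd.arg("--user")
        .arg("--scope") // foreground transient scope: the caller waits on the child
        .arg("-p")
        .arg(format!("MemoryMax={memory_max}"))
        .arg(program)
        .args(args);
    cmd
}
```

Calling `.status()` on the returned `Command` would actually launch the unit; a service unit (without `--scope`) would instead need `--wait` to block on completion.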

Background re-training (cron-driven) is gated behind an explicit policy:

blut policy enable                # writes ~/.config/blut/train-policy.toml
blut auto                         # cron-mode: decide + maybe spawn
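The gate is meant to be explicit, so the natural default is fail-closed: no policy file, no background training. A minimal sketch of that check follows. The `enabled = true` key is a hypothetical stand-in, since the schema of train-policy.toml is not described here, and the line matching is a deliberate simplification of real TOML parsing.

```rust
use std::path::Path;

/// Fail-closed cron gate: allow auto-training only if a policy text exists and
/// contains an uncommented `enabled = true` line. (Key name is hypothetical.)
fn auto_train_allowed(policy_toml: Option<&str>) -> bool {
    policy_toml.map_or(false, |text| {
        text.lines()
            .map(|l| l.split('#').next().unwrap_or("").trim()) // strip comments
            .any(|l| l.replace(' ', "") == "enabled=true")
    })
}

/// Read the policy file; a missing or unreadable file means "not allowed".
fn auto_train_allowed_at(path: &Path) -> bool {
    auto_train_allowed(std::fs::read_to_string(path).ok().as_deref())
}
```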

Recipes

Recipe                       Category   Backend     Output
lamquant_data_prep           data-prep  lamquant    lma_corpus
lamquant_encoder             pipeline   lamquant    pccp_verdict
lamquant_oracle              pipeline   lamquant    pccp_verdict
lamquant_snn                 pipeline   lamquant    pccp_verdict
lamquant_combined_decoder    pipeline   lamquant    pccp_verdict
lamquant_full_pipeline       pipeline   lamquant    pccp_verdict
lamquant_joint_codec         train      lamquant    pccp_verdict
finetune_from_dataset        train      lamu        model.gguf
finetune_from_conversations  train      lamu        model.gguf
dpo_from_preferences         train      lamu        model.gguf
distill_from_teacher         train      lamu        model.gguf
hf_finetune_from_dataset     train      hf_trainer  checkpoint.hf
eval_suite                   eval       lamu        eval.report

blut recipe list is the source of truth; blut recipe show <name> prints a recipe's args schema.

Architecture

artifacts/   typed structs that reference on-disk bytes
             (LmaCorpus, JointCkpt, GgufModel, PccpVerdict, …)
stages/      atomic typed Input → Output units of work
framework/   Plan<Out> DAG builder · executor (per-Resource semaphores)
             · content-addressed cache · status broadcast + status.jsonl
             · Cookbook trait + Registry (see below)
recipes/     saved compositions of stages (the catalog above)
config/      layered config resolver + Cartesian sweep + Launcher trait
backends/    backend adapters (LamquantBackend, lamu, hf_trainer)
tui/         the interactive cockpit (ratatui)
python/      lamquant/ — config-dumb subprocess payload, fed frozen JSON
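The executor limits concurrency with one semaphore per Resource. Rust's standard library has no semaphore type, so the sketch below builds a counting pool from Mutex + Condvar. Acquiring all of a stage's resources at once (rather than one by one) is an assumption made here to avoid two stages each holding half of what the other needs; `ResourcePool` and its methods are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::{Condvar, Mutex};

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Resource { Gpu, Cpu, Network, Disk }

/// Free slot counts per resource, guarded by one mutex; waiters park on `freed`.
struct ResourcePool {
    free: Mutex<HashMap<Resource, usize>>,
    freed: Condvar,
}

/// True if `free` covers every entry in `needs` (duplicates count as multiple slots).
fn fits(free: &HashMap<Resource, usize>, needs: &[Resource]) -> bool {
    needs.iter().all(|r| {
        let wanted = needs.iter().filter(|x| *x == r).count();
        free.get(r).copied().unwrap_or(0) >= wanted
    })
}

/// Decrement one slot per entry; callers must have checked `fits` first.
fn take(free: &mut HashMap<Resource, usize>, needs: &[Resource]) {
    for r in needs {
        *free.get_mut(r).unwrap() -= 1;
    }
}

impl ResourcePool {
    fn new(capacity: &[(Resource, usize)]) -> Self {
        Self { free: Mutex::new(capacity.iter().copied().collect()), freed: Condvar::new() }
    }

    /// Block until all requested slots are free, then take them together.
    /// (Blocks forever for a resource the pool was never given.)
    fn acquire(&self, needs: &[Resource]) {
        let mut free = self.free.lock().unwrap();
        while !fits(&free, needs) {
            free = self.freed.wait(free).unwrap();
        }
        take(&mut free, needs);
    }

    /// Non-blocking variant: take all slots or none.
    fn try_acquire(&self, needs: &[Resource]) -> bool {
        let mut free = self.free.lock().unwrap();
        if !fits(&free, needs) {
            return false;
        }
        take(&mut free, needs);
        true
    }

    /// Return slots and wake every waiter so each can re-check its own needs.
    fn release(&self, held: &[Resource]) {
        let mut free = self.free.lock().unwrap();
        for r in held {
            *free.entry(*r).or_insert(0) += 1;
        }
        self.freed.notify_all();
    }
}
```

`notify_all` rather than `notify_one` matters here: waiters need different resource mixes, so the one woken by `notify_one` might not be the one that can now proceed.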

Cookbook direction (ADR 0037). BLUT is moving toward a domain-agnostic core plus loadable cookbooks — a cookbook is a domain pack (typed Rust stages/artifacts + a Python payload of config-dumb scripts); selecting one unlocks its recipes. The Cookbook trait + Registry seam and a standalone joint-codec recipe have landed; the full extraction of the LamQuant pack into cookbooks/lamquant/ is in progress. Config layering + sweep expansion reuse the Lerna Rust Hydra core (vendored, Apache-2.0) so Hydra-style overrides resolve natively in Rust.
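The Cookbook trait + Registry seam can be pictured as "register packs, select some, expose only their recipes". The shapes below are hypothetical and far thinner than a real pack (no stages, artifacts, or Python payload); `LamQuantPack` and its recipe list are illustrative only.

```rust
use std::collections::BTreeMap;

/// Hypothetical cookbook: a named domain pack that contributes recipes.
trait Cookbook {
    fn name(&self) -> &'static str;
    fn recipes(&self) -> Vec<&'static str>;
}

/// Holds every known cookbook, but only exposes recipes from selected ones.
#[derive(Default)]
struct Registry {
    cookbooks: BTreeMap<&'static str, Box<dyn Cookbook>>,
    selected: Vec<&'static str>,
}

impl Registry {
    fn register(&mut self, cb: Box<dyn Cookbook>) {
        self.cookbooks.insert(cb.name(), cb);
    }

    /// Selecting a cookbook unlocks its recipes; unknown names are an error.
    fn select(&mut self, name: &str) -> Result<(), String> {
        match self.cookbooks.get_key_value(name) {
            Some((&key, _)) => {
                if !self.selected.contains(&key) {
                    self.selected.push(key);
                }
                Ok(())
            }
            None => Err(format!("unknown cookbook: {name}")),
        }
    }

    /// Sorted recipe names from all selected cookbooks.
    fn available_recipes(&self) -> Vec<&'static str> {
        let mut out: Vec<&'static str> =
            self.selected.iter().flat_map(|n| self.cookbooks[n].recipes()).collect();
        out.sort_unstable();
        out
    }
}

// Illustrative pack.
struct LamQuantPack;
impl Cookbook for LamQuantPack {
    fn name(&self) -> &'static str { "lamquant" }
    fn recipes(&self) -> Vec<&'static str> { vec!["lamquant_joint_codec", "lamquant_data_prep"] }
}
```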

PCCP gate

LamQuant model recipes terminate in a pccp_verdict — the Predetermined Change Control Plan promotion gate (fail-closed). A checkpoint cannot be promoted unless the gate scores it ≥ its registered LQS floor on real, measured fullband metrics. See the meta repo's pccp/ and LamQuant-Neural/ai_models/pccp_gate.py.
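"Fail-closed" means every missing or malformed input resolves to a rejection rather than a pass. A minimal Rust sketch of that decision rule follows; the authoritative gate is the Python pccp_gate.py referenced above, and the function name, the reason strings, and the `Option<f64>` inputs are assumptions of this sketch.

```rust
/// Outcome of the promotion gate.
#[derive(Debug, PartialEq)]
enum Verdict {
    Promote,
    Reject(&'static str),
}

/// Fail-closed: promote only if a finite floor is registered, a measured score
/// exists, the score is finite, and score >= floor. Anything else rejects.
fn pccp_gate(measured_lqs: Option<f64>, registered_floor: Option<f64>) -> Verdict {
    let Some(floor) = registered_floor.filter(|f| f.is_finite()) else {
        return Verdict::Reject("no registered LQS floor");
    };
    let Some(score) = measured_lqs else {
        return Verdict::Reject("no measured fullband metrics");
    };
    if !score.is_finite() {
        return Verdict::Reject("non-finite score");
    }
    if score >= floor { Verdict::Promote } else { Verdict::Reject("below LQS floor") }
}
```

The explicit `is_finite` check matters: `NaN >= floor` is false, which would already reject, but naming the case keeps the reason visible in the verdict.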

Status

Pre-1.0. The framework, cache, containment, observability, and the LamQuant codec recipes are runnable end to end. Some generic-LLM trainer scripts (DPO, distill) currently only accept the typed args and emit --self-check JSON; their full Python implementations will land iteratively. Cookbook extraction (ADR 0037 C-track) is in progress; cluster launchers (Slurm / Ray) are deferred.

License

MIT. See LICENSE.
