Local-first experiment tracking for deep learning. Two components: a Python SDK for logging and a Rust TUI for visualization.
Storage: single .extract/ directory with SQLite (WAL mode) + file artifacts. No server.
from extract import Store
store = Store() # reads from .extract/ in current directoryStore(root=".extract")
root: str | Path— path to.extract/directory. Created if absent. Default:".extract".- Hierarchy is read from
config.toml([store] hierarchy = "...").
store.experiment(spec: dict[str, str]) -> Experiment
- Keys must be hierarchy levels in order from root. Creates intermediate nodes. Values cannot skip levels.
store.experiment({"benchmark": "imagenet", "model": "resnet50", "variant": "v1"}) store.experiment({"benchmark": "imagenet"}) # partial — stops at benchmark level
store.list_experiments(prefix="") -> list[Experiment]
- Returns all experiments, optionally filtered by path prefix.
store.todo(content: str, priority: int = 0) -> None
- Creates a global-scoped TODO. Priority: 0=low, 1=mid, 2=high.
store.list_todos(scope_type="global", scope_id=None) -> list[dict]
- Returns TODOs filtered by scope. Ordered by priority DESC, then created_at.
store.close() -> None
- Closes the database connection.
Returned by store.experiment(). Properties: id: str, path: str, name: str.
experiment.run(config=None, name=None, total_steps=None) -> Run
config: dict | None— JSON-serializable config dict. Stored as JSON string.name: str | None— human-readable run name.total_steps: int | None— declares the training-loop length so the TUI's curve chart can pin its x-axis at[0, total_steps - 1]from the moment the chart appears (curve fills left-to-right rather than rescaling). Optional; when unset, the chart auto-fits to the largest observed step.- Returns a
Run. Usable as a context manager or directly. Automatically captureshostnameandgit_sha(current HEAD). - Run status:
"running"on creation,"completed"or"failed"on finish.
experiment.list_runs() -> list[dict]
- Returns all runs for this experiment as dicts, ordered by
started_at.
Usable as a context manager or via direct calls. Properties: id: str.
# Context manager — auto-finishes on exit
with experiment.run(config={"lr": 0.001}, name="run-001") as run:
run.log(loss=1.0)
# Direct call — finish explicitly
run = experiment.run(config={"lr": 0.001}, name="run-002")
run.log(loss=1.0)
run.finish()run.finish(status="completed") -> None
- Flushes metric buffer, sets
ended_atandstatus. - Idempotent — safe to call multiple times.
- Called automatically by
__exit__when used as context manager. - All mutating methods (
log,log_table,tag, etc.) raiseRuntimeErrorafterfinish().
run.log(**kwargs: float | int | str) -> None
- No
stepparameter — headline metrics are single values per run, not time-series. Re-logging the same metric name overwrites the previous value (INSERT OR REPLACE semantics). - Numeric values (int, float) →
scalar_metricstable (headline metrics). - String values →
run_paramstable (categorical parameters, deduplicated by name). - Buffered in memory; flushed every 100 entries or on
finish(). wall_timeis automatically recorded (seconds since run start).- Shown in the TUI metrics summary, comparison tables, branch rankings, and the MCP
compare_runsheadline columns. Use this for the small set of values you want to summarize the run. - Passing
step=as a kwarg raisesTypeError— useRun.curve()for per-step streaming values.
# Headline values, each call overwrites on re-log.
run.log(final_loss=0.3, final_accuracy=0.9, arch="resnet18")
# Later re-log — single row per name, latest wins.
run.log(final_loss=0.28)run.curve(step: int, **kwargs: float | int) -> None
- Numeric values only — strings raise
TypeError.boolis also rejected (Python'sbool <: intwould otherwise silently coerce to1.0/0.0). - Writes to the
curve_pointstable, which is physically separate fromscalar_metrics. Headline-summary surfaces (Summary panel, branch rankings, MCPcompare_runsheadline columns) never see this data. - The TUI's chart panel reads from
curve_points. Use this for the high-frequency per-step training values that drive a live chart but should not clutter the run summary. - Buffered with a smaller threshold (10 points) plus a wall-clock fallback (~2 seconds) so slow training loops still feel live in the TUI. Flushed at threshold, after the wall-clock window, or on
finish(). Likerun.log(), buffered points are lost on hard process kill (SIGKILL) — always use awithblock sofinish()flushes on exit. wall_timeis automatically recorded.- Combined with
total_steps=Nonexperiment.run(...), the TUI chart's x-axis is pinned at[0, N-1]and the curve fills left-to-right.
with experiment.run(config={"lr": 0.01}, total_steps=1000) as run:
for step in range(1000):
loss, acc = train_step(...)
run.curve(step=step, train_loss=loss, train_acc=acc)
run.log(final_acc=acc) # one headline metric for the Summaryrun.log_table(name: str, data: np.ndarray, step=None, axes=None) -> None
- Saves NumPy array as
.npyfile underartifacts/{run_id}/matrices/. axes: dict | None— metadata like{"rows": "task", "cols": "step"}.- File:
{name}[_step_{step}].npy
run.log_text(name: str, content: str) -> None
- Saves as markdown under
artifacts/{run_id}/text/{name}.md.
run.tag(*tags: str) -> None
- Appends tags. Stored as JSON array in
runs.tags.
run.note(content: str) -> None
- Appends to run's notes (newline-separated).
run.todo(content: str, priority: int = 0) -> None
- Creates a TODO scoped to this run. Priority: 0=low, 1=mid, 2=high.
run.register_model(name: str, version: str, path: str, metadata=None, framework="pytorch") -> None
- Copies model file/directory to
models/{name}/{version}/. - Records in
modelstable.(name, version)must be unique.
run.derived_from(run=None, model=None, version=None) -> None
- Records this run as derived from another run (by run ID) or model (by name+version or model ID).
run.branched_from(experiment=None, run=None) -> None
- Records this run as branched from an experiment (by ID) or another run (by ID).
from extract import syncsync.push(root: Path, remote: str) -> None
- Rsync local store to remote. Checkpoints WAL first.
sync.pull(root: Path, remote: str) -> dict[str, int]
- Rsync remote into temp dir, merge DB and artifacts into local store.
- Returns
{table_name: rows_added}.
sync.export_archive(root: Path, output: Path) -> None
- Creates
.tar.gzof the store.
sync.import_archive(archive: Path, root: Path) -> dict[str, int]
- Extracts archive, merges into target store.
- Returns
{table_name: rows_added}.
sync.merge_db(src_path: Path, dst_path: Path) -> dict[str, int]
- Low-level DB merge. Experiments matched by
path(not ID). Runs use ULIDs so never collide. Returns per-table row counts.
extract init [path] [--hierarchy "a > b > c"] [--no-gitignore]
extract tui [--store .extract]
extract sync push <remote> [--root .extract]
extract sync pull <remote> [--root .extract]
extract sync export <output.tar.gz> [--root .extract]
extract sync import <archive.tar.gz> [--root .extract]extract init — Bootstrap a .extract/ store with a hierarchy. Interactive by default; use --hierarchy for non-interactive/CI use. Refuses if the store is already configured.
pip install extract-tracker # SDK + CLI + TUI binary + MCP serverRead-only MCP (Model Context Protocol) server that exposes the store to LLM agents (Claude Code, Claude Desktop, any MCP-capable host). Lets an agent inspect experiments, runs, metrics, configs, lineage, models, and TODOs without writing Python.
python -m extract.mcp [--store PATH]--store PATH— path to the.extract/directory. Default:.extract(resolved relative to the server's cwd, which is the MCP host's cwd). Launchingclaudein a project folder automatically binds to that project's store.- Transport: stdio. Spawned by the MCP host as a subprocess; not invoked interactively.
- Read-only. No tools that mutate the store in v1.
Add a project-scoped .mcp.json at the project root:
{
"mcpServers": {
"extract": {
"command": ".venv/bin/python",
"args": ["-m", "extract.mcp"]
}
}
}The relative command resolves against Claude's cwd (= project root when launched there). For globally-installed Python: replace with the absolute path or python if it's on $PATH.
All 8 tools are read-only. Run IDs are ULIDs; agents discover them via list_runs / search and pass them to other tools verbatim. Listing tools share an envelope: {items, total, truncated}, default limit=50, hard cap 500 (clamps silently with limit_clamped: true).
list_experiments(prefix: str = "", limit: int = 50) -> dict
- Lists experiments, optionally filtered by path prefix. Each item:
{id, path, name, node_type, parent_id, n_runs}.
list_runs(experiment_id: str | None = None, limit: int = 50) -> dict
- All runs (newest-first) or scoped to one experiment. Each item:
{id, label, experiment_id, experiment_path, name, status, started_at, ended_at, tags, git_sha, hostname, config_summary}.config_summaryis{n_keys, top_level_keys}— callget_runfor the full config.
get_run(run_id: str) -> dict
- Full detail:
{id, experiment_id, experiment_path, name, label, status, started_at, ended_at, hostname, git_sha, tags, notes, config, metrics_final, metrics_available, run_params, artifacts, todos}.metrics_finalis the headline value of each scalar metric; streaming curve histories are NOT included (usecompare_runswithinclude_curves=True).
compare_runs(run_ids: list[str], include_curves: bool = False) -> dict
- 2–10 runs. Returns
{runs, metrics, config_diffs}plus an optional top-levelcurvesfield wheninclude_curves=True. Per metric inmetrics:{direction, values, ranking}.directionis"min"or"max"from[metrics].minimize/[metrics].maximizeinconfig.toml, then name heuristics (loss,error,mse, etc. → min; everything else → max).rankingis best-to-worst by the headline value.config_diffsonly contains keys where at least two runs differ; nested configs are flattened with dot notation (method.lora_r). Wheninclude_curves=True, thecurvesfield has shape{name: {run_id: [[step, value], ...]}}and is sourced from thecurve_pointstable (populated byrun.curve()during training) — curve metrics are separate from headline metrics and may have different names.
search(query: str = "", filters: dict | None = None, limit: int = 50) -> dict
- Substring + structured filter search over runs. Returns the same shape as
list_runs. query: case-insensitive substring against run name, tags, notes. Empty = no text filter.filters(all AND-combined, all optional):tag(str),status("running" | "completed" | "failed"),experiment_prefix(str),started_after(ISO 8601),started_before(ISO 8601).
list_todos(scope_type: str = "global", scope_id: str | None = None, include_done: bool = False, limit: int = 50) -> dict
scope_type:"global" | "experiment" | "run".scope_idrequired for non-global scopes, must beNonefor global.include_done=Falseby default.
get_lineage(node_type: str, node_id: str, direction: str = "both", depth: int = 2) -> dict
- BFS walk of the lineage DAG.
node_type:"experiment" | "run" | "model".direction:"ancestors" | "descendants" | "both".depth: 1–5. Returns flat{root, nodes, edges}(not a tree — handles DAG diamonds). Labels:path#namefor runs,pathfor experiments,name@versionfor models.
list_models(name_prefix: str = "", limit: int = 50) -> dict
- Each item:
{id, name, version, run_id, framework, artifact_path, metadata, created_at}.
All tool-visible errors raise ValueError with agent-readable strings: "Run not found: 'X'", "compare_runs requires at least 2 run_ids (got 1)", "Unknown filter: 'X'. Valid filters: tag, status, experiment_prefix, started_after, started_before", etc. Listing tools accept limit > 500 and silently clamp (with limit_clamped: true in the response) rather than erroring — the agent can recover without retrying.
SQLite with WAL journal mode. 9 tables:
Hierarchical namespace nodes.
| Column | Type | Notes |
|---|---|---|
id |
TEXT PK | ULID |
path |
TEXT NOT NULL | slash-delimited, e.g. "imagenet/resnet50/v1" |
name |
TEXT NOT NULL | leaf name of this node |
parent_id |
TEXT FK → experiments | NULL for root nodes |
created_at |
TEXT | ISO 8601 |
metadata |
TEXT | JSON, nullable |
status |
TEXT | default "created" |
node_type |
TEXT | hierarchy level name, e.g. "benchmark" |
Single execution within an experiment.
| Column | Type | Notes |
|---|---|---|
id |
TEXT PK | ULID |
experiment_id |
TEXT FK → experiments | NOT NULL |
name |
TEXT | human-readable |
config |
TEXT | JSON dict |
started_at |
TEXT | ISO 8601, auto-set |
ended_at |
TEXT | set on finish() |
status |
TEXT | "running", "completed", "failed" |
hostname |
TEXT | auto-captured |
git_sha |
TEXT | auto-captured from HEAD |
tags |
TEXT | JSON array of strings |
notes |
TEXT | plain text, newline-separated |
total_steps |
INTEGER | nullable; declared training-loop length for chart x-axis pin |
Headline metrics from run.log(). The TUI Summary panel, branch rankings, and the MCP compare_runs headline columns all read this table.
| Column | Type | Notes |
|---|---|---|
id |
INTEGER PK | autoincrement |
run_id |
TEXT FK → runs | |
step |
INTEGER | |
name |
TEXT | metric name |
value |
REAL | |
wall_time |
REAL | seconds since run start |
Unique constraint: (run_id, name, step).
Streaming-curve points from run.curve(). The TUI's detail-panel and compare-view chart panels read this table; it is never read by headline-summary queries, so high-frequency training values cannot pollute the Summary panel or branch rankings.
| Column | Type | Notes |
|---|---|---|
run_id |
TEXT FK → runs | NOT NULL |
name |
TEXT | metric name, NOT NULL |
step |
INTEGER | NOT NULL |
value |
REAL | NOT NULL |
wall_time |
REAL | seconds since run start |
Unique constraint: (run_id, name, step) — no surrogate PK.
Files associated with a run.
| Column | Type | Notes |
|---|---|---|
id |
TEXT PK | ULID |
run_id |
TEXT FK → runs | |
name |
TEXT | artifact name |
kind |
TEXT | "matrix" or "text" |
step |
INTEGER | nullable |
rel_path |
TEXT | relative to store root |
shape |
TEXT | JSON array for matrices |
dtype |
TEXT | numpy dtype string |
metadata |
TEXT | JSON, e.g. {"axes": {...}} |
created_at |
TEXT | ISO 8601 |
Versioned model registry.
| Column | Type | Notes |
|---|---|---|
id |
TEXT PK | ULID |
name |
TEXT | model name |
version |
TEXT | version string |
run_id |
TEXT FK → runs | nullable |
artifact_path |
TEXT | absolute path to copied model |
framework |
TEXT | e.g. "pytorch" |
metadata |
TEXT | JSON, nullable |
created_at |
TEXT | ISO 8601 |
Unique constraint: (name, version).
Directed edges between experiments, runs, and models.
| Column | Type | Notes |
|---|---|---|
id |
INTEGER PK | autoincrement |
parent_type |
TEXT | "experiment", "run", "model" |
parent_id |
TEXT | |
child_type |
TEXT | "experiment", "run", "model" |
child_id |
TEXT | |
relation |
TEXT | "derived_from", "branched_from" |
metadata |
TEXT | JSON, nullable |
created_at |
TEXT | ISO 8601 |
Unique constraint: (parent_type, parent_id, child_type, child_id, relation).
Categorical/string key-value attributes for a run.
| Column | Type | Notes |
|---|---|---|
id |
INTEGER PK | autoincrement |
run_id |
TEXT FK → runs | |
name |
TEXT | param name |
value |
TEXT | param value |
Unique constraint: (run_id, name).
User-defined level ordering.
| Column | Type | Notes |
|---|---|---|
level_order |
INTEGER PK | 0-indexed |
level_name |
TEXT UNIQUE | e.g. "benchmark" |
Task notes scoped to global, experiment, or run.
| Column | Type | Notes |
|---|---|---|
id |
TEXT PK | ULID |
scope_type |
TEXT | "global", "experiment", "run" |
scope_id |
TEXT | nullable (NULL for global) |
content |
TEXT | |
done |
INTEGER | 0 or 1 |
priority |
INTEGER | 0=low, 1=mid, 2=high |
created_at |
TEXT | ISO 8601 |
completed_at |
TEXT | nullable |
.extract/
├── extract.db # SQLite database
├── extract.db-wal # WAL journal (auto-managed)
├── extract.db-shm # shared memory (auto-managed)
├── config.toml # TUI configuration
├── sync.lock # present during sync operations
├── artifacts/
│ └── {run_id}/
│ ├── matrices/{name}.npy # NumPy arrays
│ └── text/{name}.md # markdown text
└── models/
└── {model_name}/{version}/ # copied model files
| Key | Type | Default | Description |
|---|---|---|---|
hierarchy |
string |
(none) | Experiment tree levels separated by >, e.g. "benchmark > model > variant" |
Controls the Summary tab in the Detail panel (selected via S).
| Key | Type | Default | Description |
|---|---|---|---|
sections |
string[] |
["runs", "metrics", "tables", "curves"] |
Display order in detail panel |
curve_width |
int |
80 |
Chart width as % of panel (1-100) |
curve_height |
int |
auto | Chart height in lines (auto-scales by metric count: 12/10/8/6) |
curve_smooth |
bool |
false |
Catmull-Rom curve interpolation |
Controls the Info tab in the Detail panel (selected via I) and the Config section in Compare/Diff views. Nested configs are flattened with dot-notation (e.g. method.lora_r, task.num_train_epochs).
| Key | Type | Default | Description |
|---|---|---|---|
fields |
string[] |
[] (show all) |
Glob patterns for which config keys to display |
Dots act as path separators so glob semantics apply per segment:
| Pattern | Matches | Does not match |
|---|---|---|
method.* |
method.name, method.lora_r |
method.deep.nested |
method.** |
method.name, method.deep.nested |
model.name |
*.name |
method.name, model.name |
method.deep.name |
**.name |
method.name, method.deep.name |
method.lora_r |
method.lora_* |
method.lora_r, method.lora_alpha |
method.name |
method.lora_? |
method.lora_r |
method.lora_alpha |
{method,model}.* |
method.name, model.name |
task.name |
!method.parent |
(excludes method.parent) |
Negation patterns (!) exclude matching keys. Combine with positive patterns: ["method.**", "!method.parent"] shows everything under method except parent. When empty (default), all config keys are shown.
Controls the Compare view (triggered via c with marked runs).
| Key | Type | Default | Description |
|---|---|---|---|
sections |
string[] |
["pivot", "config", "tables", "curves"] |
Display order in compare view |
curve_width |
int |
50 |
Chart width as % of panel (1-100) |
curve_height |
int |
auto | Chart height in lines (auto-scales by metric count: 12/10/8/6) |
Determines metric direction (minimize vs maximize) for ranking, comparison arrows, and improvement highlighting.
| Key | Type | Default | Description |
|---|---|---|---|
minimize |
string[] |
[] |
Metrics where lower is better |
maximize |
string[] |
[] |
Metrics where higher is better |
Unlisted metrics fall back to name-based heuristics (see Metric Direction).
Ordered highlight rules for matrix/table cells. First match wins.
| Field | Type | Description |
|---|---|---|
eq |
float |
Exact value match |
min |
float |
Inclusive lower bound |
max |
float |
Exclusive upper bound |
pattern |
string |
Substring match |
color |
string |
Color name (see below) |
Colors: "red", "green", "yellow", "blue", "cyan", "magenta", "white", "black", "darkgray", "orange", "none" (terminal default), or any hex color (e.g. "#ff6600").
All values are hex color strings. Omitted fields use ANSI 16-color defaults.
| Key | Default | Description |
|---|---|---|
fg |
"#cdd6f4" |
Foreground text |
bg |
"#1e1e2e" |
Background |
accent |
"#89b4fa" |
Primary accent |
accent_dim |
"#585b70" |
Dimmed accent |
success |
"#a6e3a1" |
Success indicators |
warning |
"#f9e2af" |
Warning indicators |
error |
"#f38ba8" |
Error indicators |
border |
"#585b70" |
Unfocused borders |
border_focused |
"#89b4fa" |
Focused borders |
| Key | Type | Default | Description |
|---|---|---|---|
timeout |
int |
3 |
Auto-dismiss timeout in seconds |
Metric direction (minimize vs maximize) determines ranking, comparison arrows, and improvement highlighting.
Override direction for specific metrics in config.toml:
[metrics]
minimize = ["forgetting_rate", "custom_loss"]
maximize = ["custom_score"]Metrics not listed in config use name-based heuristics. These patterns are recognized as minimize (lower is better):
loss, error, perplexity, mse, mae, rmse, nll, cer, wer, fid, divergence
All other unlisted metrics default to maximize (higher is better).
Assumes config.toml has [store] hierarchy = "benchmark > model > variant".
import numpy as np
from extract import Store
store = Store()
# Create experiment hierarchy
exp = store.experiment({
"benchmark": "imagenet",
"model": "resnet50",
"variant": "lr-sweep"
})
# Context manager approach
with exp.run(config={"lr": 0.001, "bs": 32, "epochs": 50}, name="resnet50-lr1e3", total_steps=50) as run:
for step in range(50):
# Streaming curve points — drive the live chart, not the headline summary.
run.curve(step=step, train_loss=2.3 - step * 0.04, train_acc=step * 0.018)
# Headline metrics — appear in Summary panel and rankings.
run.log(final_acc=0.92)
run.log(arch="resnet50", optimizer="sgd")
run.log_table("confusion_matrix", np.random.rand(1000, 1000), axes={"rows": "true", "cols": "predicted"})
run.log_text("notes", "## Observations\nResNet50 with lr=1e-3 converges stably.")
run.tag("sweep", "production-candidate")
run.note("Best learning rate in sweep")
run.todo("Try lr=0.005 next", priority=1)
run.register_model("resnet50-imagenet", "v1.0", "/tmp/model.pt", framework="pytorch")
run.derived_from(model="baseline-imagenet", version="v0.9")
# Direct call approach — run spans multiple scopes
run = exp.run(config={"lr": 0.005}, name="resnet50-lr5e3")
run.log(loss=2.1)
run.tag("sweep")
run.finish() # explicit finalize (idempotent)
# Global TODO
store.todo("Write up results for paper", priority=2)
store.close()rust/src/
├── main.rs # entry point, CLI args, main loop
├── app.rs # AppState, View/Focus enums, all application state
├── db.rs # SQLite read-only access layer
├── model.rs # data structs: Experiment, Run, ScalarMetric, Artifact, Model, etc.
├── config.rs # TOML config parsing, theme/color handling
├── keys.rs # keybinding constants and key matching (h/l=tab, arrows=nav/cycle)
├── event.rs # event handling (Key, Tick, Resize)
├── artifact.rs # NumPy table loading, CellValue handling
└── ui/
├── layout.rs # main layout orchestrator + event dispatcher
├── tree.rs # experiment tree navigator
├── detail.rs # detail panel (Summary/Info tabs)
├── dashboard.rs # dashboard view
├── summary.rs # renders runs, metrics, curves, tables
├── chart.rs # line chart rendering
├── compare.rs # compare view (side-by-side runs)
├── diff.rs # diff view (config differences)
├── search.rs # search popup
├── popup.rs # run picker, run browser, delete confirm
├── selection.rs # selection window for marked runs
├── registry.rs # model registry view
├── lineage.rs # lineage DAG view
├── todo.rs # TODO management view
├── help.rs # help overlay
├── statusbar.rs # status bar
├── theme.rs # theme color/style definitions
└── heatmap.rs # heatmap visualization
python/src/extract/
├── __init__.py # public API: Store, Experiment, Run, sync
├── __main__.py # CLI entry point (extract tui / sync)
├── store.py # Store class: DB, hierarchy, experiment creation
├── experiment.py # Experiment class: run creation, run listing
├── run.py # Run class: logging, artifacts, models, lineage
├── metrics.py # helpers: save_npy, load_npy, save_text
├── sync.py # sync: rsync, tar archives, DB merging
└── mcp.py # MCP server: 8 read-only tools over stdio