Extract — LLM Agent Documentation

Local-first experiment tracking for deep learning. Two components: a Python SDK for logging and a Rust TUI for visualization.

Storage: single .extract/ directory with SQLite (WAL mode) + file artifacts. No server.


Python SDK API

Store

from extract import Store

store = Store()  # reads from .extract/ in current directory

Store(root=".extract")

  • root: str | Path — path to .extract/ directory. Created if absent. Default: ".extract".
  • Hierarchy is read from config.toml ([store] hierarchy = "...").

store.experiment(spec: dict[str, str]) -> Experiment

  • Keys must be hierarchy levels in order from root. Creates intermediate nodes. Values cannot skip levels.
    store.experiment({"benchmark": "imagenet", "model": "resnet50", "variant": "v1"})
    store.experiment({"benchmark": "imagenet"})  # partial — stops at benchmark level
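
The ordering rule above can be pictured with a small standalone sketch. It does not use the SDK: parse_hierarchy and spec_to_path are hypothetical helpers, and the sketch assumes the hierarchy string is split on ">" and that the resulting path is slash-delimited, as in the experiments table.

```python
def parse_hierarchy(hierarchy: str) -> list:
    """Split a hierarchy string such as "a > b > c" into level names."""
    return [level.strip() for level in hierarchy.split(">") if level.strip()]


def spec_to_path(levels: list, spec: dict) -> str:
    """Check that spec names a prefix of the levels, in order, then build the path."""
    expected = levels[: len(spec)]
    if list(spec) != expected:
        # Covers unknown keys, skipped levels, and out-of-order keys.
        raise ValueError(f"expected keys {expected}, got {list(spec)}")
    return "/".join(spec[level] for level in expected)


levels = parse_hierarchy("benchmark > model > variant")
full_path = spec_to_path(
    levels, {"benchmark": "imagenet", "model": "resnet50", "variant": "v1"}
)
partial_path = spec_to_path(levels, {"benchmark": "imagenet"})
```

A spec such as {"benchmark": "imagenet", "variant": "v1"} is rejected by this sketch because it skips the model level.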

store.list_experiments(prefix="") -> list[Experiment]

  • Returns all experiments, optionally filtered by path prefix.

store.todo(content: str, priority: int = 0) -> None

  • Creates a global-scoped TODO. Priority: 0=low, 1=mid, 2=high.

store.list_todos(scope_type="global", scope_id=None) -> list[dict]

  • Returns TODOs filtered by scope. Ordered by priority DESC, then created_at.

store.close() -> None

  • Closes the database connection.

Experiment

Returned by store.experiment(). Properties: id: str, path: str, name: str.

experiment.run(config=None, name=None, total_steps=None) -> Run

  • config: dict | None — JSON-serializable config dict. Stored as JSON string.
  • name: str | None — human-readable run name.
  • total_steps: int | None — declares the training-loop length so the TUI's curve chart can pin its x-axis at [0, total_steps - 1] from the moment the chart appears (curve fills left-to-right rather than rescaling). Optional; when unset, the chart auto-fits to the largest observed step.
  • Returns a Run. Usable as a context manager or directly. Automatically captures hostname and git_sha (current HEAD).
  • Run status: "running" on creation, "completed" or "failed" on finish.
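
As a rough illustration of the total_steps behaviour, the sketch below computes the x-axis range a chart would use. chart_x_bounds is a hypothetical function, not TUI code, and starting the auto-fit range at 0 is an assumption of this sketch.

```python
from typing import Optional


def chart_x_bounds(total_steps: Optional[int], observed_steps: list) -> tuple:
    """Return the (min, max) x-axis range for a curve chart."""
    if total_steps is not None:
        # Pinned axis: fixed from the first point, so the curve fills left to right.
        return (0, total_steps - 1)
    # Auto-fit: the upper bound follows the largest step seen so far.
    # The lower bound of 0 is an assumption of this sketch.
    return (0, max(observed_steps, default=0))


pinned = chart_x_bounds(1000, [0, 1, 2])
auto = chart_x_bounds(None, [0, 1, 2])
```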

experiment.list_runs() -> list[dict]

  • Returns all runs for this experiment as dicts, ordered by started_at.

Run

Usable as a context manager or via direct calls. Properties: id: str.

# Context manager — auto-finishes on exit
with experiment.run(config={"lr": 0.001}, name="run-001") as run:
    run.log(loss=1.0)

# Direct call — finish explicitly
run = experiment.run(config={"lr": 0.001}, name="run-002")
run.log(loss=1.0)
run.finish()

Lifecycle

run.finish(status="completed") -> None

  • Flushes metric buffer, sets ended_at and status.
  • Idempotent — safe to call multiple times.
  • Called automatically by __exit__ when used as context manager.
  • All mutating methods (log, log_table, tag, etc.) raise RuntimeError after finish().
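
The lifecycle rules above can be mimicked with a toy class. SketchRun is a hypothetical stand-in, not the SDK's Run; marking the run "failed" when the with block exits through an exception is an assumption of this sketch.

```python
class SketchRun:
    """Toy stand-in that mirrors only the documented lifecycle rules."""

    def __init__(self) -> None:
        self.status = "running"
        self.logged = {}

    def log(self, **metrics) -> None:
        if self.status != "running":
            raise RuntimeError("run is already finished")
        self.logged.update(metrics)

    def finish(self, status: str = "completed") -> None:
        if self.status != "running":
            return  # idempotent: later calls change nothing
        self.status = status

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Assumption: an exception inside the block marks the run as failed.
        self.finish("failed" if exc_type is not None else "completed")
        return False  # never swallow the exception


with SketchRun() as run:
    run.log(loss=1.0)

run.finish()          # second call is a no-op
final_status = run.status
```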

Headline Metrics

run.log(**kwargs: float | int | str) -> None

  • No step parameter — headline metrics are single values per run, not time-series. Re-logging the same metric name overwrites the previous value (INSERT OR REPLACE semantics).
  • Numeric values (int, float) → scalar_metrics table (headline metrics).
  • String values → run_params table (categorical parameters, deduplicated by name).
  • Buffered in memory; flushed every 100 entries or on finish().
  • wall_time is automatically recorded (seconds since run start).
  • Shown in the TUI metrics summary, comparison tables, branch rankings, and the MCP compare_runs headline columns. Use this for the small set of values you want to summarize the run.
  • Passing step= as a kwarg raises TypeError — use Run.curve() for per-step streaming values.
# Headline values, each call overwrites on re-log.
run.log(final_loss=0.3, final_accuracy=0.9, arch="resnet18")
# Later re-log — single row per name, latest wins.
run.log(final_loss=0.28)
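
The overwrite and routing rules can be reproduced with an in-memory SQLite database. This is a simplified sketch, not the SDK's storage code: the two tables carry only the columns needed here, the unique key is reduced to (run_id, name), and buffering is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scalar_metrics (run_id TEXT, name TEXT, value REAL, UNIQUE (run_id, name))"
)
conn.execute(
    "CREATE TABLE run_params (run_id TEXT, name TEXT, value TEXT, UNIQUE (run_id, name))"
)


def log(run_id: str, **kwargs) -> None:
    """Route strings to run_params and numbers to scalar_metrics; latest value wins."""
    for name, value in kwargs.items():
        table = "run_params" if isinstance(value, str) else "scalar_metrics"
        conn.execute(
            f"INSERT OR REPLACE INTO {table} (run_id, name, value) VALUES (?, ?, ?)",
            (run_id, name, value),
        )


log("run-1", final_loss=0.3, final_accuracy=0.9, arch="resnet18")
log("run-1", final_loss=0.28)  # replaces the earlier final_loss row

metrics = dict(conn.execute("SELECT name, value FROM scalar_metrics").fetchall())
params = dict(conn.execute("SELECT name, value FROM run_params").fetchall())
```

After both calls, scalar_metrics holds one row per metric name and run_params holds the single string parameter.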

Streaming Curves

run.curve(step: int, **kwargs: float | int) -> None

  • Numeric values only — strings raise TypeError. bool is also rejected (Python's bool <: int would otherwise silently coerce to 1.0/0.0).
  • Writes to the curve_points table, which is physically separate from scalar_metrics. Headline-summary surfaces (Summary panel, branch rankings, MCP compare_runs headline columns) never see this data.
  • The TUI's chart panel reads from curve_points. Use this for the high-frequency per-step training values that drive a live chart but should not clutter the run summary.
  • Buffered with a smaller threshold (10 points) plus a wall-clock fallback (~2 seconds) so slow training loops still feel live in the TUI. Flushed at threshold, after the wall-clock window, or on finish(). Like run.log(), buffered points are lost on a hard process kill (SIGKILL), which no handler can intercept. Use a with block so finish() still flushes on normal exit and on exceptions.
  • wall_time is automatically recorded.
  • Combined with total_steps=N on experiment.run(...), the TUI chart's x-axis is pinned at [0, N-1] and the curve fills left-to-right.
with experiment.run(config={"lr": 0.01}, total_steps=1000) as run:
    for step in range(1000):
        loss, acc = train_step(...)
        run.curve(step=step, train_loss=loss, train_acc=acc)
    run.log(final_acc=acc)   # one headline metric for the Summary
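
The flush policy for curve points (size threshold plus wall-clock fallback) might look roughly like the sketch below. CurveBuffer is hypothetical; it assumes the age check happens when a point is added, whereas the SDK may use a different mechanism such as a background timer. The clock is injectable so the timing can be tested without sleeping.

```python
import time


class CurveBuffer:
    """Collect points and hand them to `flush` in batches."""

    def __init__(self, flush, max_points=10, max_age_s=2.0, clock=time.monotonic):
        self._flush = flush
        self._max_points = max_points
        self._max_age_s = max_age_s
        self._clock = clock
        self._points = []
        self._last_flush = clock()

    def add(self, step, name, value):
        self._points.append((step, name, value))
        too_many = len(self._points) >= self._max_points
        too_old = self._clock() - self._last_flush >= self._max_age_s
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self._points:
            self._flush(list(self._points))
            self._points.clear()
        self._last_flush = self._clock()


written = []
fake_now = [0.0]  # stand-in clock, advanced by hand
buf = CurveBuffer(written.extend, clock=lambda: fake_now[0])

for step in range(9):
    buf.add(step, "train_loss", 1.0 / (step + 1))
count_before_threshold = len(written)

buf.add(9, "train_loss", 0.1)        # tenth point reaches the size threshold
count_after_threshold = len(written)

fake_now[0] = 2.5                    # more than two seconds since the last flush
buf.add(10, "train_loss", 0.09)      # a single point, flushed because of age
count_after_timeout = len(written)
```

Anything still in the buffer when the process is killed is never passed to the flush callback, which is the data-loss case described above.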

Artifacts

run.log_table(name: str, data: np.ndarray, step=None, axes=None) -> None

  • Saves NumPy array as .npy file under artifacts/{run_id}/matrices/.
  • axes: dict | None — metadata like {"rows": "task", "cols": "step"}.
  • File: {name}[_step_{step}].npy

run.log_text(name: str, content: str) -> None

  • Saves as markdown under artifacts/{run_id}/text/{name}.md.

Tags & Notes

run.tag(*tags: str) -> None

  • Appends tags. Stored as JSON array in runs.tags.

run.note(content: str) -> None

  • Appends to run's notes (newline-separated).

TODOs

run.todo(content: str, priority: int = 0) -> None

  • Creates a TODO scoped to this run. Priority: 0=low, 1=mid, 2=high.

Model Registry

run.register_model(name: str, version: str, path: str, metadata=None, framework="pytorch") -> None

  • Copies model file/directory to models/{name}/{version}/.
  • Records in models table. (name, version) must be unique.

Lineage

run.derived_from(run=None, model=None, version=None) -> None

  • Records this run as derived from another run (by run ID) or model (by name+version or model ID).

run.branched_from(experiment=None, run=None) -> None

  • Records this run as branched from an experiment (by ID) or another run (by ID).

Sync Module

from extract import sync

sync.push(root: Path, remote: str) -> None

  • Rsync local store to remote. Checkpoints WAL first.

sync.pull(root: Path, remote: str) -> dict[str, int]

  • Rsync remote into temp dir, merge DB and artifacts into local store.
  • Returns {table_name: rows_added}.

sync.export_archive(root: Path, output: Path) -> None

  • Creates .tar.gz of the store.

sync.import_archive(archive: Path, root: Path) -> dict[str, int]

  • Extracts archive, merges into target store.
  • Returns {table_name: rows_added}.

sync.merge_db(src_path: Path, dst_path: Path) -> dict[str, int]

  • Low-level DB merge. Experiments are matched by path (not ID). Runs are keyed by ULID, so run IDs from different stores do not collide. Returns per-table row counts.
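
The matching rule can be illustrated with two in-memory databases. This is a sketch under simplifying assumptions, not the sync module: the schema is cut down to two tables with a few columns, and make_db and merge are hypothetical helpers.

```python
import sqlite3

SCHEMA = """
CREATE TABLE experiments (id TEXT PRIMARY KEY, path TEXT NOT NULL);
CREATE TABLE runs (id TEXT PRIMARY KEY, experiment_id TEXT NOT NULL, name TEXT);
"""


def make_db(experiments, runs):
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO experiments VALUES (?, ?)", experiments)
    conn.executemany("INSERT INTO runs VALUES (?, ?, ?)", runs)
    return conn


def merge(src, dst):
    """Copy rows from src into dst and return {table_name: rows_added}."""
    added = {"experiments": 0, "runs": 0}
    id_map = {}  # src experiment id -> id of the matching experiment in dst
    for src_id, path in src.execute("SELECT id, path FROM experiments").fetchall():
        row = dst.execute("SELECT id FROM experiments WHERE path = ?", (path,)).fetchone()
        if row is None:
            dst.execute("INSERT INTO experiments VALUES (?, ?)", (src_id, path))
            added["experiments"] += 1
            id_map[src_id] = src_id
        else:
            id_map[src_id] = row[0]  # same path: reuse the destination's node
    for run_id, exp_id, name in src.execute("SELECT id, experiment_id, name FROM runs").fetchall():
        exists = dst.execute("SELECT 1 FROM runs WHERE id = ?", (run_id,)).fetchone()
        if exists is None:
            dst.execute("INSERT INTO runs VALUES (?, ?, ?)", (run_id, id_map[exp_id], name))
            added["runs"] += 1
    dst.commit()
    return added


local = make_db([("E-local", "imagenet/resnet50")], [("R1", "E-local", "run-001")])
remote = make_db(
    [("E-remote", "imagenet/resnet50"), ("E-new", "imagenet/vit")],
    [("R1", "E-remote", "run-001"), ("R2", "E-remote", "run-002"), ("R3", "E-new", "run-003")],
)
rows_added = merge(remote, local)
```

In this example the remote experiment with the same path is folded into the local one, so run R2 ends up attached to the local experiment ID.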

CLI

extract init [path] [--hierarchy "a > b > c"] [--no-gitignore]
extract tui [--store .extract]
extract sync push <remote> [--root .extract]
extract sync pull <remote> [--root .extract]
extract sync export <output.tar.gz> [--root .extract]
extract sync import <archive.tar.gz> [--root .extract]

extract init — Bootstrap a .extract/ store with a hierarchy. Interactive by default; use --hierarchy for non-interactive/CI use. Refuses if the store is already configured.

Install

pip install extract-tracker            # SDK + CLI + TUI binary + MCP server

MCP Server

Read-only MCP (Model Context Protocol) server that exposes the store to LLM agents (Claude Code, Claude Desktop, any MCP-capable host). Lets an agent inspect experiments, runs, metrics, configs, lineage, models, and TODOs without writing Python.

Entry Point

python -m extract.mcp [--store PATH]
  • --store PATH — path to the .extract/ directory. Default: .extract (resolved relative to the server's cwd, which is the MCP host's cwd). Launching claude in a project folder automatically binds to that project's store.
  • Transport: stdio. Spawned by the MCP host as a subprocess; not invoked interactively.
  • Read-only. No tools that mutate the store in v1.

Registering with Claude Code

Add a project-scoped .mcp.json at the project root:

{
  "mcpServers": {
    "extract": {
      "command": ".venv/bin/python",
      "args": ["-m", "extract.mcp"]
    }
  }
}

The relative command resolves against Claude's cwd (the project root when launched there). For a globally installed Python, replace it with the absolute interpreter path, or with python if that is on $PATH.

Tools

All 8 tools are read-only. Run IDs are ULIDs; agents discover them via list_runs / search and pass them to other tools verbatim. Listing tools share an envelope: {items, total, truncated}, default limit=50, hard cap 500 (clamps silently with limit_clamped: true).
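
A minimal sketch of the shared listing envelope follows. The envelope function is hypothetical, and it assumes that total counts all matching rows before the limit is applied and that truncated means some rows were left out; the source does not spell out either field.

```python
DEFAULT_LIMIT = 50
HARD_CAP = 500


def envelope(rows: list, limit: int = DEFAULT_LIMIT) -> dict:
    """Wrap a result list in {items, total, truncated}, clamping oversized limits."""
    effective = min(limit, HARD_CAP)
    items = rows[:effective]
    result = {
        "items": items,
        "total": len(rows),                   # assumption: count before the limit
        "truncated": len(items) < len(rows),  # assumption: rows were dropped
    }
    if limit > HARD_CAP:
        result["limit_clamped"] = True        # flag the clamp instead of raising
    return result


default_page = envelope(list(range(120)))
clamped_page = envelope(list(range(600)), limit=1000)
```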

list_experiments(prefix: str = "", limit: int = 50) -> dict

  • Lists experiments, optionally filtered by path prefix. Each item: {id, path, name, node_type, parent_id, n_runs}.

list_runs(experiment_id: str | None = None, limit: int = 50) -> dict

  • All runs (newest-first) or scoped to one experiment. Each item: {id, label, experiment_id, experiment_path, name, status, started_at, ended_at, tags, git_sha, hostname, config_summary}. config_summary is {n_keys, top_level_keys} — call get_run for the full config.

get_run(run_id: str) -> dict

  • Full detail: {id, experiment_id, experiment_path, name, label, status, started_at, ended_at, hostname, git_sha, tags, notes, config, metrics_final, metrics_available, run_params, artifacts, todos}. metrics_final is the headline value of each scalar metric; streaming curve histories are NOT included (use compare_runs with include_curves=True).

compare_runs(run_ids: list[str], include_curves: bool = False) -> dict

  • Accepts 2–10 runs. Returns {runs, metrics, config_diffs} plus an optional top-level curves field when include_curves=True.
  • Per metric in metrics: {direction, values, ranking}. direction is "min" or "max", taken from [metrics].minimize / [metrics].maximize in config.toml, then from name heuristics (loss, error, mse, etc. → min; everything else → max). ranking is best-to-worst by the headline value.
  • config_diffs only contains keys where at least two runs differ; nested configs are flattened with dot notation (method.lora_r).
  • When include_curves=True, the curves field has shape {name: {run_id: [[step, value], ...]}} and is sourced from the curve_points table (populated by run.curve() during training). Curve metrics are separate from headline metrics and may have different names.
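
The flattening, diffing, and ranking described for compare_runs can be sketched in a few lines. flatten, config_diffs, and rank are hypothetical helpers; the per-key output shape and the choice to treat a missing key as a difference (reported as None) are assumptions of this sketch.

```python
def flatten(config: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dot-notation keys, e.g. method.lora_r."""
    flat = {}
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


_MISSING = object()


def config_diffs(configs: dict) -> dict:
    """Keep only the flattened keys on which at least two runs disagree."""
    flat = {run_id: flatten(cfg) for run_id, cfg in configs.items()}
    all_keys = sorted({key for cfg in flat.values() for key in cfg})
    diffs = {}
    for key in all_keys:
        values = [cfg.get(key, _MISSING) for cfg in flat.values()]
        if any(value != values[0] for value in values[1:]):
            diffs[key] = {run_id: cfg.get(key) for run_id, cfg in flat.items()}
    return diffs


def rank(values: dict, direction: str) -> list:
    """Order run IDs best-to-worst for a "min" or "max" metric."""
    return sorted(values, key=values.get, reverse=(direction == "max"))


configs = {
    "run-a": {"lr": 0.001, "method": {"name": "lora", "lora_r": 8}},
    "run-b": {"lr": 0.001, "method": {"name": "lora", "lora_r": 16}},
}
diffs = config_diffs(configs)
loss_ranking = rank({"run-a": 0.30, "run-b": 0.28}, "min")
acc_ranking = rank({"run-a": 0.92, "run-b": 0.90}, "max")
```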

search(query: str = "", filters: dict | None = None, limit: int = 50) -> dict

  • Substring + structured filter search over runs. Returns the same shape as list_runs.
  • query: case-insensitive substring against run name, tags, notes. Empty = no text filter.
  • filters (all AND-combined, all optional): tag (str), status ("running" | "completed" | "failed"), experiment_prefix (str), started_after (ISO 8601), started_before (ISO 8601).
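
One way to picture the query and filter semantics is a predicate over a single run dict. run_matches is a hypothetical helper, and the sample run is made-up data. The sketch assumes that tag must equal one of the run's tags and that both time bounds are exclusive; the source does not specify either.

```python
from datetime import datetime

VALID_FILTERS = ("tag", "status", "experiment_prefix", "started_after", "started_before")


def run_matches(run: dict, query: str = "", filters=None) -> bool:
    filters = filters or {}
    for key in filters:
        if key not in VALID_FILTERS:
            raise ValueError(
                f"Unknown filter: {key!r}. Valid filters: {', '.join(VALID_FILTERS)}"
            )
    if query:
        # Case-insensitive substring against name, each tag, and notes.
        fields = [run.get("name") or "", run.get("notes") or "", *run.get("tags", [])]
        if not any(query.lower() in field.lower() for field in fields):
            return False
    started = datetime.fromisoformat(run["started_at"])
    checks = [
        "tag" not in filters or filters["tag"] in run.get("tags", []),
        "status" not in filters or run["status"] == filters["status"],
        "experiment_prefix" not in filters
        or run["experiment_path"].startswith(filters["experiment_prefix"]),
        "started_after" not in filters
        or started > datetime.fromisoformat(filters["started_after"]),
        "started_before" not in filters
        or started < datetime.fromisoformat(filters["started_before"]),
    ]
    return all(checks)  # every supplied filter must hold (AND)


sample_run = {
    "name": "resnet50-lr1e3",
    "tags": ["sweep", "production-candidate"],
    "notes": "Best learning rate in sweep",
    "status": "completed",
    "experiment_path": "imagenet/resnet50/lr-sweep",
    "started_at": "2024-05-01T12:00:00",
}
by_text = run_matches(sample_run, query="RESNET")
by_filters = run_matches(sample_run, filters={"tag": "sweep", "status": "completed"})
wrong_status = run_matches(sample_run, filters={"tag": "sweep", "status": "failed"})
```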

list_todos(scope_type: str = "global", scope_id: str | None = None, include_done: bool = False, limit: int = 50) -> dict

  • scope_type: "global" | "experiment" | "run". scope_id required for non-global scopes, must be None for global. include_done=False by default.

get_lineage(node_type: str, node_id: str, direction: str = "both", depth: int = 2) -> dict

  • BFS walk of the lineage DAG. node_type: "experiment" | "run" | "model". direction: "ancestors" | "descendants" | "both". depth: 1–5. Returns flat {root, nodes, edges} (not a tree — handles DAG diamonds). Labels: path#name for runs, path for experiments, name@version for models.
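
A depth-limited BFS that returns a flat node and edge set could look like the sketch below. walk_lineage is a hypothetical function over an in-memory edge list; it assumes "both" means the union of an ancestor walk and a descendant walk from the root, and its error messages are illustrative.

```python
from collections import deque


def _walk(edges, root, depth, upward):
    """BFS from root along one direction, at most `depth` hops."""
    seen, kept = {root}, set()
    queue = deque([(root, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue  # depth budget spent; do not expand further
        for parent, child, relation in edges:
            source, target = (child, parent) if upward else (parent, child)
            if source != node:
                continue
            kept.add((parent, child, relation))
            if target not in seen:  # a diamond reaches the same node twice
                seen.add(target)
                queue.append((target, dist + 1))
    return seen, kept


def walk_lineage(edges, root, direction="both", depth=2):
    if direction not in ("ancestors", "descendants", "both"):
        raise ValueError(f"Unknown direction: {direction!r}")
    if not 1 <= depth <= 5:
        raise ValueError("depth must be between 1 and 5")
    nodes, kept = {root}, set()
    if direction in ("ancestors", "both"):
        seen, found = _walk(edges, root, depth, upward=True)
        nodes |= seen
        kept |= found
    if direction in ("descendants", "both"):
        seen, found = _walk(edges, root, depth, upward=False)
        nodes |= seen
        kept |= found
    return {"root": root, "nodes": sorted(nodes), "edges": sorted(kept)}


# Nodes are (node_type, node_id) pairs; B and C both feed D (a diamond).
A, B, C, D = ("run", "A"), ("run", "B"), ("run", "C"), ("run", "D")
BASE = ("model", "base@v0.9")
edges = [
    (BASE, A, "derived_from"),
    (A, B, "branched_from"),
    (A, C, "branched_from"),
    (B, D, "derived_from"),
    (C, D, "derived_from"),
]
descendants = walk_lineage(edges, A, direction="descendants", depth=2)
neighbours = walk_lineage(edges, A, direction="both", depth=1)
```

Because the result is a flat set rather than a tree, node D appears once even though two paths lead to it.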

list_models(name_prefix: str = "", limit: int = 50) -> dict

  • Each item: {id, name, version, run_id, framework, artifact_path, metadata, created_at}.

Errors

All tool-visible errors raise ValueError with agent-readable strings: "Run not found: 'X'", "compare_runs requires at least 2 run_ids (got 1)", "Unknown filter: 'X'. Valid filters: tag, status, experiment_prefix, started_after, started_before", etc. Listing tools accept limit > 500 and silently clamp (with limit_clamped: true in the response) rather than erroring — the agent can recover without retrying.


Database Schema

SQLite with WAL journal mode. 10 tables:

experiments

Hierarchical namespace nodes.

| Column | Type | Notes |
| --- | --- | --- |
| id | TEXT PK | ULID |
| path | TEXT NOT NULL | slash-delimited, e.g. "imagenet/resnet50/v1" |
| name | TEXT NOT NULL | leaf name of this node |
| parent_id | TEXT FK → experiments | NULL for root nodes |
| created_at | TEXT | ISO 8601 |
| metadata | TEXT | JSON, nullable |
| status | TEXT | default "created" |
| node_type | TEXT | hierarchy level name, e.g. "benchmark" |

runs

Single execution within an experiment.

| Column | Type | Notes |
| --- | --- | --- |
| id | TEXT PK | ULID |
| experiment_id | TEXT FK → experiments | NOT NULL |
| name | TEXT | human-readable |
| config | TEXT | JSON dict |
| started_at | TEXT | ISO 8601, auto-set |
| ended_at | TEXT | set on finish() |
| status | TEXT | "running", "completed", "failed" |
| hostname | TEXT | auto-captured |
| git_sha | TEXT | auto-captured from HEAD |
| tags | TEXT | JSON array of strings |
| notes | TEXT | plain text, newline-separated |
| total_steps | INTEGER | nullable; declared training-loop length for chart x-axis pin |

scalar_metrics

Headline metrics from run.log(). The TUI Summary panel, branch rankings, and the MCP compare_runs headline columns all read this table.

| Column | Type | Notes |
| --- | --- | --- |
| id | INTEGER PK | autoincrement |
| run_id | TEXT FK → runs | |
| step | INTEGER | |
| name | TEXT | metric name |
| value | REAL | |
| wall_time | REAL | seconds since run start |

Unique constraint: (run_id, name, step).

curve_points

Streaming-curve points from run.curve(). The TUI's detail-panel and compare-view chart panels read this table; it is never read by headline-summary queries, so high-frequency training values cannot pollute the Summary panel or branch rankings.

| Column | Type | Notes |
| --- | --- | --- |
| run_id | TEXT FK → runs | NOT NULL |
| name | TEXT | metric name, NOT NULL |
| step | INTEGER | NOT NULL |
| value | REAL | NOT NULL |
| wall_time | REAL | seconds since run start |

Unique constraint: (run_id, name, step) — no surrogate PK.

artifacts

Files associated with a run.

| Column | Type | Notes |
| --- | --- | --- |
| id | TEXT PK | ULID |
| run_id | TEXT FK → runs | |
| name | TEXT | artifact name |
| kind | TEXT | "matrix" or "text" |
| step | INTEGER | nullable |
| rel_path | TEXT | relative to store root |
| shape | TEXT | JSON array for matrices |
| dtype | TEXT | numpy dtype string |
| metadata | TEXT | JSON, e.g. {"axes": {...}} |
| created_at | TEXT | ISO 8601 |

models

Versioned model registry.

| Column | Type | Notes |
| --- | --- | --- |
| id | TEXT PK | ULID |
| name | TEXT | model name |
| version | TEXT | version string |
| run_id | TEXT FK → runs | nullable |
| artifact_path | TEXT | absolute path to copied model |
| framework | TEXT | e.g. "pytorch" |
| metadata | TEXT | JSON, nullable |
| created_at | TEXT | ISO 8601 |

Unique constraint: (name, version).

lineage

Directed edges between experiments, runs, and models.

| Column | Type | Notes |
| --- | --- | --- |
| id | INTEGER PK | autoincrement |
| parent_type | TEXT | "experiment", "run", "model" |
| parent_id | TEXT | |
| child_type | TEXT | "experiment", "run", "model" |
| child_id | TEXT | |
| relation | TEXT | "derived_from", "branched_from" |
| metadata | TEXT | JSON, nullable |
| created_at | TEXT | ISO 8601 |

Unique constraint: (parent_type, parent_id, child_type, child_id, relation).

run_params

Categorical/string key-value attributes for a run.

| Column | Type | Notes |
| --- | --- | --- |
| id | INTEGER PK | autoincrement |
| run_id | TEXT FK → runs | |
| name | TEXT | param name |
| value | TEXT | param value |

Unique constraint: (run_id, name).

hierarchy

User-defined level ordering.

| Column | Type | Notes |
| --- | --- | --- |
| level_order | INTEGER PK | 0-indexed |
| level_name | TEXT UNIQUE | e.g. "benchmark" |

todos

Task notes scoped to global, experiment, or run.

| Column | Type | Notes |
| --- | --- | --- |
| id | TEXT PK | ULID |
| scope_type | TEXT | "global", "experiment", "run" |
| scope_id | TEXT | nullable (NULL for global) |
| content | TEXT | |
| done | INTEGER | 0 or 1 |
| priority | INTEGER | 0=low, 1=mid, 2=high |
| created_at | TEXT | ISO 8601 |
| completed_at | TEXT | nullable |

Store Directory Layout

.extract/
├── extract.db                          # SQLite database
├── extract.db-wal                      # WAL journal (auto-managed)
├── extract.db-shm                      # shared memory (auto-managed)
├── config.toml                         # store, metrics, and TUI configuration
├── sync.lock                           # present during sync operations
├── artifacts/
│   └── {run_id}/
│       ├── matrices/{name}.npy         # NumPy arrays
│       └── text/{name}.md              # markdown text
└── models/
    └── {model_name}/{version}/         # copied model files

Configuration (config.toml)

Store Setup

[store]

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| hierarchy | string | (none) | Experiment tree levels separated by >, e.g. "benchmark > model > variant" |

View Layout — controls what each TUI panel/view displays

[summary]

Controls the Summary tab in the Detail panel (selected via S).

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| sections | string[] | ["runs", "metrics", "tables", "curves"] | Display order in detail panel |
| curve_width | int | 80 | Chart width as % of panel (1-100) |
| curve_height | int | auto | Chart height in lines (auto-scales by metric count: 12/10/8/6) |
| curve_smooth | bool | false | Catmull-Rom curve interpolation |

[info]

Controls the Info tab in the Detail panel (selected via I) and the Config section in Compare/Diff views. Nested configs are flattened with dot-notation (e.g. method.lora_r, task.num_train_epochs).

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| fields | string[] | [] (show all) | Glob patterns for which config keys to display |

Dots act as path separators so glob semantics apply per segment:

| Pattern | Matches | Does not match |
| --- | --- | --- |
| method.* | method.name, method.lora_r | method.deep.nested |
| method.** | method.name, method.deep.nested | model.name |
| *.name | method.name, model.name | method.deep.name |
| **.name | method.name, method.deep.name | method.lora_r |
| method.lora_* | method.lora_r, method.lora_alpha | method.name |
| method.lora_? | method.lora_r | method.lora_alpha |
| {method,model}.* | method.name, model.name | task.name |
| !method.parent | (excludes method.parent) | |

Negation patterns (!) exclude matching keys. Combine with positive patterns: ["method.**", "!method.parent"] shows everything under method except parent. When empty (default), all config keys are shown.
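
The per-segment glob rules can be approximated by translating each pattern into a regular expression. This is a sketch, not the TUI's matcher: glob_to_regex and select_keys are hypothetical, and two behaviours are assumptions, namely that ** never matches zero segments (so **.name needs at least one segment before .name) and that a list containing only negations starts from all keys.

```python
import re


def glob_to_regex(pattern: str) -> str:
    """Translate a dotted glob into a regex source string."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if pattern.startswith("**", i):
            out.append(".*")        # may cross dot separators
            i += 2
        elif ch == "*":
            out.append("[^.]*")     # stays inside one segment
            i += 1
        elif ch == "?":
            out.append("[^.]")      # exactly one non-dot character
            i += 1
        elif ch == "{" and "}" in pattern[i:]:
            end = pattern.index("}", i)
            options = pattern[i + 1 : end].split(",")
            out.append("(?:" + "|".join(glob_to_regex(opt) for opt in options) + ")")
            i = end + 1
        else:
            out.append(re.escape(ch))
            i += 1
    return "".join(out)


def select_keys(keys, patterns):
    """Apply positive patterns, then drop anything hit by a ! pattern."""
    if not patterns:
        return list(keys)           # empty list: show all keys
    positive = [re.compile(glob_to_regex(p)) for p in patterns if not p.startswith("!")]
    negative = [re.compile(glob_to_regex(p[1:])) for p in patterns if p.startswith("!")]
    selected = []
    for key in keys:
        included = any(rx.fullmatch(key) for rx in positive) if positive else True
        excluded = any(rx.fullmatch(key) for rx in negative)
        if included and not excluded:
            selected.append(key)
    return selected


KEYS = [
    "method.name",
    "method.lora_r",
    "method.lora_alpha",
    "method.parent",
    "method.deep.nested",
    "model.name",
]
one_level = select_keys(KEYS, ["method.*"])
without_parent = select_keys(KEYS, ["method.**", "!method.parent"])
single_char = select_keys(KEYS, ["method.lora_?"])
```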

[compare]

Controls the Compare view (triggered via c with marked runs).

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| sections | string[] | ["pivot", "config", "tables", "curves"] | Display order in compare view |
| curve_width | int | 50 | Chart width as % of panel (1-100) |
| curve_height | int | auto | Chart height in lines (auto-scales by metric count: 12/10/8/6) |

Data Interpretation — controls how metrics and table values are evaluated

[metrics]

Determines metric direction (minimize vs maximize) for ranking, comparison arrows, and improvement highlighting.

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| minimize | string[] | [] | Metrics where lower is better |
| maximize | string[] | [] | Metrics where higher is better |

Unlisted metrics fall back to name-based heuristics (see Metric Direction).

[[tables.highlight]]

Ordered highlight rules for matrix/table cells. First match wins.

| Field | Type | Description |
| --- | --- | --- |
| eq | float | Exact value match |
| min | float | Inclusive lower bound |
| max | float | Exclusive upper bound |
| pattern | string | Substring match |
| color | string | Color name (see below) |

Colors: "red", "green", "yellow", "blue", "cyan", "magenta", "white", "black", "darkgray", "orange", "none" (terminal default), or any hex color (e.g. "#ff6600").
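
The first-match-wins evaluation of highlight rules can be sketched as follows. rule_matches and highlight_color are hypothetical; the sketch assumes that all conditions present in one rule must hold together, that numeric conditions never match non-numeric cells, and that pattern is tested against the cell's string form.

```python
def rule_matches(rule: dict, value) -> bool:
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if "eq" in rule and not (is_number and value == rule["eq"]):
        return False
    if "min" in rule and not (is_number and value >= rule["min"]):  # inclusive
        return False
    if "max" in rule and not (is_number and value < rule["max"]):   # exclusive
        return False
    if "pattern" in rule and rule["pattern"] not in str(value):
        return False
    return True


def highlight_color(value, rules):
    """Return the color of the first matching rule, or None if no rule applies."""
    for rule in rules:
        if rule_matches(rule, value):
            return rule["color"]
    return None


rules = [
    {"eq": 1.0, "color": "green"},
    {"min": 0.8, "max": 1.0, "color": "yellow"},
    {"max": 0.5, "color": "red"},
    {"pattern": "n/a", "color": "darkgray"},
]
colors = [highlight_color(v, rules) for v in (1.0, 0.8, 0.5, 0.2, "n/a")]
```

With these rules, 0.8 is highlighted because the lower bound is inclusive, while 0.5 is left unstyled because the upper bound of the third rule is exclusive.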

Appearance

[theme]

All values are hex color strings. Omitted fields use ANSI 16-color defaults.

| Key | Default | Description |
| --- | --- | --- |
| fg | "#cdd6f4" | Foreground text |
| bg | "#1e1e2e" | Background |
| accent | "#89b4fa" | Primary accent |
| accent_dim | "#585b70" | Dimmed accent |
| success | "#a6e3a1" | Success indicators |
| warning | "#f9e2af" | Warning indicators |
| error | "#f38ba8" | Error indicators |
| border | "#585b70" | Unfocused borders |
| border_focused | "#89b4fa" | Focused borders |

[notifications]

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| timeout | int | 3 | Auto-dismiss timeout in seconds |

Metric Direction

Metric direction (minimize vs maximize) determines ranking, comparison arrows, and improvement highlighting.

Configuration

Override direction for specific metrics in config.toml:

[metrics]
minimize = ["forgetting_rate", "custom_loss"]
maximize = ["custom_score"]

Heuristic Fallback

Metrics not listed in config use name-based heuristics. These patterns are recognized as minimize (lower is better):

loss, error, perplexity, mse, mae, rmse, nll, cer, wer, fid, divergence

All other unlisted metrics default to maximize (higher is better).
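
The resolution order (explicit config first, then name heuristics, then the maximize default) can be sketched as a small function. metric_direction is hypothetical, and treating the listed patterns as case-insensitive substrings of the metric name is an assumption; the actual matching rule may be stricter.

```python
MINIMIZE_HINTS = (
    "loss", "error", "perplexity", "mse", "mae", "rmse",
    "nll", "cer", "wer", "fid", "divergence",
)


def metric_direction(name: str, minimize=(), maximize=()) -> str:
    """Return "min" or "max" for a metric name."""
    if name in minimize:
        return "min"
    if name in maximize:
        return "max"
    lowered = name.lower()
    if any(hint in lowered for hint in MINIMIZE_HINTS):
        return "min"   # heuristic fallback
    return "max"       # default for everything else


directions = {
    "train_loss": metric_direction("train_loss"),
    "final_acc": metric_direction("final_acc"),
    "forgetting_rate": metric_direction("forgetting_rate", minimize=["forgetting_rate"]),
    "power": metric_direction("power"),
}
```

Under the substring assumption, a name such as power is classified as "min" because it contains wer, which is one reason to list ambiguous metrics explicitly under [metrics].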


Full Usage Example

Assumes config.toml has [store] hierarchy = "benchmark > model > variant".

import numpy as np
from extract import Store

store = Store()

# Create experiment hierarchy
exp = store.experiment({
    "benchmark": "imagenet",
    "model": "resnet50",
    "variant": "lr-sweep"
})

# Context manager approach
with exp.run(config={"lr": 0.001, "bs": 32, "epochs": 50}, name="resnet50-lr1e3", total_steps=50) as run:
    for step in range(50):
        # Streaming curve points — drive the live chart, not the headline summary.
        run.curve(step=step, train_loss=2.3 - step * 0.04, train_acc=step * 0.018)
    # Headline metrics — appear in Summary panel and rankings.
    run.log(final_acc=0.92)
    run.log(arch="resnet50", optimizer="sgd")
    run.log_table("confusion_matrix", np.random.rand(1000, 1000), axes={"rows": "true", "cols": "predicted"})
    run.log_text("notes", "## Observations\nResNet50 with lr=1e-3 converges stably.")
    run.tag("sweep", "production-candidate")
    run.note("Best learning rate in sweep")
    run.todo("Try lr=0.005 next", priority=1)
    run.register_model("resnet50-imagenet", "v1.0", "/tmp/model.pt", framework="pytorch")
    run.derived_from(model="baseline-imagenet", version="v0.9")

# Direct call approach — run spans multiple scopes
run = exp.run(config={"lr": 0.005}, name="resnet50-lr5e3")
run.log(loss=2.1)
run.tag("sweep")
run.finish()          # explicit finalize (idempotent)

# Global TODO
store.todo("Write up results for paper", priority=2)

store.close()

Source Structure

rust/src/
├── main.rs          # entry point, CLI args, main loop
├── app.rs           # AppState, View/Focus enums, all application state
├── db.rs            # SQLite read-only access layer
├── model.rs         # data structs: Experiment, Run, ScalarMetric, Artifact, Model, etc.
├── config.rs        # TOML config parsing, theme/color handling
├── keys.rs          # keybinding constants and key matching (h/l=tab, arrows=nav/cycle)
├── event.rs         # event handling (Key, Tick, Resize)
├── artifact.rs      # NumPy table loading, CellValue handling
└── ui/
    ├── layout.rs    # main layout orchestrator + event dispatcher
    ├── tree.rs      # experiment tree navigator
    ├── detail.rs    # detail panel (Summary/Info tabs)
    ├── dashboard.rs # dashboard view
    ├── summary.rs   # renders runs, metrics, curves, tables
    ├── chart.rs     # line chart rendering
    ├── compare.rs   # compare view (side-by-side runs)
    ├── diff.rs      # diff view (config differences)
    ├── search.rs    # search popup
    ├── popup.rs     # run picker, run browser, delete confirm
    ├── selection.rs # selection window for marked runs
    ├── registry.rs  # model registry view
    ├── lineage.rs   # lineage DAG view
    ├── todo.rs      # TODO management view
    ├── help.rs      # help overlay
    ├── statusbar.rs # status bar
    ├── theme.rs     # theme color/style definitions
    └── heatmap.rs   # heatmap visualization

python/src/extract/
├── __init__.py      # public API: Store, Experiment, Run, sync
├── __main__.py      # CLI entry point (extract init / tui / sync)
├── store.py         # Store class: DB, hierarchy, experiment creation
├── experiment.py    # Experiment class: run creation, run listing
├── run.py           # Run class: logging, artifacts, models, lineage
├── metrics.py       # helpers: save_npy, load_npy, save_text
├── sync.py          # sync: rsync, tar archives, DB merging
└── mcp.py           # MCP server: 8 read-only tools over stdio