The idea is for the Trackio Client to write logs directly to Buckets (in Parquet format) instead of going through the Space, and for Spaces to always be static and only read from this Parquet Bucket. Besides the significant code deduplication, this would have several benefits as discussed below:
The Real Motivation
Today writes go through the Space's HTTP API (/bulk_log), and the Space owns a SQLite file that flushes to Bucket periodically. The Space being up is on the critical path for log durability. That causes real problems:
If the Space is paused/cold/broken at the end of a training run, logs sit in a local pending buffer and only drain on the next trackio.init() for that project (see #544, "Trackio logs may never sync if Space is unavailable at run end (no retry unless next init)"). On ephemeral infra (spot, one-shot CI, scratch boxes) the next init() never happens and the data is lost with the machine.
Cold starts (~1 min on free hardware) and the 30s finish() join often miss each other, so we ship users a warning that boils down to "your logs might never sync."
Buckets are the surface HF built for this workload. They don't carry the <100k files, <10k per folder, <100 per commit recommendations that bite git-based repos, and there's no documented per-second cap. Going bucket-first is strictly better on rate limits, and removes the Space as a write dependency.
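Proposed shape
Bucket = canonical store. Space = read-cache + UI + alert/webhook evaluator.
Clients write append-only parquet shards directly to the bucket:
Each client process picks a random <writer_id> at init(), owns its own subtree → no key collisions, no coordination, distributed training works for free.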
Shards are immutable, no manifest needed — list-by-prefix is enough.
Client buffers log() calls in memory and flushes a shard on a time/size threshold (e.g. 10s). A training job at 10 batches/sec then produces ~8.6k well-sized shards/day at a 10s flush interval (~2.9k at 30s) instead of ~864k tiny PUTs.
Space polls shards above a per-writer watermark, merges into its local view, fires alerts/webhooks on observed new rows (dedupe by log_id, which is already a uuid).
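The write and read paths described above can be sketched roughly as follows. This is a minimal illustration, not Trackio's implementation: the Bucket is stood in for by a plain dict, rows are kept as Python dicts rather than serialized to Parquet, and the class names, key layout (`<project>/<writer_id>/<seq>.parquet`), and thresholds are all assumptions made for the example.

```python
import time
import uuid


class ShardWriter:
    """Buffers log() calls and flushes immutable shards under its own prefix."""

    def __init__(self, store, project, max_rows=1000, max_age_s=10.0):
        self.store = store  # hypothetical Bucket stand-in: key -> list of rows
        # A random writer_id gives each process its own subtree, so two
        # writers can never produce the same key.
        self.prefix = f"{project}/{uuid.uuid4().hex}/"
        self.max_rows = max_rows
        self.max_age_s = max_age_s
        self.buffer = []
        self.seq = 0
        self.last_flush = time.monotonic()

    def log(self, metrics):
        self.buffer.append({"log_id": str(uuid.uuid4()), **metrics})
        too_big = len(self.buffer) >= self.max_rows
        too_old = time.monotonic() - self.last_flush >= self.max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self.buffer:
            # Zero-padded sequence numbers sort lexicographically in write order.
            key = f"{self.prefix}{self.seq:08d}.parquet"
            self.store[key] = self.buffer  # a real writer would upload Parquet here
            self.seq += 1
        self.buffer = []  # start a fresh list; the stored shard is never mutated
        self.last_flush = time.monotonic()


class ShardReader:
    """Polls shards above a per-writer watermark and dedupes rows by log_id."""

    def __init__(self, store, project):
        self.store = store
        self.project_prefix = f"{project}/"
        self.watermarks = {}  # writer prefix -> last shard key consumed
        self.seen = set()     # log_ids already merged

    def poll(self):
        new_rows = []
        # List-by-prefix: no manifest is consulted.
        keys = sorted(k for k in self.store if k.startswith(self.project_prefix))
        for key in keys:
            writer = key.rsplit("/", 1)[0]
            if key <= self.watermarks.get(writer, ""):
                continue  # already consumed
            for row in self.store[key]:
                if row["log_id"] not in self.seen:
                    self.seen.add(row["log_id"])
                    new_rows.append(row)
            self.watermarks[writer] = key
        return new_rows


store = {}
a = ShardWriter(store, "demo", max_rows=2)
b = ShardWriter(store, "demo", max_rows=2)
for step in range(4):
    a.log({"step": step, "loss": 1.0 / (step + 1)})
b.log({"step": 0, "loss": 0.5})
b.flush()  # e.g. called from finish()

reader = ShardReader(store, "demo")
print(len(reader.poll()), "new rows, then", len(reader.poll()))
```

Because each writer only ever adds keys under its own prefix and the reader only tracks the highest key it has seen per prefix, neither side needs locks or a shared manifest. Alerts and webhooks would fire on the rows returned by `poll()`.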
Refinement: drop SQLite entirely (local and remote)
The cleanest version of this is to not keep SQLite as a local hot view at all — use the same parquet-shard layout for both local and Bucket-backed projects, queried via DuckDB. Local mode points the writer at ~/.cache/huggingface/trackio/<project>/, remote at a Bucket; identical code path. That collapses local and remote onto one storage primitive, deletes the connection/journal/locking plumbing in sqlite_storage.py, and makes cluster filesystems (Lustre, FSx, GPFS, NFS, WekaFS) safe by construction rather than by carefully-chosen PRAGMAs — see #555. Costs: DuckDB as a runtime dep (~10 MB, replaces stdlib sqlite3), mutations reshape slightly (delete_run = remove a directory, rename_run = rewrite meta.json), and a one-shot migration from existing ~/.cache/huggingface/trackio/*.db files.
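To illustrate how the mutations might reshape once SQLite is gone, here is a small sketch that assumes one directory per run containing a meta.json next to its shards. The directory layout, the `name` field, and the function signatures are hypothetical; only the mapping (delete_run = remove a directory, rename_run = rewrite meta.json) comes from the proposal.

```python
import json
import shutil
import tempfile
from pathlib import Path


def rename_run(project_dir: Path, run_id: str, new_name: str) -> None:
    """Rewrite the run's meta.json; shards are left untouched."""
    meta_path = project_dir / run_id / "meta.json"
    meta = json.loads(meta_path.read_text())
    meta["name"] = new_name  # "name" is an assumed field for this sketch
    # Write to a temp file and swap it in, so a reader never sees a
    # half-written meta.json.
    tmp_path = meta_path.with_name("meta.json.tmp")
    tmp_path.write_text(json.dumps(meta))
    tmp_path.replace(meta_path)


def delete_run(project_dir: Path, run_id: str) -> None:
    """Remove the run's directory, which takes its shards and metadata with it."""
    shutil.rmtree(project_dir / run_id)


# Throwaway project directory standing in for ~/.cache/huggingface/trackio/<project>/
root = Path(tempfile.mkdtemp())
run_dir = root / "run-1"
run_dir.mkdir()
(run_dir / "meta.json").write_text(json.dumps({"name": "baseline"}))

rename_run(root, "run-1", "baseline-v2")
print(json.loads((run_dir / "meta.json").read_text()))
```

For a Bucket-backed project the same two operations would become a prefix delete and a single object overwrite, which is what keeps the local and remote code paths identical.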