Architecture change: make the Bucket the source of truth, Spaces should always be static

The idea is for the Trackio Client to write logs **directly to Buckets** (in Parquet format) instead of going through the Space, and for Spaces to always be static and only read the Parquet data from the Bucket. Besides significant code deduplication, this would have several benefits, discussed below:

## The Real Motivation

Today, writes go through the Space's HTTP API (`/bulk_log`), and the Space owns a SQLite file that it flushes to the Bucket periodically. That puts the Space being up on the critical path for log durability, which causes real problems:

- If the Space is paused/cold/broken at the end of a training run, logs sit in a local pending buffer and only drain on the next `trackio.init()` for that project (#544). On ephemeral infra (spot, one-shot CI, scratch boxes) the next `init()` never happens and the data is lost with the machine.
- Cold starts (~1 min on free hardware) and the 30s `finish()` join often miss each other, so we ship users a warning that boils down to "your logs might never sync."

Buckets are the surface HF built for this workload. They don't carry the `<100k files`, `<10k per folder`, `<100 per commit` recommendations that bite git-based repos, and there's no documented per-second cap. Going bucket-first is strictly better on rate limits, and removes the Space as a write dependency.

## Proposed shape

**Bucket = canonical store. Space = read-cache + UI + alert/webhook evaluator.**

Clients write append-only parquet shards directly to the bucket:

```
<project>/runs/<run_id>/
  meta.json
  shards/<writer_id>/<seq>.parquet
  system/<writer_id>/<seq>.parquet
  media/<sha256>.<ext>           (content-addressed)
```
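As a rough sketch of how a client might build keys for this layout: the helper names below, the 12-character hex `writer_id`, and the zero-padded `<seq>` are all assumptions made for illustration, not part of the proposal. Zero-padding is chosen here only so that lexicographic list-by-prefix order matches numeric shard order.

```python
import hashlib
import uuid


def new_writer_id() -> str:
    """Random per-process writer id (hypothetical format: 12 hex chars)."""
    return uuid.uuid4().hex[:12]


def shard_key(project: str, run_id: str, writer_id: str, seq: int,
              kind: str = "shards") -> str:
    """Key for one immutable shard; `kind` is "shards" or "system"."""
    if kind not in ("shards", "system"):
        raise ValueError(f"unknown shard kind: {kind!r}")
    # Assumption: zero-pad seq so string order equals numeric order.
    return f"{project}/runs/{run_id}/{kind}/{writer_id}/{seq:08d}.parquet"


def media_key(project: str, run_id: str, data: bytes, ext: str) -> str:
    """Content-addressed media key: identical bytes map to the same key."""
    digest = hashlib.sha256(data).hexdigest()
    return f"{project}/runs/{run_id}/media/{digest}.{ext}"
```

Because the media key depends only on the file contents, uploading the same image twice from different writers would land on the same object rather than creating a duplicate.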

- Each client process picks a random `<writer_id>` at `init()`, owns its own subtree → no key collisions, no coordination, distributed training works for free.
- Shards are immutable, no manifest needed — list-by-prefix is enough.
- Client buffers `log()` calls in memory and flushes a shard on a time/size threshold (e.g. 10s). A training job at 10 batches/sec produces ~8.6k well-sized shards/day (86,400 s / 10 s) instead of ~864k tiny PUTs.
- Space polls shards above a per-writer watermark, merges into its local view, fires alerts/webhooks on observed new rows (dedupe by `log_id`, which is already a uuid).
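The buffering and watermark behaviour described in the bullets above could look roughly like the following. This is a minimal sketch under stated assumptions: `ShardBuffer`, `shards_above_watermark`, the `sink` callable, and the thresholds are hypothetical names and values, and the sink stands in for "serialize rows to Parquet and PUT to the bucket" so the example stays stdlib-only.

```python
import time
from typing import Callable


class ShardBuffer:
    """Buffers log rows in memory and emits one immutable shard per flush."""

    def __init__(self, prefix: str, sink: Callable[[str, list], None],
                 max_rows: int = 1000, max_age_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self.prefix = prefix        # e.g. "<project>/runs/<run_id>/shards/<writer_id>"
        self.sink = sink            # hypothetical: would write Parquet to the bucket
        self.max_rows = max_rows
        self.max_age_s = max_age_s
        self.clock = clock
        self.rows: list = []
        self.seq = 0
        self.opened_at = 0.0

    def log(self, row: dict) -> None:
        if not self.rows:
            self.opened_at = self.clock()  # age is measured from the first buffered row
        self.rows.append(row)
        too_big = len(self.rows) >= self.max_rows
        too_old = self.clock() - self.opened_at >= self.max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if not self.rows:
            return
        key = f"{self.prefix}/{self.seq:08d}.parquet"
        self.sink(key, self.rows)
        self.rows = []
        self.seq += 1


def shards_above_watermark(keys: list, watermarks: dict) -> list:
    """Reader side: keep only shards whose seq exceeds the writer's watermark."""
    fresh = []
    for key in sorted(keys):
        writer_id, filename = key.split("/")[-2:]
        seq = int(filename.split(".")[0])
        if seq > watermarks.get(writer_id, -1):
            fresh.append(key)
    return fresh
```

In this sketch the age check only runs when `log()` is called, so a real client would also need a background timer and a final `flush()` at `finish()`. The reader would still dedupe by `log_id` after merging, since the watermark only limits which shards get re-read.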



## Refinement: drop SQLite entirely (local and remote)

The cleanest version of this is to not keep SQLite as a local hot view at all — use the same parquet-shard layout for both local and Bucket-backed projects, queried via DuckDB. Local mode points the writer at `~/.cache/huggingface/trackio/<project>/`, remote at a Bucket; identical code path. That collapses local and remote onto one storage primitive, deletes the connection/journal/locking plumbing in `sqlite_storage.py`, and makes cluster filesystems (Lustre, FSx, GPFS, NFS, WekaFS) safe by construction rather than by carefully-chosen PRAGMAs — see #555. Costs: DuckDB as a runtime dep (~10 MB, replaces stdlib `sqlite3`), mutations reshape slightly (`delete_run` = remove a directory, `rename_run` = rewrite `meta.json`), and a one-shot migration from existing `~/.cache/huggingface/trackio/*.db` files.
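To illustrate how the mutations reshape on a plain directory layout, here is a small sketch for local mode. The function names mirror the ones mentioned above, but their signatures, the `root` argument, and the `name` field inside `meta.json` are assumptions for illustration only.

```python
import json
import os
import shutil
from pathlib import Path


def run_dir(root: str, project: str, run_id: str) -> Path:
    """Directory holding one run: <root>/<project>/runs/<run_id>/."""
    return Path(root) / project / "runs" / run_id


def delete_run(root: str, project: str, run_id: str) -> None:
    """Deleting a run is just removing its directory (shards, system, media)."""
    shutil.rmtree(run_dir(root, project, run_id))


def rename_run(root: str, project: str, run_id: str, new_name: str) -> None:
    """Renaming rewrites meta.json only; shards stay untouched."""
    meta_path = run_dir(root, project, run_id) / "meta.json"
    meta = json.loads(meta_path.read_text())
    meta["name"] = new_name  # hypothetical field holding the display name
    # Write to a temp file and swap it in, so readers never see a partial file.
    tmp_path = meta_path.with_name("meta.json.tmp")
    tmp_path.write_text(json.dumps(meta))
    os.replace(tmp_path, meta_path)
```

Keeping the display name in `meta.json` rather than in the path means a rename never has to move or rewrite any shard, which fits the immutable-shard design.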

