Ensure Trackio works on weird cluster filesystems (Lustre, FSx, GPFS, NFS, WekaFS)

## Context

[Reported via Slack](https://github.com/gradio-app/trackio/issues/555#issue-4473559792) by a user training on a cluster with `/fsx` (WekaFS) as the local Trackio cache:

> I've been hitting a consistent `SIGBUS` crash during distributed training with trackio on our cluster. I traced the python trace down to a known SQLite limitation on distributed file systems: a race condition where a WAL checkpoint calls `ftruncate` while another thread is actively reading via `mmap`. Because cache invalidation isn't atomic across the network for WekaFS, the kernel throws a SIGBUS when the mmap reader hits the truncated page.

This isn't WekaFS-specific — it's a known SQLite caveat on every distributed/network filesystem (Lustre, FSx, GPFS, NFS, CephFS, …). Combine **WAL journal mode** + **memory-mapped reads** + **non-coherent caches** and you get SIGBUS or worse. SQLite's own docs warn against running over network filesystems for exactly this reason.

The user worked around it by passing `--trackio_space_id ...` (logging directly to the Space, bypassing the local SQLite cache). But the local cache path is the default, and cluster filesystems are the default storage for cluster training jobs, so we'd like Trackio to be safe out of the box there.

Currently, in `trackio/sqlite_storage.py:51`:

- `journal_mode` defaults to `WAL` locally, `DELETE` on Spaces. Overridable via `TRACKIO_SQLITE_JOURNAL_MODE`.
- `synchronous = NORMAL`, `temp_store = MEMORY`, `cache_size = -20000`.
- `locking_mode = EXCLUSIVE` on Spaces.

What we don't set:

- `PRAGMA mmap_size` — left at SQLite's compile-time default (typically 0 on most builds, but non-zero on some). On a distributed FS, any non-zero value is unsafe.
- No detection of "this path is on a network filesystem" to auto-pick safe pragmas.
- No documentation telling cluster users which knobs to flip.
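To check what a given build actually does, the effective values can be read back from a fresh connection. This is a minimal diagnostic sketch using only the standard-library `sqlite3` module; the helper name `effective_pragmas` and the probe filename are made up for illustration and are not part of Trackio.

```python
import sqlite3
import tempfile
from pathlib import Path


def effective_pragmas(db_path: str) -> dict:
    """Return the pragma values a fresh connection to db_path reports."""
    conn = sqlite3.connect(db_path)
    try:
        report = {}
        for name in ("journal_mode", "mmap_size", "synchronous", "locking_mode"):
            row = conn.execute(f"PRAGMA {name}").fetchone()
            # Defensive: record None if a pragma yields no row on this build or VFS.
            report[name] = row[0] if row is not None else None
        return report
    finally:
        conn.close()


if __name__ == "__main__":
    # Probe a throwaway database so no real Trackio data is touched.
    with tempfile.TemporaryDirectory() as tmp:
        print(effective_pragmas(str(Path(tmp) / "probe.db")))
```

Running this on a login node and on a compute node would show whether the SQLite build linked into that Python reports a non-zero `mmap_size` by default.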

What we could do:

1. **Default `PRAGMA mmap_size=0`.** Memory-mapped I/O wins ~nothing for Trackio's workload (small, write-heavy, low-locality) and is the direct trigger for the SIGBUS class of bugs on distributed FSes. Make `0` the default everywhere; expose `TRACKIO_SQLITE_MMAP_SIZE` for users who want to opt back in.
2. **Add `TRACKIO_SQLITE_*` env knobs for the other pragmas** we care about — at least `JOURNAL_MODE` (exists), `MMAP_SIZE` (new), `SYNCHRONOUS` (new), `LOCKING_MODE` (new), `TEMP_STORE` (new). Document them in `environment_variables.md`.
3. **Document a recommended cluster setup** in the docs: either point `TRACKIO_DIR` to a node-local disk (e.g. `/tmp` or `$SLURM_TMPDIR`), or set `TRACKIO_SQLITE_JOURNAL_MODE=DELETE` + `TRACKIO_SQLITE_MMAP_SIZE=0` if local disk isn't available. Add a one-liner heuristic if we want to be proactive (e.g. warn if `TRACKIO_DIR` resolves under a path matching `^/(fsx|lustre|gpfs|nfs|cephfs|weka)` and the user hasn't overridden the pragmas).
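A rough sketch of how items 1 to 3 could fit together, assuming the env knobs are named as proposed above. Only `TRACKIO_SQLITE_JOURNAL_MODE` exists today; the other variables, the function names (`resolve_pragmas`, `apply_pragmas`, `warn_if_network_fs`), the allow-lists, and the choice to reject negative `mmap_size` values are all hypothetical. It models the local defaults only and is not the code in `sqlite_storage.py`.

```python
import os
import re
import sqlite3
import warnings

# Hypothetical allow-lists used to validate env values before interpolation.
_ALLOWED = {
    "journal_mode": {"DELETE", "TRUNCATE", "PERSIST", "MEMORY", "WAL", "OFF"},
    "synchronous": {"OFF", "NORMAL", "FULL", "EXTRA"},
    "locking_mode": {"NORMAL", "EXCLUSIVE"},
    "temp_store": {"DEFAULT", "FILE", "MEMORY"},
}
# Local defaults as described in this issue (the Spaces defaults are not modelled).
_DEFAULTS = {"journal_mode": "WAL", "synchronous": "NORMAL", "temp_store": "MEMORY"}
# Path heuristic taken from item 3.
_NETWORK_FS_PATTERN = re.compile(r"^/(fsx|lustre|gpfs|nfs|cephfs|weka)")


def resolve_pragmas(env=None) -> dict:
    """Merge the defaults with any TRACKIO_SQLITE_* overrides found in env."""
    env = os.environ if env is None else env
    pragmas = dict(_DEFAULTS)
    for name, allowed in _ALLOWED.items():
        key = f"TRACKIO_SQLITE_{name.upper()}"
        raw = env.get(key)
        if raw is None:
            continue
        value = raw.strip().upper()
        if value not in allowed:
            raise ValueError(f"Unsupported value {raw!r} for {key}")
        pragmas[name] = value
    # Proposed default: 0 disables memory-mapped I/O everywhere.
    mmap_size = int(env.get("TRACKIO_SQLITE_MMAP_SIZE", "0"))
    if mmap_size < 0:
        raise ValueError("TRACKIO_SQLITE_MMAP_SIZE must be >= 0")
    pragmas["mmap_size"] = mmap_size
    return pragmas


def apply_pragmas(conn: sqlite3.Connection, pragmas: dict) -> None:
    """Apply validated pragmas to an open connection."""
    for name, value in pragmas.items():
        # Values were validated in resolve_pragmas, so interpolation is safe here.
        conn.execute(f"PRAGMA {name} = {value}")


def warn_if_network_fs(trackio_dir: str, env=None) -> bool:
    """Warn when trackio_dir looks like a cluster FS and no pragma override is set."""
    env = os.environ if env is None else env
    overridden = any(
        key in env
        for key in ("TRACKIO_SQLITE_JOURNAL_MODE", "TRACKIO_SQLITE_MMAP_SIZE")
    )
    resolved = os.path.realpath(trackio_dir)
    if _NETWORK_FS_PATTERN.match(resolved) and not overridden:
        warnings.warn(
            f"{resolved} looks like a network filesystem; consider a node-local "
            "TRACKIO_DIR or TRACKIO_SQLITE_JOURNAL_MODE=DELETE with "
            "TRACKIO_SQLITE_MMAP_SIZE=0."
        )
        return True
    return False
```

A prefix match is only a heuristic: it misses mounts under other names and can flag local paths that happen to share a prefix, so it is better suited to a warning than to silently switching pragmas.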

## Intersection with #554 (Bucket-as-source-of-truth)

The medium-term architecture in #554 makes this whole class of bugs largely irrelevant for the *write* path: clients buffer in memory and flush parquet shards to a Bucket. There is no shared SQLite file across nodes, no WAL checkpoint contending with another process's mmap reader.


## Long-term: retire the hazard class

The safe-defaults work above is still the right tactical step — a few-line change that unblocks cluster users in days. The strategic follow-up lives in #554: replace SQLite entirely with append-only parquet shards queried via DuckDB, same layout for local and Bucket-backed projects. With no WAL, no `mmap`, no `ftruncate` racing readers, the SIGBUS class of bugs on distributed filesystems goes away by construction rather than by carefully-chosen PRAGMAs. The two don't conflict — #555 buys time, #554 finishes the job.

