Ensure Trackio works on weird cluster filesystems (Lustre, FSx, GPFS, NFS, WekaFS) #555

@abidlabs

Context

Reported via Slack by a user training on a cluster with /fsx (WekaFS) as the local Trackio cache:

> I've been hitting a consistent SIGBUS crash during distributed training with trackio on our cluster. I traced the python trace down to a known SQLite limitation on distributed file systems: a race condition where a WAL checkpoint calls ftruncate while another thread is actively reading via mmap. Because cache invalidation isn't atomic across the network for WekaFS, the kernel throws a SIGBUS when the mmap reader hits the truncated page.

This isn't WekaFS-specific — it's a known SQLite caveat on every distributed/network filesystem (Lustre, FSx, GPFS, NFS, CephFS, …). Combine WAL journal mode + memory-mapped reads + non-coherent caches and you get SIGBUS or worse. SQLite's own docs warn against running over network filesystems for exactly this reason.

The user worked around it by passing --trackio_space_id ... (logging directly to the Space, bypassing the local SQLite cache). But the local cache path is the default, and cluster filesystems are the default storage for cluster training jobs, so we'd like Trackio to be safe out of the box there.

Currently, in `trackio/sqlite_storage.py:51`:

  • journal_mode defaults to WAL locally, DELETE on Spaces. Overridable via TRACKIO_SQLITE_JOURNAL_MODE.
  • synchronous = NORMAL, temp_store = MEMORY, cache_size = -20000.
  • locking_mode = EXCLUSIVE on Spaces.

What we don't set:

  • PRAGMA mmap_size — left at SQLite's compile-time default (typically 0 on most builds, but non-zero on some). On a distributed FS, any non-zero value is unsafe.
  • No detection of "this path is on a network filesystem" to auto-pick safe pragmas.
  • No documentation telling cluster users which knobs to flip.
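
To check what a given Python/SQLite build actually uses when mmap_size is left unset, something like the following could be run against a scratch database. This is a minimal sketch using only the standard-library `sqlite3` module; `effective_pragmas` is a hypothetical helper name, not part of Trackio.

```python
import sqlite3


def effective_pragmas(db_path):
    """Report the journal mode and mmap size SQLite uses for db_path.

    Hypothetical helper for diagnosis only; not part of Trackio.
    """
    conn = sqlite3.connect(db_path)
    try:
        journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
        # On builds compiled without mmap support the pragma may return no row.
        row = conn.execute("PRAGMA mmap_size").fetchone()
        mmap_size = row[0] if row is not None else None
        return {"journal_mode": journal_mode, "mmap_size": mmap_size}
    finally:
        conn.close()


if __name__ == "__main__":
    import os
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        print(effective_pragmas(os.path.join(tmp, "scratch.db")))
```

A non-zero `mmap_size` in the output would mean that build memory-maps database pages by default, which is the unsafe case described above.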

What we could do:

  1. Default PRAGMA mmap_size=0. Memory-mapped I/O wins ~nothing for Trackio's workload (small, write-heavy, low-locality) and is the direct trigger for the SIGBUS class of bugs on distributed FSes. Make 0 the default everywhere; expose TRACKIO_SQLITE_MMAP_SIZE for users who want to opt back in.
  2. Add TRACKIO_SQLITE_* env knobs for the other pragmas we care about — at least JOURNAL_MODE (exists), MMAP_SIZE (new), SYNCHRONOUS (new), LOCKING_MODE (new), TEMP_STORE (new). Document them in environment_variables.md.
  3. Document a recommended cluster setup in the docs: either point TRACKIO_DIR to a node-local disk (e.g. /tmp or $SLURM_TMPDIR), or set TRACKIO_SQLITE_JOURNAL_MODE=DELETE + TRACKIO_SQLITE_MMAP_SIZE=0 if local disk isn't available. Add a one-liner heuristic if we want to be proactive (e.g. warn if TRACKIO_DIR resolves under a path matching ^/(fsx|lustre|gpfs|nfs|cephfs|weka) and the user hasn't overridden the pragmas).
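
A rough sketch of how the three proposals could fit together is below. It is an illustration under assumptions, not the actual code in `trackio/sqlite_storage.py`: the function names are hypothetical, only TRACKIO_SQLITE_JOURNAL_MODE exists today, and the local default for locking_mode (NORMAL) is assumed from SQLite's own default. The path regex is the one suggested in item 3.

```python
import os
import re
import sqlite3

# Proposed defaults (assumptions): mmap off everywhere, other values as listed above.
_DEFAULTS = {
    "JOURNAL_MODE": "WAL",
    "MMAP_SIZE": "0",
    "SYNCHRONOUS": "NORMAL",
    "LOCKING_MODE": "NORMAL",  # assumed local default; Spaces use EXCLUSIVE
    "TEMP_STORE": "MEMORY",
}
# Values SQLite accepts for each keyword pragma.
_ALLOWED = {
    "JOURNAL_MODE": {"DELETE", "TRUNCATE", "PERSIST", "MEMORY", "WAL", "OFF"},
    "SYNCHRONOUS": {"OFF", "NORMAL", "FULL", "EXTRA"},
    "LOCKING_MODE": {"NORMAL", "EXCLUSIVE"},
    "TEMP_STORE": {"DEFAULT", "FILE", "MEMORY"},
}
# Heuristic from item 3; deliberately loose, so it can have false positives.
_CLUSTER_FS = re.compile(r"^/(fsx|lustre|gpfs|nfs|cephfs|weka)")


def resolve_pragmas(env=None):
    """Read TRACKIO_SQLITE_* overrides from env and validate them."""
    env = os.environ if env is None else env
    resolved = {}
    for name, default in _DEFAULTS.items():
        raw = str(env.get(f"TRACKIO_SQLITE_{name}", default)).strip().upper()
        if name == "MMAP_SIZE":
            resolved[name] = int(raw)  # raises ValueError on non-integers
        elif raw in _ALLOWED[name]:
            resolved[name] = raw
        else:
            raise ValueError(f"Invalid TRACKIO_SQLITE_{name}: {raw!r}")
    return resolved


def apply_pragmas(conn, pragmas):
    """Apply validated pragmas; values are whitelisted, so interpolation is safe."""
    for name, value in pragmas.items():
        conn.execute(f"PRAGMA {name.lower()} = {value}")


def looks_like_cluster_fs(path):
    """Return True if path sits under a prefix that suggests a network filesystem."""
    return bool(_CLUSTER_FS.match(os.path.abspath(path)))
```

With helpers like these, the connection setup could call `apply_pragmas(conn, resolve_pragmas())` and emit a warning when `looks_like_cluster_fs(trackio_dir)` is true and neither TRACKIO_SQLITE_JOURNAL_MODE nor TRACKIO_SQLITE_MMAP_SIZE is set in the environment.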

Intersection with #554 (Bucket-as-source-of-truth)

The medium-term architecture in #554 makes this whole class of bugs largely irrelevant for the write path: clients buffer in memory and flush parquet shards to a Bucket. There is no shared SQLite file across nodes, no WAL checkpoint contending with another process's mmap reader.

Long-term: retire the hazard class

The safe-defaults work above is still the right tactical step — a few-line change that unblocks cluster users in days. The strategic follow-up lives in #554: replace SQLite entirely with append-only parquet shards queried via DuckDB, same layout for local and Bucket-backed projects. With no WAL, no mmap, no ftruncate racing readers, the SIGBUS class of bugs on distributed filesystems goes away by construction rather than by carefully-chosen PRAGMAs. The two don't conflict — #555 buys time, #554 finishes the job.
