Skip to content

Fix chunk indexing for streams with differing start times#12

Merged
cboulay merged 4 commits into
devfrom
cboulay/chunk_indexing
Jun 12, 2026
Merged

Fix chunk indexing for streams with differing start times#12
cboulay merged 4 commits into
devfrom
cboulay/chunk_indexing

Conversation

@cboulay

@cboulay cboulay commented Jun 12, 2026

Copy link
Copy Markdown
Member

Summary

This branch fixes a crash and a silent data-correctness bug that affect NWB files containing streams that do not all start at the same time (for example CereLink Hub2 and NPLAY, where one begins several seconds after the other). Such streams were misaligned against each other on read, and in some cases the iterator crashed outright. The root cause was the same in two places: per-stream chunk offset tables were built relative to each stream's own first sample rather than the shared, file-wide chunk grid, and the rate-only time offset omitted the stream's start time.

The bugs

When a regular (rate-only) stream starts later than the file's overall start, NWBIterator built its chunk_offsets table from the stream's own first sample and used the computed first_chunk value only to truncate the table, never to delay emission. _build_chunk_messages_static then indexed every stream's table with the global chunk_ix starting at 0, so each stream's sample 0 was dumped into global chunk 0 regardless of when it actually started. This produced two failures. First, a temporal-placement bug: a late-starting stream was emitted several chunks too early and presented as simultaneous with streams that genuinely start at the file origin. Second, an offset bug: the rate-only branch computed ts_off + gain * start_idx, omitting the stream's t0, so its message time offsets were low by exactly the stream's start time. Because the truncated table was shorter than the file-wide n_chunks, iterating past the shorter stream's last entry also raised an IndexError and crashed the run.

The same t0 omission existed independently in NWBSlicer.read_by_index. Its rate-only branch reported ts_off + gain * start_idx, which disagrees with NWBClockDrivenProducer's own playback bookkeeping (file_t = t0 + idx / fs). Any stream with a non-zero starting_time read through that path was mis-timed by its start offset.

The fix

_preload now builds regular-stream chunk offsets on the shared global chunk grid, the same way the event-stream branch already did, by computing the sample index at each global chunk boundary and clamping to [0, n_samples]. Offset tables are therefore uniform length n_chunks for every stream, late-starting or early-ending streams contribute empty slices outside their span instead of being shifted, and chunk j now covers the same wall-clock window for all streams. _build_chunk_messages_static adds info.t0 to the rate-only time offset so a late-starting stream is timed against its true start, and retains a defensive bounds guard so a short offset table degrades gracefully instead of crashing the whole chunk. NWBSlicer.read_by_index gets the matching info.t0 correction in its rate-only branch; the timestamped branch already used absolute times and is unchanged.

Tests

Added test_late_starting_stream_stays_aligned, which builds a two-stream file where the second stream starts two seconds in and asserts it is emitted in the chunks matching its real time with correct offsets. Added test_ragged_stream_lengths_do_not_crash, which exercises the defensive bounds guard directly. Added test_read_by_index_offset_includes_starting_time, which checks the slicer offset for a stream with non-zero starting_time. Each new test was confirmed to fail on the pre-fix code and pass after the fix. The full suite passes (134 tests).

cboulay added 3 commits June 12, 2026 14:02
…bounds check at the top of the per-stream loop.
…art times

NWBIteratorState built each regular stream's chunk_offsets table from the
stream's own first sample and computed first_chunk only to truncate the
table, never to delay emission. _build_chunk_messages_static then indexed
those tables with the global chunk_ix, so every stream's sample 0 landed in
global chunk 0 regardless of when it actually started. A stream starting
later than the file origin (e.g. CereLink NPLAY vs Hub2) was thus emitted
several chunks too early, with time offsets missing its start time, and its
shorter offset table crashed with IndexError once chunk_ix passed its end.

- _preload now builds regular-stream offsets on the shared global chunk grid
  (matching the event-stream branch), clamped to [0, n_samples]. Tables are
  uniform length n_chunks and late/early streams contribute empty slices
  outside their span instead of being shifted.
- _build_chunk_messages_static adds info.t0 to the rate-only time offset so a
  late-starting stream is timed against its true start.
- Keep a defensive bounds guard against a short offset table.

Tests: add late-start alignment regression and a defensive short-table test.
read_by_index reported a rate-only stream's time offset as ts_off+gain*idx,
omitting the stream's starting_time. Sample indices are 0-based from the
stream start, so a stream with non-zero starting_time was mis-timed — and
the value disagreed with NWBClockDrivenProducer's own file_t = t0 + idx/fs
bookkeeping. Add info.t0, mirroring the iterator's chunk-offset fix. The
timestamped branch already uses absolute times and is unchanged.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes misalignment and incorrect time offsets when reading NWB files containing multiple continuous streams that start at different times, and prevents an IndexError crash when per-stream chunk tables are shorter than the file-wide chunk grid.

Changes:

  • Build rate-only stream chunk_offsets on the shared (file-wide) chunk grid so late-starting streams emit empty slices before their start rather than being shifted to chunk 0.
  • Include each stream’s t0 (starting_time) in rate-only time-offset calculations in both NWBAxisArrayIterator chunk emission and NWBSlicer.read_by_index.
  • Add regression tests covering late-starting stream alignment, ragged offset-table bounds guarding, and slicer offsets including starting_time.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
src/ezmsg/nwb/iterator.py Fixes chunk offset table construction and time-offset computation; adds a defensive guard against short offset tables.
src/ezmsg/nwb/slicer.py Fixes read_by_index time offset for rate-only streams to include starting_time (t0).
tests/test_iterator.py Adds regression tests for late-start alignment and for avoiding crashes with truncated chunk_offsets.
tests/test_slicer.py Adds regression test ensuring read_by_index offset includes a non-zero starting_time.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/ezmsg/nwb/iterator.py
Building regular-stream chunk offsets with np.round assigned each global
chunk boundary to the nearest sample. When chunk_dur is not an integer
multiple of the sample period, that pulls a pre-boundary sample into the
chunk and disagrees with the event/timestamped paths, which use
searchsorted(side="left"). Use ceil with a small epsilon to match
first-sample-at-or-after-boundary semantics and tolerate float drift.

Add a regression test asserting the offsets equal searchsorted(side="left")
for a non-integer samples-per-chunk stream (256 Hz, 0.1 s chunks).

@kylmcgr kylmcgr left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cboulay cboulay merged commit 520c07f into dev Jun 12, 2026
14 checks passed
@cboulay cboulay deleted the cboulay/chunk_indexing branch June 12, 2026 22:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants