Fix chunk indexing for streams with differing start times by cboulay · Pull Request #12 · ezmsg-org/ezmsg-nwb

cboulay · 2026-06-12T18:53:46Z

Summary

This branch fixes a crash and a silent data-correctness bug that affect NWB files containing streams that do not all start at the same time (for example CereLink Hub2 and NPLAY, where one begins several seconds after the other). Such streams were misaligned against each other on read, and in some cases the iterator crashed outright. The root cause was the same in two places: per-stream chunk offset tables were built relative to each stream's own first sample rather than the shared, file-wide chunk grid, and the rate-only time offset omitted the stream's start time.

The bugs

When a regular (rate-only) stream starts later than the file's overall start, NWBIterator built its chunk_offsets table from the stream's own first sample and used the computed first_chunk value only to truncate the table, never to delay emission. _build_chunk_messages_static then indexed every stream's table with the global chunk_ix starting at 0, so each stream's sample 0 was dumped into global chunk 0 regardless of when it actually started. This produced two failures. First, a temporal-placement bug: a late-starting stream was emitted several chunks too early and presented as simultaneous with streams that genuinely start at the file origin. Second, an offset bug: the rate-only branch computed ts_off + gain * start_idx, omitting the stream's t0, so its message time offsets were low by exactly the stream's start time. Because the truncated table was shorter than the file-wide n_chunks, iterating past the shorter stream's last entry also raised an IndexError and crashed the run.

The same t0 omission existed independently in NWBSlicer.read_by_index. Its rate-only branch reported ts_off + gain * start_idx, which disagrees with NWBClockDrivenProducer's own playback bookkeeping (file_t = t0 + idx / fs). Any stream with a non-zero starting_time read through that path was mis-timed by its start offset.

The fix

_preload now builds regular-stream chunk offsets on the shared global chunk grid, the same way the event-stream branch already did, by computing the sample index at each global chunk boundary and clamping to [0, n_samples]. Offset tables are therefore uniform length n_chunks for every stream, late-starting or early-ending streams contribute empty slices outside their span instead of being shifted, and chunk j now covers the same wall-clock window for all streams. _build_chunk_messages_static adds info.t0 to the rate-only time offset so a late-starting stream is timed against its true start, and retains a defensive bounds guard so a short offset table degrades gracefully instead of crashing the whole chunk. NWBSlicer.read_by_index gets the matching info.t0 correction in its rate-only branch; the timestamped branch already used absolute times and is unchanged.

Tests

Added test_late_starting_stream_stays_aligned, which builds a two-stream file where the second stream starts two seconds in and asserts it is emitted in the chunks matching its real time with correct offsets. Added test_ragged_stream_lengths_do_not_crash, which exercises the defensive bounds guard directly. Added test_read_by_index_offset_includes_starting_time, which checks the slicer offset for a stream with non-zero starting_time. Each new test was confirmed to fail on the pre-fix code and pass after the fix. The full suite passes (134 tests).

…bounds check at the top of the per-stream loop.

…art times NWBIteratorState built each regular stream's chunk_offsets table from the stream's own first sample and computed first_chunk only to truncate the table, never to delay emission. _build_chunk_messages_static then indexed those tables with the global chunk_ix, so every stream's sample 0 landed in global chunk 0 regardless of when it actually started. A stream starting later than the file origin (e.g. CereLink NPLAY vs Hub2) was thus emitted several chunks too early, with time offsets missing its start time, and its shorter offset table crashed with IndexError once chunk_ix passed its end. - _preload now builds regular-stream offsets on the shared global chunk grid (matching the event-stream branch), clamped to [0, n_samples]. Tables are uniform length n_chunks and late/early streams contribute empty slices outside their span instead of being shifted. - _build_chunk_messages_static adds info.t0 to the rate-only time offset so a late-starting stream is timed against its true start. - Keep a defensive bounds guard against a short offset table. Tests: add late-start alignment regression and a defensive short-table test.

read_by_index reported a rate-only stream's time offset as ts_off+gain*idx, omitting the stream's starting_time. Sample indices are 0-based from the stream start, so a stream with non-zero starting_time was mis-timed — and the value disagreed with NWBClockDrivenProducer's own file_t = t0 + idx/fs bookkeeping. Add info.t0, mirroring the iterator's chunk-offset fix. The timestamped branch already uses absolute times and is unchanged.

Copilot

Pull request overview

This PR fixes misalignment and incorrect time offsets when reading NWB files containing multiple continuous streams that start at different times, and prevents an IndexError crash when per-stream chunk tables are shorter than the file-wide chunk grid.

Changes:

Build rate-only stream chunk_offsets on the shared (file-wide) chunk grid so late-starting streams emit empty slices before their start rather than being shifted to chunk 0.
Include each stream’s t0 (starting_time) in rate-only time-offset calculations in both NWBAxisArrayIterator chunk emission and NWBSlicer.read_by_index.
Add regression tests covering late-starting stream alignment, ragged offset-table bounds guarding, and slicer offsets including starting_time.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
`src/ezmsg/nwb/iterator.py`	Fixes chunk offset table construction and time-offset computation; adds a defensive guard against short offset tables.
`src/ezmsg/nwb/slicer.py`	Fixes `read_by_index` time offset for rate-only streams to include `starting_time` (`t0`).
`tests/test_iterator.py`	Adds regression tests for late-start alignment and for avoiding crashes with truncated `chunk_offsets`.
`tests/test_slicer.py`	Adds regression test ensuring `read_by_index` offset includes a non-zero `starting_time`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Building regular-stream chunk offsets with np.round assigned each global chunk boundary to the nearest sample. When chunk_dur is not an integer multiple of the sample period, that pulls a pre-boundary sample into the chunk and disagrees with the event/timestamped paths, which use searchsorted(side="left"). Use ceil with a small epsilon to match first-sample-at-or-after-boundary semantics and tolerate float drift. Add a regression test asserting the offsets equal searchsorted(side="left") for a non-integer samples-per-chunk stream (256 Hz, 0.1 s chunks).

kylmcgr

LGTM

cboulay added 3 commits June 12, 2026 14:02

A stream's chunk_offsets table can be shorter than n_chunks; added a …

5696719

…bounds check at the top of the per-stream loop.

cboulay requested review from Copilot and kylmcgr June 12, 2026 18:53

Copilot started reviewing on behalf of cboulay June 12, 2026 18:54 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

Comment thread src/ezmsg/nwb/iterator.py

kylmcgr approved these changes Jun 12, 2026

View reviewed changes

cboulay merged commit 520c07f into dev Jun 12, 2026
14 checks passed

cboulay deleted the cboulay/chunk_indexing branch June 12, 2026 22:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix chunk indexing for streams with differing start times#12

Fix chunk indexing for streams with differing start times#12
cboulay merged 4 commits into
devfrom
cboulay/chunk_indexing

cboulay commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

kylmcgr left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

cboulay commented Jun 12, 2026

Summary

The bugs

The fix

Tests

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

kylmcgr left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants