Stream CsvDataTableStore.getRows from disk lazily and mandate closing the returned stream#8156

Merged
jkschneider merged 1 commit into main from stream-data-table-rows-lazy on Jun 30, 2026

Conversation

@jkschneider jkschneider commented Jun 30, 2026

Member

What's changed?

CsvDataTableStore.readRows(...) buffered every matching CSV file fully into a List before returning list.stream(), so reading back a large data table held all of its rows in memory at once. It now streams rows lazily from disk via a single Spliterator that keeps one file open at a time:

  • rows are produced on demand, so peak memory is bounded to one row rather than the whole table;
  • the file is closed the moment its last row is read, so a fully drained stream (how every caller consumes one today) releases its handle with no explicit close;
  • closing the stream early (try-with-resources) also releases the open file.
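The behaviour listed above can be illustrated with a minimal sketch. This is not the code from this PR: `LazyRows`, `RowSpliterator`, and `hasOpenFile` are hypothetical names, rows are treated as plain lines rather than parsed CSV records, and the sketch closes a file when a read hits end-of-file (the read after the last row), which is slightly later than "the moment its last row is read".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical sketch, not the CsvDataTableStore implementation.
final class LazyRows {

    /** Streams lines from the given files in order, holding at most one file open. */
    static Stream<String> rows(List<Path> files) {
        RowSpliterator split = new RowSpliterator(files.iterator());
        // onClose covers early termination: closing the stream releases the open file.
        return StreamSupport.stream(split, false).onClose(split::close);
    }

    static final class RowSpliterator extends Spliterators.AbstractSpliterator<String> {
        private final Iterator<Path> files;
        private BufferedReader current; // null whenever no file is open

        RowSpliterator(Iterator<Path> files) {
            super(Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL);
            this.files = files;
        }

        @Override
        public boolean tryAdvance(Consumer<? super String> action) {
            try {
                while (true) {
                    if (current == null) {
                        if (!files.hasNext()) {
                            return false; // every file has been read and closed
                        }
                        current = Files.newBufferedReader(files.next());
                    }
                    String line = current.readLine();
                    if (line != null) {
                        action.accept(line); // hand exactly one row downstream
                        return true;
                    }
                    close(); // end of this file: release it before opening the next
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        /** Releases the currently open file, if any. Safe to call more than once. */
        void close() {
            BufferedReader reader = current;
            current = null;
            if (reader != null) {
                try {
                    reader.close();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }

        boolean hasOpenFile() {
            return current != null;
        }
    }
}
```

A terminal operation that drains the stream keeps calling `tryAdvance` until it returns `false`, so the last file is already closed by then; a short-circuiting consumer relies on the `onClose` hook instead.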

Because the returned stream now owns a file handle, DataTableStore.getRows(...) is annotated @MustBeClosed. The annotation is already on the compile classpath transitively (via Caffeine); it's added explicitly as a compileOnly dependency since it's CLASS-retention and not needed at runtime. The effect is that IntelliJ flags any call site that consumes the stream without try-with-resources, so a leak can't slip through unnoticed. Every call site in the repo is wrapped accordingly.
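A wrapped call site might look like the sketch below. Everything here is illustrative: `Store` is a hypothetical stand-in for a data table store, its no-argument `getRows()` does not mirror the real signature, and the nested `@MustBeClosed` is a local placeholder for `com.google.errorprone.annotations.MustBeClosed` so that the example compiles without the dependency.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Target;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

// Hypothetical sketch of the call-site pattern, not code from this PR.
final class MustBeClosedSketch {

    /** Local placeholder for the Error Prone annotation, declared here only for self-containment. */
    @Target({ElementType.METHOD, ElementType.CONSTRUCTOR})
    @interface MustBeClosed {
    }

    /** Hypothetical store; the 'released' flag stands in for a file handle being freed. */
    static final class Store {
        final AtomicBoolean released = new AtomicBoolean(false);

        @MustBeClosed
        Stream<String> getRows() {
            return Stream.of("a", "b", "c").onClose(() -> released.set(true));
        }
    }

    /** Drains the stream inside try-with-resources, so it is closed on every path. */
    static long countRows(Store store) {
        try (Stream<String> rows = store.getRows()) {
            return rows.count();
        }
    }

    /** Stops after the first row; try-with-resources still releases the handle. */
    static Optional<String> firstRow(Store store) {
        try (Stream<String> rows = store.getRows()) {
            return rows.findFirst();
        }
    }
}
```

The second method is the case where the annotation matters most: a short-circuiting operation such as `findFirst()` never reaches the end of the data, so only the explicit close releases the resource.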

What's your motivation?

Recipes — and the hosts that run them — can read their own data tables back, e.g. to export or aggregate them, and those tables can get very large (one row per method/class across a large repository). Buffering the entire table into a List before the consumer sees a single row makes peak memory scale with table size; streaming bounds the store side to one row at a time.

Checklist

  • I've added unit tests to cover both positive and negative cases
  • I've read and applied the recipe conventions and best practices
  • I've used the IntelliJ IDEA auto-formatter on affected files

… the returned stream

readRows(...) buffered every matching CSV file fully into a List before returning list.stream(), so reading back a large data table held all of its rows in memory at once. It now streams rows lazily via a single Spliterator that keeps one file open at a time: rows are produced on demand, the file is closed the moment its last row is read (so a fully-drained stream self-closes), and closing the stream early also releases the open file.

Because the returned stream now owns a file handle, DataTableStore.getRows is annotated @MustBeClosed (error_prone_annotations, added as compileOnly) so callers are flagged if they consume it without try-with-resources. All call sites are wrapped accordingly.

Alternative to #7858.
@github-project-automation github-project-automation Bot moved this to In Progress in OpenRewrite Jun 30, 2026
@jkschneider jkschneider marked this pull request as ready for review June 30, 2026 19:14
@jkschneider jkschneider merged commit df82349 into main Jun 30, 2026
1 check failed
@jkschneider jkschneider deleted the stream-data-table-rows-lazy branch June 30, 2026 19:15
@github-project-automation github-project-automation Bot moved this from In Progress to Done in OpenRewrite Jun 30, 2026