Skip to content

perf: internal integer indexing#209

Merged
ilan-gold merged 16 commits into
scverse:mainfrom
selmanozleyen:perf/internal-integer-indexing
May 20, 2026
Merged

perf: internal integer indexing#209
ilan-gold merged 16 commits into
scverse:mainfrom
selmanozleyen:perf/internal-integer-indexing

Conversation

@selmanozleyen
Copy link
Copy Markdown
Member

@selmanozleyen selmanozleyen commented May 18, 2026

In this PR we use row indices for the fetching as much as possible instead of slices.

In the future if we bring zarrs native integer indexing the transition will be easier. Also according to @ilan-gold for bigger chunk sizes it has no performance impact and it gives a %20 boost for integer indexing.

ilan-gold and others added 10 commits May 6, 2026 14:26
…ay differ (scverse#207)

* fix: indexing indices and data separately because their chunk grids may differ

* chore: relnote

* chore: move to hatch-vcs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* version type

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@selmanozleyen selmanozleyen changed the title Perf: Internal integer indexing perf: internal integer indexing May 18, 2026
Comment thread src/annbatch/loader.py Outdated
b_end = b_start + shape[0]
mask = (global_index >= b_start) & (global_index < b_end)
if mask.any():
offset = 0 if use_original_space else b_start
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is use_original_space used anywhere anymore? I am fine getting rid its use (in generating the indices), but then lets remove the arg because it's dead code

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, removed it

Copy link
Copy Markdown
Collaborator

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove that arg!

@codecov
Copy link
Copy Markdown

codecov Bot commented May 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.33%. Comparing base (254ad41) to head (d6155ba).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #209      +/-   ##
==========================================
+ Coverage   93.18%   93.33%   +0.15%     
==========================================
  Files          14       14              
  Lines        1130     1111      -19     
==========================================
- Hits         1053     1037      -16     
+ Misses         77       74       -3     
Files with missing lines Coverage Δ
src/annbatch/loader.py 93.70% <100.00%> (+0.56%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ilan-gold ilan-gold added the run-gpu-ci Signal that gpu ci should be run label May 18, 2026
Comment thread src/annbatch/loader.py
dataset_index_to_slices
A lookup of the list-placement index of a dataset to the request slices.
dataset_index_to_rows
A lookup of the list-placement index of a dataset to the sorted row indices to fetch.
Copy link
Copy Markdown
Member Author

@selmanozleyen selmanozleyen May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lookup of the list-placement index of a dataset to the sorted row indices to fetch.

@ilan-gold this isn't correct right? Is it sorted row indices?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think anything is sorted. Let me check

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YEah, I mean you can double check, but I don't think we have any sorting enforcement. That's why constructing the batches to yield is completely the sampler's responsibility and relies on the sampler knowing its own fetch order

@selmanozleyen
Copy link
Copy Markdown
Member Author

selmanozleyen commented May 19, 2026

I can't access the cluster right now but my worry was creating integers from slices to then creating slices again from those integers. I want to just point out when the number of slices to loop over is small (which I thought was the goal for having slices) main should win right?

So I ran this local isolated benchmark: https://gist.github.com/selmanozleyen/199b73288ce23cd3928e0b5f00425924

The cutoff chunksize for main to win is like below:

preload_nchunks  cutoff
32               32-64
64               64-128
128              64-128
256              64-128
512              64-128
32768            64-128

When the number of datasets is 5, I found these cutoffs to be higher around 64-128. So the main should win when preload_nchunks is low and chunksize is high. Whenever the dataset number is higher main win gets harder.

But I'd consider this a branch win in general.

@selmanozleyen selmanozleyen marked this pull request as ready for review May 19, 2026 12:29
@selmanozleyen selmanozleyen requested a review from ilan-gold May 19, 2026 12:30
Copy link
Copy Markdown
Collaborator

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to just point out when the number of slices to loop over is small (which I thought was the goal for having slices) main should win

Maybe, but running locally, it seems that the amount of time it takes on small numbers of slices once you get bigger chunks is 1/5th-1/10th regardless of the method. So it isn't really a bottleneck because the absolute time is much smaller regardless. For example, yes, the below case is 3x slower on this branch

❯ python tester.py --n-rows=100_000_000 --n-datasets=1 --preload-nchunks=32 --chunk-sizes=512 
n_rows=100000000 n_datasets=1 preload_nchunks=32 mean_nnz=20.0
chunk_size      rows/request    main_us branch_us       branch/main
512     16384   80.7    264.7   3.28

but compared to the amount of time the chunk_size=1 case takes in absolute numbers

❯ python tester.py --n-rows=20_000_000 --n-datasets=1 --preload-nchunks=4096 --chunk-sizes=1 
n_rows=20000000 n_datasets=1 preload_nchunks=4096 mean_nnz=20.0
chunk_size      rows/request    main_us branch_us       branch/main
1       4096    8516.1  5411.5  0.64

it's irrelevant. Also, your benchmark with row_runs bit can happen in somewhat in parallel to fetching because our _async_array._get_selection releases the GIL, but not sure how much impact that has.

Are you seeing this as well?

Comment thread src/annbatch/loader.py Outdated
@selmanozleyen
Copy link
Copy Markdown
Member Author

Thats very convincing! Thanks! And its kind of mind blowing

@ilan-gold ilan-gold merged commit 91a6c3e into scverse:main May 20, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-gpu-ci Signal that gpu ci should be run

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants