feat: fragmented random sampler by selmanozleyen · Pull Request #212 · scverse/annbatch

selmanozleyen · 2026-05-26T13:50:02Z

The idea is to be able to sample from a list of masks. When categorical sampler is implemented each category will have a FragmantedRandomSampler inside and will process their splits and chunks outputs to provide us categorical sampling.

I wrote _fragmented_random_sampler.py myself but not the unit tests. I will write them soon.

One thing needs assessment is the mask attribute in the Sampler API. FragmentedRandomSampler inherits Sampler but can't support mask. DistributedSampler thinks all Samplers have a mask property.

I would just go with MaskSampler for our old assumptions which takes mask: slice and let DistributedSampler accept that. Only the signature would change here so I am not sure if it would be a breaking change from the DistributedSampler side. If DistributedSampler didn't exist I could've just silently generalize Sampler.mask: list[slice] | slice and let the child classes overload the mask with ChunkSampler.mask: slice.

codecov · 2026-05-26T13:54:21Z

Codecov Report

❌ Patch coverage is 97.24771% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.86%. Comparing base (91a6c3e) to head (42d42a7).

Files with missing lines	Patch %	Lines
src/annbatch/samplers/_chunk_sampler.py	93.33%	2 Missing ⚠️
...rc/annbatch/samplers/_fragmented_random_sampler.py	98.73%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #212      +/-   ##
==========================================
- Coverage   93.33%   91.86%   -1.48%     
==========================================
  Files          14       15       +1     
  Lines        1111     1193      +82     
==========================================
+ Hits         1037     1096      +59     
- Misses         74       97      +23

Files with missing lines	Coverage Δ
...rc/annbatch/samplers/_fragmented_random_sampler.py	`98.73% <98.73%> (ø)`
src/annbatch/samplers/_chunk_sampler.py	`98.64% <93.33%> (+0.02%)`	⬆️

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ilan-gold · 2026-05-26T20:37:11Z

This is an interesting design! It allows us to reuse the this FragmentedSampler class for general making out of multiple ranges, which is nice. A few questions:

What happens here when we have a lot of categories for perfect random sampling? We have a lot of classes and a list[slice] mask for each one which is effectively a list of length-one slices? Could that degrade, memory-wise? If you have a large amount of categories distributed at random (low thousands, let's say 2000, for example), and a lot of obs (let's say 10 million), that is actually quite a bit of memory, no? If each slice were to be 1 byte (which is definitely conservative) that's 20GB (i.e., 2000 * 10_000_000 * 1 byte).

Also, could you highlight why not make Sampler polymorphic over the mask argument and then replace the current ChunkSampler with FragmentedRandomSampler, where the case of mask: slice is converted internally to mask: list[slice] = [mask]? Then we wouldn't need two classes.

selmanozleyen · 2026-05-27T07:33:00Z

Also, could you highlight why not make Sampler polymorphic over the mask argument and then replace the current ChunkSampler with FragmentedRandomSampler, where the case of mask: slice is converted internally to mask: list[slice] = [mask]? Then we wouldn't need two classes.

Hmm interesting. We can't actually put FragmentedRandomSampler because its shuffled and with replacement atm but a FragmentedSampler could work if thats what you mean. We can do all these but I'd like to know what you are ok with regarding the DistributedSampler. Do you think we should extend this list[slice] behaviour to DistributedSampler? (Or we can just error when the mask doesnt have length 1) I wasnt sure if you wanted the PR to spill over the DistributedSampler, thats why I am asking

selmanozleyen · 2026-05-27T08:01:55Z

About the performance: yeah I agree if we fixed len(masks)==1 per category we can have a better implementation for that case but since the Collection groupby'es in a fragmanted way I assumed it would be natural to go with this case. Even if we had one class to do all this categorical sampling, my hunch is, it can't be that better performance wise (as long as we allow fragmanted categories)

selmanozleyen added 12 commits May 26, 2026 14:07

introduce the new class

4ca82bc

add docstring

00463e8

self._in_mem not needed

e50e78e

simplify the validate method

2485233

simplify value error

9ef0573

arrange assignments

83b3b3a

add comment

1dfa620

fix of self.chunksize

2635cc9

merge tests

5fc6d90

merge some unit tests

12e8530

merge some unit tests

808fb55

rewrite header

42d42a7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: fragmented random sampler#212

feat: fragmented random sampler#212
selmanozleyen wants to merge 12 commits into
scverse:mainfrom
selmanozleyen:feat/fragmanted-random-sampler

selmanozleyen commented May 26, 2026

Uh oh!

codecov Bot commented May 26, 2026 •

edited

Loading

Uh oh!

ilan-gold commented May 26, 2026

Uh oh!

selmanozleyen commented May 27, 2026

Uh oh!

selmanozleyen commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

selmanozleyen commented May 26, 2026

Uh oh!

codecov Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ilan-gold commented May 26, 2026

Uh oh!

selmanozleyen commented May 27, 2026

Uh oh!

selmanozleyen commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented May 26, 2026 •

edited

Loading