Skip to content

feat: fragmented random sampler#212

Draft
selmanozleyen wants to merge 12 commits into
scverse:mainfrom
selmanozleyen:feat/fragmanted-random-sampler
Draft

feat: fragmented random sampler#212
selmanozleyen wants to merge 12 commits into
scverse:mainfrom
selmanozleyen:feat/fragmanted-random-sampler

Conversation

@selmanozleyen
Copy link
Copy Markdown
Member

The idea is to be able to sample from a list of masks. When categorical sampler is implemented each category will have a FragmantedRandomSampler inside and will process their splits and chunks outputs to provide us categorical sampling.

I wrote _fragmented_random_sampler.py myself but not the unit tests. I will write them soon.

One thing needs assessment is the mask attribute in the Sampler API. FragmentedRandomSampler inherits Sampler but can't support mask. DistributedSampler thinks all Samplers have a mask property.

I would just go with MaskSampler for our old assumptions which takes mask: slice and let DistributedSampler accept that. Only the signature would change here so I am not sure if it would be a breaking change from the DistributedSampler side. If DistributedSampler didn't exist I could've just silently generalize Sampler.mask: list[slice] | slice and let the child classes overload the mask with ChunkSampler.mask: slice.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 26, 2026

Codecov Report

❌ Patch coverage is 97.24771% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.86%. Comparing base (91a6c3e) to head (42d42a7).

Files with missing lines Patch % Lines
src/annbatch/samplers/_chunk_sampler.py 93.33% 2 Missing ⚠️
...rc/annbatch/samplers/_fragmented_random_sampler.py 98.73% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #212      +/-   ##
==========================================
- Coverage   93.33%   91.86%   -1.48%     
==========================================
  Files          14       15       +1     
  Lines        1111     1193      +82     
==========================================
+ Hits         1037     1096      +59     
- Misses         74       97      +23     
Files with missing lines Coverage Δ
...rc/annbatch/samplers/_fragmented_random_sampler.py 98.73% <98.73%> (ø)
src/annbatch/samplers/_chunk_sampler.py 98.64% <93.33%> (+0.02%) ⬆️

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ilan-gold
Copy link
Copy Markdown
Collaborator

This is an interesting design! It allows us to reuse the this FragmentedSampler class for general making out of multiple ranges, which is nice. A few questions:

What happens here when we have a lot of categories for perfect random sampling? We have a lot of classes and a list[slice] mask for each one which is effectively a list of length-one slices? Could that degrade, memory-wise? If you have a large amount of categories distributed at random (low thousands, let's say 2000, for example), and a lot of obs (let's say 10 million), that is actually quite a bit of memory, no? If each slice were to be 1 byte (which is definitely conservative) that's 20GB (i.e., 2000 * 10_000_000 * 1 byte).

Also, could you highlight why not make Sampler polymorphic over the mask argument and then replace the current ChunkSampler with FragmentedRandomSampler, where the case of mask: slice is converted internally to mask: list[slice] = [mask]? Then we wouldn't need two classes.

@selmanozleyen
Copy link
Copy Markdown
Member Author

Also, could you highlight why not make Sampler polymorphic over the mask argument and then replace the current ChunkSampler with FragmentedRandomSampler, where the case of mask: slice is converted internally to mask: list[slice] = [mask]? Then we wouldn't need two classes.

Hmm interesting. We can't actually put FragmentedRandomSampler because its shuffled and with replacement atm but a FragmentedSampler could work if thats what you mean. We can do all these but I'd like to know what you are ok with regarding the DistributedSampler. Do you think we should extend this list[slice] behaviour to DistributedSampler? (Or we can just error when the mask doesnt have length 1) I wasnt sure if you wanted the PR to spill over the DistributedSampler, thats why I am asking

@selmanozleyen
Copy link
Copy Markdown
Member Author

About the performance: yeah I agree if we fixed len(masks)==1 per category we can have a better implementation for that case but since the Collection groupby'es in a fragmanted way I assumed it would be natural to go with this case. Even if we had one class to do all this categorical sampling, my hunch is, it can't be that better performance wise (as long as we allow fragmanted categories)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants