Add 'Two-Bits Bucketing' strategy #37

Merged: janpfeifer merged 1 commit into main from two-bits-bucketing on Apr 7, 2026

@janpfeifer
Contributor

This PR introduces a new bucketing strategy called 'Two-Bits Bucketing' to the tokenizers/bucket package.

Changes:

  • tokenizers/bucket package:
    • Added ByTwoBitBucket(batchSize, minSentenceLength int): Configures the bucketizer to pad each sentence length up to the next size whose binary representation has at most its two most significant bits set (1, 2, 3, 4, 6, 8, 12, 16, ...).
    • Added ByTwoBitBucketBudget(tokensBudget, minSentenceLength int): Similar to ByTwoBitBucket, but adjusts the batch size so that each batch fits within a fixed token budget.
    • Added TwoBitBucketLen(unpaddedLen int): Helper function that returns the smallest size >= unpaddedLen in which only the two most significant bits may be set (that is, a value of the form 2^k or 3·2^(k-1)).
  • Documentation: Updated docs/CHANGELOG.md to include the new strategy.
  • Tests: Added comprehensive tests for the new functions in tokenizers/bucket/bucket_test.go.
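
As a rough illustration of the rounding rule described above, here is a minimal sketch. It is not the code merged in this PR: `twoBitBucketLen` and `batchSizeForBudget` are hypothetical names, the handling of lengths below 1 is an assumption, and dividing the token budget by the bucket length is only one plausible reading of how a budget-based batch size could be derived.

```go
package main

import (
	"fmt"
	"math/bits"
)

// twoBitBucketLen is a hypothetical re-implementation for illustration only.
// It returns the smallest value >= unpaddedLen in which no bit below the two
// most significant bits is set, i.e. a value of the form 2^k or 3*2^(k-1).
func twoBitBucketLen(unpaddedLen int) int {
	if unpaddedLen <= 1 {
		return 1 // Assumption: lengths below 1 are clamped to 1.
	}
	n := uint(unpaddedLen)
	shift := bits.Len(n) - 2     // number of bits below the top two
	mask := uint(1)<<shift - 1   // those low bits, all set
	return int((n + mask) &^ mask) // round up, then clear the low bits
}

// batchSizeForBudget is a hypothetical helper: it assumes the batch size is
// the number of padded sentences of length bucketLen that fit in tokensBudget,
// with a minimum of one sentence per batch.
func batchSizeForBudget(tokensBudget, bucketLen int) int {
	size := tokensBudget / bucketLen
	if size < 1 {
		return 1
	}
	return size
}

func main() {
	for _, n := range []int{1, 2, 3, 5, 7, 9, 13, 17, 100} {
		b := twoBitBucketLen(n)
		fmt.Printf("len=%3d -> bucket=%3d, batch size for a 1024-token budget=%d\n",
			n, b, batchSizeForBudget(1024, b))
	}
}
```

Under these assumptions, a sentence of length 5 lands in the bucket of size 6, length 9 in the bucket of size 12, and length 17 in the bucket of size 24.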

About Two-Bits Bucketing:

This is a '2-bit semi-log bucketing': beyond the smallest sizes, consecutive bucket sizes differ by a factor that alternates between 1.5 and 4/3 (~1.333), whose geometric mean is sqrt(2) (~1.414). The resulting bucket sizes are 'friendlier' for binary addressing and memory pages, and the finer steps reduce padding compared with pure powers of 2.
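
The alternating ratios can be checked with a short sketch. `twoBitSizes` is a hypothetical helper written for this illustration (it is not part of the package); it simply enumerates the sequence 1, 2, 3, 4, 6, 8, 12, 16, ... and prints the ratio between neighbouring sizes.

```go
package main

import (
	"fmt"
	"math"
)

// twoBitSizes is a hypothetical helper that returns the first n sizes of the
// sequence 1, 2, 3, 4, 6, 8, 12, 16, ...: each power of two p >= 2 followed by
// p + p/2.
func twoBitSizes(n int) []int {
	sizes := make([]int, 0, n)
	if n > 0 {
		sizes = append(sizes, 1)
	}
	for p := 2; len(sizes) < n; p *= 2 {
		sizes = append(sizes, p)
		if len(sizes) < n {
			sizes = append(sizes, p+p/2)
		}
	}
	return sizes
}

func main() {
	sizes := twoBitSizes(10)
	fmt.Println("sizes:", sizes)
	// Start at index 1: the first step (1 -> 2) is a factor of 2 and does not
	// follow the alternating pattern.
	for i := 1; i+1 < len(sizes); i++ {
		ratio := float64(sizes[i+1]) / float64(sizes[i])
		fmt.Printf("%d -> %d: x%.3f\n", sizes[i], sizes[i+1], ratio)
	}
	// Two consecutive steps always multiply to 2, so the geometric mean of a
	// step is sqrt(2).
	fmt.Printf("geometric mean of 1.5 and 4/3: %.3f\n", math.Sqrt(1.5*4.0/3.0))
}
```

Because every pair of consecutive steps doubles the size, a sentence is padded by a factor of less than 1.5 in the worst case, compared with a factor approaching 2 for pure power-of-2 buckets.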

@janpfeifer merged commit 7458693 into main on Apr 7, 2026
4 checks passed
@janpfeifer deleted the two-bits-bucketing branch on April 7, 2026 09:09