A high-performance prime sieve in Rust, inspired by Vortex-Based Mathematics (VBM). Uses wheel-30 factorization — a generalization of the VBM "doubling circuit" filter — combined with segmented sieving, persistent per-prime streams, pre-sieve patterns, and multi-threaded parallelism.
Benchmarked on a single machine (results will vary by hardware):
| Limit | Single-thread | Multi-thread |
|---|---|---|
| 1,000,000 | 0.12 ms | — |
| 10,000,000 | 2.0 ms | 1.8 ms |
| 100,000,000 | 26 ms | 6.9 ms |
| 1,000,000,000 | 287 ms | 59 ms |
| 10,000,000,000 | 3.1 s | 577 ms |
Benchmarked against primal (the most popular Rust prime sieve), primes, and a naive boolean sieve:
Counting primes (π(n)):
| N | Vortex (single) | Vortex (parallel) | primal | naive | primes crate |
|---|---|---|---|---|---|
| 1,000,000 | 117 µs | 114 µs | 96 µs | 977 µs | 46.6 ms |
| 10,000,000 | 2.02 ms | 1.80 ms | 638 µs | 26.0 ms | 637 ms |
| 100,000,000 | 25.8 ms | 6.89 ms | 8.44 ms | — | — |
| 1,000,000,000 | 288 ms | 59.0 ms | 127 ms | — | — |
Enumerating primes:
| N | Vortex | primal | primes crate |
|---|---|---|---|
| 1,000,000 | 689 µs | 281 µs | 49.7 ms |
| 10,000,000 | 7.34 ms | 2.31 ms | 684 ms |
| 100,000,000 | 94.5 ms | 38.5 ms | — |
Key takeaways:
- At 100M+, Vortex parallel counting beats primal: 1.2x faster at 1e8, 2.2x faster at 1e9
- Single-threaded, primal's tighter inner loop wins at smaller limits (~3x at 1e7)
- Both crush simpler implementations: 400-500x faster than the primes crate
- Enumeration has room to improve — primal's extraction loop is ~2.5x faster
Run the benchmarks yourself: RUSTFLAGS="-C target-cpu=native" cargo bench
# Build with native CPU optimizations (recommended)
RUSTFLAGS="-C target-cpu=native" cargo build --release
# Run benchmarks + VBM digital root analysis
RUSTFLAGS="-C target-cpu=native" cargo run --release
# Run tests
cargo testuse vortex_sieve::{count_primes, count_primes_parallel, sieve_primes};
// Count primes up to 1 billion (single-threaded)
let count = count_primes(1_000_000_000);
assert_eq!(count, 50_847_534);
// Count primes up to 1 billion (multi-threaded)
let count = count_primes_parallel(1_000_000_000);
assert_eq!(count, 50_847_534);
// Enumerate all primes up to 100
let primes = sieve_primes(100);
assert_eq!(primes, vec![2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]);Vortex-Based Mathematics (Marko Rodin) observes that the digits 1–9 form distinct groups under mod-9 arithmetic:
- Doubling circuit: {1, 2, 4, 8, 7, 5} — double any number, take its digital root, and you cycle through these six digits forever.
- Trinity axis: {3, 6, 9} — these numbers form a closed group. 3 and 6 oscillate; 9 is the axis.
- Polar pairs: (1,8), (2,7), (4,5) — mirror symmetries where each pair sums to 9.
The key insight for prime sieving: primes greater than 3 only ever have digital roots on the doubling circuit {1, 2, 4, 5, 7, 8}. The 3-6-9 axis is completely empty (except for the prime 3 itself), because any number with digital root 3, 6, or 9 is divisible by 3.
Running the built-in analysis confirms this — primes distribute evenly across all six doubling circuit positions:
Doubling circuit (primes live here):
DR=1: 13,063 primes
DR=2: 13,099 primes
DR=4: 13,070 primes
DR=5: 13,068 primes
DR=7: 13,098 primes
DR=8: 13,099 primes
Trinity axis (primes excluded):
DR=3: 1 prime (only the prime 3 itself)
DR=6: 0 primes
DR=9: 0 primes
This "doubling circuit filter" is a special case of a well-established number theory technique called wheel factorization. The idea generalizes:
| Wheel | Primorial | Candidate Residues | Density | VBM Analogy |
|---|---|---|---|---|
| Wheel-6 | 2 × 3 = 6 | {1, 5} → 2 of 6 | 33% | Doubling circuit (mod 9) |
| Wheel-30 | 2 × 3 × 5 = 30 | {1,7,11,13,17,19,23,29} → 8 of 30 | 27% | Generalized vortex filter |
| Wheel-210 | 2 × 3 × 5 × 7 = 210 | 48 of 210 | 23% | Extended vortex filter |
By choosing wheel-30, we eliminate all multiples of 2, 3, and 5 upfront, leaving only 8 candidate residues per 30 numbers — which pack neatly into 1 byte per 30 numbers (1 bit per candidate).
Each byte in the sieve represents 30 consecutive integers. The 8 bits correspond to the 8 residues coprime to 30: {1, 7, 11, 13, 17, 19, 23, 29}. A set bit means "possibly prime"; a cleared bit means "composite."
Precomputed lookup tables handle the modular arithmetic at compile time (Crandall & Pomerance §3.2):
RESIDUES: the 8 coprime residuesRESIDUE_TO_BIT: maps a residue to its bit position (or 0xFF for non-wheel residues)BIT_MASK: byte masks for clearing individual bitsPRODUCT_RESIDUE[i][j]: for each pair of wheel residue classes, which bit their product lands on — eliminates all modular arithmetic from the sieve hot loopCOFACTOR_INDEX[pi][ci]: for a prime in classpi, which cofactor classjproduces a composite in classci— used bymarking_offsets()for zero-cost stream initialization
The sieve processes the number line in 256 KB segments (~7.8M numbers each), sized to fit in L2 cache. This prevents cache thrashing that would occur with a monolithic sieve array.
Persistent per-prime streams (Riesel Ch.1 optimization): each sieving prime's 8 marking positions are computed once at initialization, then advanced across segments by simple addition. This avoids recomputing modular offsets per segment — a key improvement over naive segmented sieves.
// Per-prime state: 8 (absolute_byte, bit_mask) pairs
// Computed once, advanced across segments
struct PrimeStreams {
p_step: usize,
streams: [(u64, u8); 8],
}Three public functions:
| Function | Purpose | Best For |
|---|---|---|
count_primes(limit) |
Returns π(n), single-threaded | Quick counts, small limits |
count_primes_parallel(limit) |
Returns π(n), multi-threaded via rayon | Large limits (> 10M) |
sieve_primes(limit) |
Returns Vec<u64> of all primes |
When you need the actual primes |
Before the main sieve loop, a repeating pattern for the smallest wheel primes (7, 11, 13) is precomputed. The pattern has period 7 × 11 × 13 = 1,001 bytes (fits comfortably in L1 cache) and is stamped into each segment via memcpy, eliminating ~58% of remaining composites without any per-element work.
The hot loop that marks composites operates entirely in byte-index space (stepping by p bytes) rather than number space (stepping by 30p numbers). This eliminates a division per iteration. The loop uses unsafe get_unchecked_mut to skip bounds checks after the index is validated upfront.
// No division in the hot loop — step by p bytes
while idx < sieve_len {
unsafe { *sieve.get_unchecked_mut(idx) &= mask; }
idx += p_step; // just an add
}When only counting primes (not enumerating), set bits are tallied using count_ones() on u64-aligned chunks, which compiles to a single POPCNT instruction on modern x86 CPUs. This counts 64 candidates per instruction.
Uses Newton's method for exact integer square root computation, avoiding floating-point rounding errors that can occur with (n as f64).sqrt() for large values near 2^64.
count_primes_parallel divides the number line into independent segments and processes them in parallel using rayon. Each thread gets its own sieve buffer — no synchronization needed during sieving.
The algorithms draw from several key texts in computational number theory:
-
Crandall & Pomerance, Prime Numbers: A Computational Perspective (2nd ed., 2005) — Section 3.2 covers sieving to recognize primes with wheel factorization. The
PRODUCT_RESIDUEtable implements their approach of precomputing all residue-class interactions at compile time to eliminate modular arithmetic from the inner loop. -
Riesel, Prime Numbers and Computer Methods for Factorization (2nd ed., 1994) — Chapter 1 covers the Sieve of Eratosthenes with compact prime tables. The wheel-30 bit-packing (1 byte = 30 numbers) follows Riesel's observation that eliminating multiples of 2, 3, and 5 gives 8 candidates per 30 numbers, fitting naturally into a byte. The persistent stream approach avoids recomputing per-segment offsets.
-
Programming Praxis — Blog posts on segmented sieves and Pritchard's wheel sieve informed the segment-size tuning and pre-sieve pattern strategy.
# Debug build (for development)
cargo build
# Release build with full optimizations
RUSTFLAGS="-C target-cpu=native" cargo build --release
# Run criterion benchmarks
RUSTFLAGS="-C target-cpu=native" cargo benchThe Cargo.toml release profile enables:
opt-level = 3— maximum optimizationlto = "fat"— full link-time optimization across all cratescodegen-units = 1— single codegen unit for better inlining
The RUSTFLAGS="-C target-cpu=native" flag enables CPU-specific instructions (AVX2, POPCNT, etc.) for your machine.
# Run all tests
cargo test
# Run tests in release mode (faster, catches different class of bugs)
cargo test --releaseThe test suite includes:
- Reference comparison: results checked against a simple trial-division sieve
- Known values: π(10) = 4, π(100) = 25, π(1,000) = 168, π(10,000) = 1,229, π(100,000) = 9,592, π(1,000,000) = 78,498, π(10,000,000) = 664,579
- Cross-validation:
count_primesvssieve_primesproduce identical results - Parallel correctness:
count_primes_parallelmatches single-threaded count - Sortedness: enumerated primes are in strictly increasing order
- Edge cases: limits 0 through 7
- Integer square root: exact for all inputs including u64::MAX
- Wheel tables: all residues are coprime to 30, product table entries are valid
- Marko Rodin, Vortex-Based Mathematics — the original observation about digital root patterns
- Richard Crandall & Carl Pomerance, Prime Numbers: A Computational Perspective (2nd ed., Springer, 2005) — sieving algorithms and wheel factorization theory
- Hans Riesel, Prime Numbers and Computer Methods for Factorization (2nd ed., Birkhäuser, 1994) — compact sieve implementations and prime table compression
- J.S. Milne, Algebraic Number Theory (v3.08, 2020) — foundational number theory
- Sieve of Eratosthenes (Wikipedia)
- Wheel factorization (Wikipedia)
- primesieve by Kim Walisch — the state-of-the-art reference implementation
- Programming Praxis — segmented sieve, Sieve of Atkin, Pritchard's wheel sieve exercises
- Paul Pritchard, "Explaining the Wheel Sieve" (1982)
MIT