Minor performance improvement in binary fuse build #53
RaduBerinde wants to merge 1 commit into FastFilter:master
Conversation
|
Ok. I had an AI write the following python script:

```python
#!/usr/bin/env python3
import subprocess
import statistics
import collections

def run_benchmark():
    """Run the benchmark command and return its stdout."""
    result = subprocess.run(
        ["go", "test", "-run=^$", "-bench=BenchmarkBinaryFusePopulate"],
        capture_output=True,
        text=True,
    )
    return result.stdout

def extract_mkeys_per_sec(output):
    """Extract (benchmark name, MKeys/s value) pairs from the benchmark output."""
    pairs = []
    lines = output.split('\n')
    for line in lines:
        if 'MKeys/s' in line:
            parts = line.split()
            # The benchmark name is the first field, without the -GOMAXPROCS suffix.
            name = parts[0].rsplit('-', 1)[0]
            # Find the value immediately before 'MKeys/s'.
            for i, part in enumerate(parts):
                if part == 'MKeys/s':
                    try:
                        value = float(parts[i - 1])
                        pairs.append((name, value))
                    except (ValueError, IndexError):
                        pass
    return pairs

def main():
    mkeys_dict = collections.defaultdict(list)
    num_runs = 10
    print(f"Running benchmark {num_runs} times...")
    for i in range(num_runs):
        print(f"Run {i+1}/{num_runs}")
        output = run_benchmark()
        pairs = extract_mkeys_per_sec(output)
        for name, val in pairs:
            mkeys_dict[name].append(val)
    print("\n")  # Blank line after the per-run progress output.
    for bits in [8, 16]:
        print(f"## {bits}-bit")
        print("| n | Min | Max | Average | Median |")
        print("|-----|-----|-----|---------|--------|")
        for num in [10000, 100000, 1000000]:
            key = f"BenchmarkBinaryFusePopulate/{bits}/n={num}"
            if key in mkeys_dict and mkeys_dict[key]:
                vals = mkeys_dict[key]
                min_val = min(vals)
                max_val = max(vals)
                avg_val = statistics.mean(vals)
                med_val = statistics.median(vals)
                print(f"| {num} | {min_val:.2f} | {max_val:.2f} | {avg_val:.2f} | {med_val:.2f} |")
            else:
                print(f"| {num} | N/A | N/A | N/A | N/A |")
        print()  # Empty line between tables.

if __name__ == "__main__":
    main()
```
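For reference, a line of the `go test -bench` output that the parser above matches looks roughly like this (these particular numbers are made up):

```
BenchmarkBinaryFusePopulate/8/n=1000000-10            39          29012345 ns/op                34.47 MKeys/s
```

The trailing `-10` is the GOMAXPROCS suffix, which the script strips with `rsplit('-', 1)`; the float immediately before `MKeys/s` is the value it collects.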
The script runs the benchmarks 10 times and gives me some stats about the number of millions of keys per second populated. Using Go 1.25 and an Apple M4... For your PR I get...

8-bit
16-bit
From our main branch I get...

8-bit
16-bit
So, interestingly, I am getting no gain whatsoever in the 8-bit case, but a measurable gain in the 16-bit case when we have lots of data. Could you run my script (or the equivalent) and see what you find? |
|
The output in the PR description was generated with `benchstat`. It shows the p-value derived from the Mann-Whitney U-test (the probability that you'd get the difference by chance); see the workflow sketch below.
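For reference, a sketch of that workflow (the file paths are just examples; `benchstat` is the tool from golang.org/x/perf/cmd/benchstat):

```
go test -run='^$' -bench=BenchmarkBinaryFusePopulate -count=10 > /tmp/before.txt
# ...switch to the branch with the change...
go test -run='^$' -bench=BenchmarkBinaryFusePopulate -count=10 > /tmp/after.txt
benchstat /tmp/before.txt /tmp/after.txt
```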
Anyway, here's the output of your script on my Apple M1:

This PR, 8-bit

This PR, 16-bit

main, 8-bit

main, 16-bit
It's entirely possible that the M4's improved architecture makes the local-table solution just as good. It is odd that we see an improvement only in the 16-bit case; the fingerprint width shouldn't make the change any more or less effective. I think server machines are more important to look at; I already posted results for the GCE AMD-based machine. Overall it looks like a small improvement, but it's close enough that it's not too convincing. No problem whatsoever if you decide to close this. On a more general note, I tried a bunch of potential improvements, but the code looks like it is already exceptionally well optimized. I had another small potential improvement - to skip the |
During peeling, we can calculate the third index by XORing all known indexes instead of looking it up in the little table. This yields a modest improvement.

On an Apple M1:

```
❯ benchstat /tmp/before.txt /tmp/after.txt
name                               old time/op  new time/op  delta
BinaryFusePopulate/8/n=1000000-10  29.0ms ± 5%  27.3ms ± 1%  -5.77%  (p=0.000 n=10+9)

name                               old MKeys/s  new MKeys/s  delta
BinaryFusePopulate/8/n=1000000-10  34.5 ± 5%    36.6 ± 1%    +6.06%  (p=0.000 n=10+9)
```

On an n4d-standard-8 (AMD Turin):

```
name                              old time/op  new time/op  delta
BinaryFusePopulate/8/n=1000000-8  34.9ms ± 1%  34.4ms ± 1%  -1.44%  (p=0.000 n=10+9)

name                              old MKeys/s  new MKeys/s  delta
BinaryFusePopulate/8/n=1000000-8  28.7 ± 1%    29.1 ± 1%    +1.45%  (p=0.000 n=10+9)
```
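As an aside, here is a minimal, self-contained sketch of the XOR-recovery idea the description alludes to (this is not the repository's actual peeling code; all names and values are hypothetical): if the XOR of a key's three slot indexes is known, the one unknown index can be recovered from the two known ones without a table lookup.

```go
package main

import "fmt"

// thirdIndex recovers the single unknown slot index from the XOR of all
// three indexes and the two indexes that are already known: since
// xorAll = h0 ^ h1 ^ h2, XORing the known indexes back in cancels them
// and leaves the third.
func thirdIndex(xorAll, known1, known2 uint32) uint32 {
	return xorAll ^ known1 ^ known2
}

func main() {
	// Hypothetical slot indexes for one key.
	h0, h1, h2 := uint32(17), uint32(1042), uint32(2089)
	xorAll := h0 ^ h1 ^ h2

	fmt.Println(thirdIndex(xorAll, h0, h1) == h2) // true
	fmt.Println(thirdIndex(xorAll, h1, h2) == h0) // true
}
```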
Force-pushed 787d8e2 to d75fdf8.
|
Rebased onto the newest main. |
|
I was experimenting and hit a Go bug 😆 golang/go#77181 |
|
What do you think of something like this instead: https://github.com/RaduBerinde/xorfilter/pull/new/other-index-2

I get a more significant improvement on n4d (AMD Turin) but a smaller improvement on M1:

N4d:

On N4 (Intel): |
|
I am working on a branchless version of the above. Let me close this for now. |
Just to be clear, I wasn't about to close this or reject it or any such thing. :-) You closed it. Ok, let us move to the other PR. |