Minor performance improvement in binary fuse build #53
RaduBerinde wants to merge 1 commit into FastFilter:master
Conversation
|
Ok. I had an AI write the following python script:

```python
#!/usr/bin/env python3
import subprocess
import statistics
import collections

def run_benchmark():
    """Run the benchmark command and return its stdout."""
    result = subprocess.run(
        ["go", "test", "-run=^$", "-bench=BenchmarkBinaryFusePopulate"],
        capture_output=True,
        text=True,
    )
    return result.stdout

def extract_mkeys_per_sec(output):
    """Extract (benchmark name, MKeys/s value) pairs from the benchmark output."""
    pairs = []
    lines = output.split('\n')
    for line in lines:
        if 'MKeys/s' in line:
            parts = line.split()
            # The benchmark name is the first field, without the -GOMAXPROCS suffix.
            name = parts[0].rsplit('-', 1)[0]
            # Find the value immediately before 'MKeys/s'.
            for i, part in enumerate(parts):
                if part == 'MKeys/s':
                    try:
                        value = float(parts[i - 1])
                        pairs.append((name, value))
                    except (ValueError, IndexError):
                        pass
    return pairs

def main():
    mkeys_dict = collections.defaultdict(list)
    num_runs = 10
    print(f"Running benchmark {num_runs} times...")
    for i in range(num_runs):
        print(f"Run {i+1}/{num_runs}")
        output = run_benchmark()
        pairs = extract_mkeys_per_sec(output)
        for name, val in pairs:
            mkeys_dict[name].append(val)
    print("\n")  # Blank line after the per-run progress output.
    for bits in [8, 16]:
        print(f"## {bits}-bit")
        print("| n | Min | Max | Average | Median |")
        print("|-----|-----|-----|---------|--------|")
        for num in [10000, 100000, 1000000]:
            key = f"BenchmarkBinaryFusePopulate/{bits}/n={num}"
            if key in mkeys_dict and mkeys_dict[key]:
                vals = mkeys_dict[key]
                min_val = min(vals)
                max_val = max(vals)
                avg_val = statistics.mean(vals)
                med_val = statistics.median(vals)
                print(f"| {num} | {min_val:.2f} | {max_val:.2f} | {avg_val:.2f} | {med_val:.2f} |")
            else:
                print(f"| {num} | N/A | N/A | N/A | N/A |")
        print()  # Empty line between tables.

if __name__ == "__main__":
    main()
```
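For reference, a line of the `go test -bench` output that the parser above matches looks roughly like this (these particular numbers are made up):

```
BenchmarkBinaryFusePopulate/8/n=1000000-10            39          29012345 ns/op                34.47 MKeys/s
```

The trailing `-10` is the GOMAXPROCS suffix, which the script strips with `rsplit('-', 1)`; the float immediately before `MKeys/s` is the value it collects.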
The script runs the benchmarks 10 times and gives me some stats about the number of millions of keys per second populated. Using Go 1.25 and an Apple M4... For your PR I get...

8-bit
16-bit
From our main branch I get...

8-bit
16-bit
So, interestingly, I am getting no gain whatsoever in the 8-bit case, but a measurable gain in the 16-bit case when we have lots of data. Could you run my script (or the equivalent) and see what you find? |
|
The output in the PR description was generated with `benchstat`. It shows the p-value derived from the Mann-Whitney U-test (the probability that you'd get the difference by chance); see the workflow sketch below.
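For reference, a sketch of that workflow (the file paths are just examples; `benchstat` is the tool from golang.org/x/perf/cmd/benchstat):

```
go test -run='^$' -bench=BenchmarkBinaryFusePopulate -count=10 > /tmp/before.txt
# ...switch to the branch with the change...
go test -run='^$' -bench=BenchmarkBinaryFusePopulate -count=10 > /tmp/after.txt
benchstat /tmp/before.txt /tmp/after.txt
```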
Anyway, here's the output of your script on my Apple M1:

This PR, 8-bit

This PR, 16-bit

main, 8-bit

main, 16-bit
It's entirely possible that the M4's improved architecture makes the local-table solution just as good. It is odd that we see an improvement only in the 16-bit case; the fingerprint width shouldn't make the change any more or less effective. I think server machines are more important to look at; I already posted results for the GCE AMD-based machine. Overall it looks like a small improvement, but it's close enough that it's not too convincing. No problem whatsoever if you decide to close this. On a more general note, I tried a bunch of potential improvements, but the code looks like it is already exceptionally well optimized. I had another small potential improvement - to skip the |
During peeling, we can calculate the third index by XORing all known indexes instead of looking it up in the little table. This yields a modest improvement.

On an Apple M1:

```
❯ benchstat /tmp/before.txt /tmp/after.txt
name                               old time/op  new time/op  delta
BinaryFusePopulate/8/n=1000000-10  29.0ms ± 5%  27.3ms ± 1%  -5.77%  (p=0.000 n=10+9)

name                               old MKeys/s  new MKeys/s  delta
BinaryFusePopulate/8/n=1000000-10  34.5 ± 5%    36.6 ± 1%    +6.06%  (p=0.000 n=10+9)
```

On an n4d-standard-8 (AMD Turin):

```
name                              old time/op  new time/op  delta
BinaryFusePopulate/8/n=1000000-8  34.9ms ± 1%  34.4ms ± 1%  -1.44%  (p=0.000 n=10+9)

name                              old MKeys/s  new MKeys/s  delta
BinaryFusePopulate/8/n=1000000-8  28.7 ± 1%    29.1 ± 1%    +1.45%  (p=0.000 n=10+9)
```
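As an aside, here is a minimal, self-contained sketch of the XOR-recovery idea the description alludes to (this is not the repository's actual peeling code; all names and values are hypothetical): if the XOR of a key's three slot indexes is known, the one unknown index can be recovered from the two known ones without a table lookup.

```go
package main

import "fmt"

// thirdIndex recovers the single unknown slot index from the XOR of all
// three indexes and the two indexes that are already known: since
// xorAll = h0 ^ h1 ^ h2, XORing the known indexes back in cancels them
// and leaves the third.
func thirdIndex(xorAll, known1, known2 uint32) uint32 {
	return xorAll ^ known1 ^ known2
}

func main() {
	// Hypothetical slot indexes for one key.
	h0, h1, h2 := uint32(17), uint32(1042), uint32(2089)
	xorAll := h0 ^ h1 ^ h2

	fmt.Println(thirdIndex(xorAll, h0, h1) == h2) // true
	fmt.Println(thirdIndex(xorAll, h1, h2) == h0) // true
}
```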
Force-pushed 787d8e2 to d75fdf8.
|
Rebased onto the newest main. |
|
I was experimenting and hit a Go bug 😆 golang/go#77181 |
|
What do you think of something like this instead: https://github.com/RaduBerinde/xorfilter/pull/new/other-index-2

I get a more significant improvement on n4d (AMD Turin) but a smaller improvement on M1:

N4d:

On N4 (Intel): |
|
I am working on a branchless version of the above. Let me close this for now. |
Just to be clear, I wasn't about to close this or reject it or any such thing. :-) You closed it. Ok, let us move to the other PR. |