PERF: [POC] compute skew and kurtosis with SIMD Using Vector Extensions#64582
Alvaro-Kothe wants to merge 35 commits into
|
#64541 was motivated by something similar. Does doing this in C instead of python make that unnecessary? |
Doing this in C doesn't make #64541 unnecessary. From what I've glimpsed in mesonbuild/meson#13350, using |
|
How much of the perf bump comes from SIMD vs OpenMP? I suspect this will be harder to get merged with multithreading than without (though I'm on record as in favor of it). How much effort would it take to get the speedups on Mac? |
I isolated each of them and ran the benchmarks (results in details). On my machine, Vectorization with AVX2 provides the most performance benefit.
I don't have an ARM machine, but I will try to either ask an AI to translate the AVX2 code to ARM, or use a SIMD wrapper.
Benchmark SIMD x OpenMP
|
|
@jbrockmendel Seems to be working with macOS. |
|
There's been discussion of how to opt in/out of parallelism, and I think we're converging on default-on with opt-out via a max_workers option. |
This is great! But the way it currently is in this PR, OpenMP is a system dependency. If we compile with this dependency, the host must have OpenMP; otherwise there will be a runtime error. |
|
#64541 is written with cython in mind, but the idea there is to always detect it if present rather than require the users to explicitly opt in |
|
I don't know if Cython provides a way to check at runtime whether OpenMP is available, but the following shows the problem with distributing pandas with OpenMP as a runtime dependency:
## Build wheel
$ CC=clang CXX=clang++ python -m build --wheel -Cbuild-dir=build/clang -Csetup-args=-Duse_openmp=true .
## Create a container
$ podman run --rm -it -v $(pwd)/dist:/dist -w /dist docker.io/python:3.14 bash
## create venv
root@dc5d67e6a666:/dist# python -m venv venv
## Install the wheel
root@dc5d67e6a666:/dist# venv/bin/pip install pandas-3.1.0.dev0+422.g69e47c298f-cp314-cp314-linux_x86_64.whl
## print dependencies
root@dc5d67e6a666:/dist# ldd venv/lib64/python3.14/site-packages/pandas/_libs/algos.cpython-314-x86_64-linux-gnu.so
linux-vdso.so.1 (0x00007f66a34cc000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f66a3233000)
libomp.so => not found
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f66a303f000)
/lib64/ld-linux-x86-64.so.2 (0x00007f66a34ce000)
## Import pandas with missing libomp
root@dc5d67e6a666:/dist# venv/bin/python -c 'import pandas'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    import pandas
  File "/dist/venv/lib/python3.14/site-packages/pandas/__init__.py", line 44, in <module>
    import pandas.core.config_init  # pyright: ignore[reportUnusedImport] # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/dist/venv/lib/python3.14/site-packages/pandas/core/config_init.py", line 31, in <module>
    from pandas.errors import Pandas4Warning
  File "/dist/venv/lib/python3.14/site-packages/pandas/errors/__init__.py", line 12, in <module>
    from pandas._libs.tslibs import (
    ...<3 lines>...
    )
  File "/dist/venv/lib/python3.14/site-packages/pandas/_libs/__init__.py", line 18, in <module>
    from pandas._libs.interval import Interval
  File "pandas/_libs/intervaltree.pxi", line 7, in init pandas._libs.interval
ImportError: libomp.so: cannot open shared object file: No such file or directory
For me it's safer to distribute without OpenMP as a runtime dependency. Another option is to vendor OpenMP and either distribute the libomp shared object or statically link it. |
|
In the example above I purposely built with clang because it links against |
|
Essentially no one manually compiles. If that's the only way to enable OpenMP, it will go unused. When OpenMP is unavailable, cython.prange falls back to a regular range. I don't know the details of how/when that check occurs. |
I've updated the PR to automatically use OpenMP when available. However, we still need to decide how to handle requiring OpenMP. I came across a helpful discussion in LightGBM on this topic:
Several projects also use OpenMP; here’s how they distribute it:
|
@jbrockmendel I just verified with |
|
I downloaded one of the Linux wheels; libgomp is being bundled in the wheel.
$ unzip -l pandas-3.1.0.dev0+491.g0c9957a9bc-cp314-cp314-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl "**/libgo*"
Archive:  pandas-3.1.0.dev0+491.g0c9957a9bc-cp314-cp314-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
   253289  03-23-2026 22:24   pandas.libs/libgomp-e985bcbb.so.1.0.0
---------                     -------
   253289                     1 file
And the rpath is correctly set:
$ unzip pandas-3.1.0.dev0+491.g0c9957a9bc-cp314-cp314-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl "**/libgo*" "**/algos.cp*" -d pandas
$ ldd pandas/pandas/_libs/algos.cpython-314-x86_64-linux-gnu.so
        linux-vdso.so.1 (0x00007f43c2181000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f43c1e70000)
        libgomp-e985bcbb.so.1.0.0 => /home/alvaro/Downloads/pandas/pandas/_libs/../../pandas.libs/libgomp-e985bcbb.so.1.0.0 (0x00007f43c1c00000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f43c1a0d000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f43c2183000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f43c1e6c000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f43c1e68000) |
|
On Windows, it vendors Mac doesn't find OpenMP. Probably would have to install it with homebrew. xref: #64541 |
WillAyd left a comment
I would also advise dropping openmp, particularly if the most measurable speedup comes from SIMD
#endif // x86_64 + glibc + target_clones

/* --- SIMD Implementation --- */
#if __has_attribute(ext_vector_type)
There was a problem hiding this comment.
I would advise against having this detection logic built into pandas - I think support can be more complicated than this.
Meson has a built-in function for detecting SIMD support that we should use instead. See https://mesonbuild.com/Simd-module.html
There was a problem hiding this comment.
> I would advise against having this detection logic built into pandas - I think support can be more complicated than this.
I like the vector extensions provided by clang and gcc because they abstract away the architecture and the CPU capabilities (with runtime dispatch provided by target_clones). With a single implementation, we get vectorized code for x86, x86_64, ARM and PowerPC. The drawback is that they aren't supported by MSVC.
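To illustrate the idea above, here is a minimal sketch of a reduction written once with the GCC/clang vector extensions and optionally multiversioned with target_clones. It is not the code from this PR: `sum_of_squares`, the `v4d` type and the `MULTIVERSION` macro are hypothetical names, and it uses `vector_size` (accepted by both gcc and clang) rather than the `ext_vector_type` attribute that the PR checks for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>   /* on glibc this also makes __GLIBC__ visible below */
#include <string.h>

/* GCC/clang vector extension: four doubles handled lane-wise.
   MSVC does not support this attribute. */
typedef double v4d __attribute__((vector_size(4 * sizeof(double))));

/* Assumption: only request function multiversioning on x86-64 + glibc,
   where target_clones can rely on ifunc; elsewhere the macro is empty. */
#if defined(__x86_64__) && defined(__GLIBC__) && defined(__has_attribute)
#  if __has_attribute(target_clones)
#    define MULTIVERSION __attribute__((target_clones("default", "avx2")))
#  endif
#endif
#ifndef MULTIVERSION
#  define MULTIVERSION
#endif

/* Hypothetical helper (not from the PR): sum of x[i]^2 over x[0..n). */
MULTIVERSION
double sum_of_squares(const double *x, size_t n) {
    assert(x != NULL || n == 0);
    v4d acc = {0.0, 0.0, 0.0, 0.0};
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        v4d v;
        memcpy(&v, x + i, sizeof v); /* unaligned load into the vector */
        acc += v * v;                /* lane-wise multiply and add */
    }
    /* Horizontal sum of the four lanes, then the scalar tail. */
    double total = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < n; i++) {
        total += x[i] * x[i];
    }
    return total;
}
```

The same source compiles for any target gcc or clang supports; the compiler lowers `v4d` to whatever vector width the target offers, and the clone selection happens at load time where target_clones is active.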
> Meson has a built-in function for detecting SIMD support that we should use instead. See https://mesonbuild.com/Simd-module.html
Can you clarify what you have in mind? From what I've seen in the usage, we would have to provide an implementation for each target and also a function to check for cpu capability.
There was a problem hiding this comment.
> I would also advise dropping openmp, particularly if the most measurable speedup comes from SIMD
The combined performance increase is considerable: SIMD alone gave a 2-4x speedup, while SIMD and OpenMP together gave 3-12x.
I think a middle ground is to not build the wheel with OpenMP, or to leave OpenMP disabled by default.
There was a problem hiding this comment.
- is there any reason why this has to be all at once?
- ENH: Add OpenMP detection for cython.prange support #64541 is intended to make prange Just Work from cython. Would that also make it Just Work directly in C?
- could we do the simd-part of the implementation in C and iterate over columns using prange in cython? To the extent we can keep logic in cython (without a perf hit), that makes maintenance easier.
There was a problem hiding this comment.
> - is there any reason why this has to be all at once?
It doesn't have to be all at once. I can split this PR if necessary.
> - ENH: Add OpenMP detection for cython.prange support #64541 is intended to make prange Just Work from cython. Would that also make it Just Work directly in C?
Yes. What matters is compiling with -fopenmp.
> - could we do the simd-part of the implementation in C and iterate over columns using prange in cython? To the extent we can keep logic in cython (without a perf hit), that makes maintenance easier.
Yes, it's also possible to choose which one to parallelize. OpenMP, by default, uses a maximum parallelization level of 1. If the outer loop runs in parallel, calling this function won't spawn any more threads. If the outer loop doesn't run in parallel, this function can still run in parallel. It's possible to control this through the if clause.
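As a rough sketch of the `if` clause mentioned above (not the PR's code): `column_sum` and the `PARALLEL_THRESHOLD` value are hypothetical, and the threshold is an arbitrary assumption rather than a tuned number. When the file is compiled without -fopenmp, the pragma is ignored and the loop simply runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cutoff below which threading is not worth the overhead. */
#define PARALLEL_THRESHOLD 100000

/* Hypothetical helper: sum of x[0..n), threaded only for large inputs. */
double column_sum(const double *x, size_t n) {
    assert(x != NULL || n == 0);
    double total = 0.0;
    long i;
    /* The if clause makes the region run on a single thread when the
       condition is false. Without -fopenmp the pragma has no effect. */
    #pragma omp parallel for reduction(+ : total) if (n > PARALLEL_THRESHOLD)
    for (i = 0; i < (long)n; i++) {
        total += x[i];
    }
    return total;
}
```

If a caller already runs `column_sum` from inside its own parallel loop, the default nesting level described above means this inner region would not spawn further threads, so the same function can serve both cases.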
Just to clarify, this PR only touches the scalar reduction. The performance increase in stat_ops.FrameOps.time_op('skew', 'Int64', 0) was because the compiler decided to call the scalar reduction. I didn't modify the logic from accumulate_moments_axis.
There was a problem hiding this comment.
> Can you clarify what you have in mind? From what I've seen in the usage, we would have to provide an implementation for each target and also a function to check for cpu capability.
Hmm OK - I'm not very familiar with that gcc intrinsic; I would check upstream in Meson to see if anyone has used that alongside the simd module. As a convention though, we have only ever really stuck to the C standard for writing C extensions in pandas. That could leave some performance on the table, but we don't have a good infrastructure for heavily customized C/C++ development.
The pattern described in the documentation is fairly common though; you can see that Arrow does something similar with their compute modules:
https://github.com/apache/arrow/tree/main/cpp/src/arrow/compute/kernels
There was a problem hiding this comment.
@WillAyd I did some experimentation with meson's simd module on #64905 and one of the problems that I found is that it doesn't detect neon support (mesonbuild/meson#11209) and it doesn't support AVX512 instructions (mesonbuild/meson#2085).
There was a problem hiding this comment.
> I'm not very familiar with that gcc intrinsic; I would check upstream in Meson to see if anyone has used that alongside the simd module.
I didn't manage to find it. It also seems that it cannot be used with the target_clones and target attributes, because all versions of the function must be in the same translation unit.
It may be possible to use the simd module with the vector extensions alone, but we would need to manage the naming, multiversioning, linking and runtime dispatch ourselves.
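For a sense of what managing the dispatch by hand could involve, here is a minimal sketch under several assumptions. All function names are hypothetical. In a build based on the simd module each variant would presumably live in its own file compiled with its own flags; here both variants sit in one file, and the AVX2 one uses the target attribute only to keep the sketch self-contained. The CPU check uses the gcc/clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, which are x86-specific.

```c
#include <assert.h>
#include <stddef.h>

typedef double (*sum_fn)(const double *, size_t);

/* Baseline variant, valid on every target. */
static double sum_scalar(const double *x, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++) total += x[i];
    return total;
}

#if defined(__x86_64__) && defined(__GNUC__)
#define HAVE_AVX2_VARIANT 1
/* Same loop, but the compiler is allowed to emit AVX2 instructions for it. */
static __attribute__((target("avx2")))
double sum_avx2(const double *x, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++) total += x[i];
    return total;
}
#endif

/* Choose an implementation from the CPU capabilities seen at runtime. */
static sum_fn resolve_sum(void) {
#ifdef HAVE_AVX2_VARIANT
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_scalar;
}

/* Public entry point: resolves once, then calls through the pointer.
   The lazy initialisation is not synchronised; every thread would
   resolve to the same pointer, but a real implementation may want
   a proper once-only initialiser. */
double dispatch_sum(const double *x, size_t n) {
    static sum_fn impl = NULL;
    assert(x != NULL || n == 0);
    if (impl == NULL) impl = resolve_sum();
    return impl(x, n);
}
```

This is the bookkeeping that target_clones otherwise generates automatically: one symbol per variant, a resolver, and a single public name.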
|
Closing in favour of more portable approaches across different compilers. |
Continuation of #64366; Closes pandas-dev/asv-runner#110
This PR increases performance of moments accumulator through parallelization (with opt-in OpenMP) and SIMD (specific for clang and gcc).
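For readers unfamiliar with what a moments accumulator computes, the following is a plain scalar sketch, not the PR's implementation. It uses a simple two-pass scheme and the population (biased) central moments; `central_moments` and `accumulate_central_moments` are hypothetical names, and the PR's accumulator and pandas' bias-corrected estimators may differ. From these moments, skewness is m3 / m2^(3/2) and excess kurtosis is m4 / m2^2 - 3.

```c
#include <assert.h>
#include <stddef.h>

/* Population central moments of order 2, 3 and 4. */
typedef struct {
    double m2;
    double m3;
    double m4;
} central_moments;

/* Hypothetical helper: two-pass accumulation over x[0..n), n > 0. */
central_moments accumulate_central_moments(const double *x, size_t n) {
    assert(x != NULL && n > 0);

    /* First pass: the mean. */
    double mean = 0.0;
    for (size_t i = 0; i < n; i++) mean += x[i];
    mean /= (double)n;

    /* Second pass: sums of powers of the deviations from the mean. */
    central_moments m = {0.0, 0.0, 0.0};
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - mean;
        double d2 = d * d;
        m.m2 += d2;
        m.m3 += d2 * d;
        m.m4 += d2 * d2;
    }
    m.m2 /= (double)n;
    m.m3 /= (double)n;
    m.m4 /= (double)n;
    return m;
}
```

The second loop is the kind of reduction that SIMD and threading target: each element contributes independently to three running sums.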
-Dopenmp=disabled. x86_64 option podman run --rm -it -v $(pwd):/src:z -w /src quay.io/pypa/manylinux_2_28_i686 gcc -S -m32 -Ipandas/_libs/include pandas/_libs/src/moments.c -O2 -fverbose-asm -o moments_x86.s):
Benchmarks
AVX2 Benchmark
AVX2 + OpenMP Benchmark