PERF: [POC] compute skew and kurtosis with SIMD Using Vector Extensions by Alvaro-Kothe · Pull Request #64582 · pandas-dev/pandas

Alvaro-Kothe · 2026-03-13T13:55:42Z

Continuation of #64366; Closes pandas-dev/asv-runner#110

This PR improves the performance of the moments accumulator through parallelization (with opt-in OpenMP) and SIMD (clang and gcc only).

OpenMP provides the parallelization. It is used if detected and can be disabled with -Dopenmp=disabled.
- xgboost uses OpenMP, but they vendor it in the wheel.
- NumPy (or SciPy) does something similar: they vendor OpenBLAS in the wheel.
It's possible to control the OpenMP threads with threadpoolctl.
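
As a rough illustration of the threadpoolctl remark, the sketch below caps OpenMP threads around a block of work. It assumes the third-party threadpoolctl package is installed and silently falls back to a no-op otherwise; `openmp_thread_limit` is a hypothetical helper name, not something this PR or pandas provides.

```python
import contextlib

try:
    # threadpoolctl is an optional third-party package
    from threadpoolctl import threadpool_limits
except ImportError:
    threadpool_limits = None


def openmp_thread_limit(n_threads):
    """Return a context manager that caps OpenMP threads (hypothetical helper)."""
    if threadpool_limits is None:
        # Nothing to limit without threadpoolctl; behave as a no-op.
        return contextlib.nullcontext()
    return threadpool_limits(limits=n_threads, user_api="openmp")


with openmp_thread_limit(1):
    # Any OpenMP-backed computation started here would use at most one thread.
    result = sum(range(10))
```

The limit only applies to OpenMP runtimes that threadpoolctl can find in the current process, so it has no effect on a build without OpenMP.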
SIMD only works when compiling with clang or gcc, because it relies on the vector extensions provided by those compilers.
- For x86_64, there are two versions of this function: one with AVX2 and the default.
  - The version is chosen at runtime depending on CPU capability.
  - I think that the default is similar to the x86_64 option:
  A generic CPU with 64-bit extensions, MMX, SSE, SSE2, and FXSR instruction set support.
- The other targets get a single, but still vectorized, version: aarch64 uses NEON and x86 uses SSE2. For example, here is a bit of the assembly generated for x86 (podman run --rm -it -v $(pwd):/src:z -w /src quay.io/pypa/manylinux_2_28_i686 gcc -S -m32 -Ipandas/_libs/include pandas/_libs/src/moments.c -O2 -fverbose-asm -o moments_x86.s):
```
# pandas/_libs/src/moments.c:89:     v_n += v_n_increment;
movapd	480(%esp), %xmm1	#, tmp600
addpd	768(%esp), %xmm1	# v_n, _506
movapd	496(%esp), %xmm0	#, tmp601
addpd	784(%esp), %xmm0	# v_n, _509
```
MSVC doesn't have vector extensions, so a SIMD version for it would have to be implemented with intrinsics (for a lot of different architectures and CPU capabilities), use a SIMD wrapper, or be written in a form that MSVC can hopefully auto-vectorize.
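
To make the idea behind statements like `v_n += v_n_increment` more concrete, here is a minimal pure-Python sketch of lane-wise accumulation followed by a horizontal reduction. It is not the code in moments.c: the lane count, the two-pass approach and the function names are assumptions chosen for illustration, and it computes the biased skewness and excess kurtosis without any bias correction.

```python
import math

LANES = 4  # assumed vector width: four doubles fit in one 256-bit AVX2 register


def lane_sum(values, power, center=0.0):
    """Sum (x - center) ** power using LANES independent accumulators."""
    acc = [0.0] * LANES
    for i, x in enumerate(values):
        # Element i goes to lane i % LANES, as a vector add would do
        # for LANES consecutive elements at once.
        acc[i % LANES] += (x - center) ** power
    # Horizontal reduction: combine the per-lane partial sums.
    return math.fsum(acc)


def skew_kurt(values):
    """Biased skewness and excess kurtosis from central moments."""
    n = len(values)
    mean = lane_sum(values, 1) / n
    m2 = lane_sum(values, 2, mean) / n
    m3 = lane_sum(values, 3, mean) / n
    m4 = lane_sum(values, 4, mean) / n
    return m3 / m2**1.5, m4 / m2**2 - 3.0


skew, kurt = skew_kurt([1.0, 2.0, 3.0, 4.0, 5.0])
```

Because each lane keeps its own partial sum, the additions inside the loop do not depend on each other, which is what lets a compiler map them onto a single vector instruction.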

Benchmarks

AVX2 Benchmark

Change	Before [`c6ee315`]	After [`69e47c2`] <perf/skew-kurt-omp>	Ratio	Benchmark (Parameter)
-	80.3±0.3μs	71.6±1μs	0.89	groupby.GroupByMethods.time_dtype_as_field('int16', 'skew', 'direct', 1, 'cython')
-	81.8±0.3μs	72.7±2μs	0.89	groupby.GroupByMethods.time_dtype_as_group('int', 'skew', 'direct', 1, 'cython')
-	81.7±0.1μs	72.7±1μs	0.89	groupby.GroupByMethods.time_dtype_as_group('uint', 'skew', 'direct', 1, 'cython')
-	84.0±20μs	73.6±0.7μs	0.88	groupby.GroupByMethods.time_dtype_as_group('int16', 'skew', 'direct', 1, 'cython')
-	83.2±3μs	72.6±1μs	0.87	groupby.GroupByMethods.time_dtype_as_field('int', 'skew', 'direct', 1, 'cython')
-	85.6±2μs	72.9±1μs	0.85	groupby.GroupByMethods.time_dtype_as_group('float', 'skew', 'direct', 1, 'cython')
-	21.2±0.7μs	15.1±0.07μs	0.71	series_methods.NanOps.time_func('skew', 1000, 'boolean')
-	21.9±0.6μs	15.4±0.07μs	0.7	series_methods.NanOps.time_func('kurt', 1000, 'Int64')
-	17.6±0.07μs	12.4±0.2μs	0.7	series_methods.NanOps.time_func('kurt', 1000, 'int64')
-	17.8±0.06μs	12.3±0.4μs	0.69	series_methods.NanOps.time_func('kurt', 1000, 'float64')
-	17.6±0.07μs	12.0±0.5μs	0.68	series_methods.NanOps.time_func('kurt', 1000, 'int32')
-	17.4±0.04μs	11.8±0.3μs	0.68	series_methods.NanOps.time_func('skew', 1000, 'int32')
-	17.6±0.2μs	12.0±0.2μs	0.68	series_methods.NanOps.time_func('skew', 1000, 'int64')
-	17.4±0.03μs	11.8±0.1μs	0.68	series_methods.NanOps.time_func('skew', 1000, 'int8')
-	17.6±0.1μs	11.8±0.1μs	0.67	series_methods.NanOps.time_func('kurt', 1000, 'int8')
-	17.9±0.2μs	12.0±0.3μs	0.67	series_methods.NanOps.time_func('skew', 1000, 'float64')
-	28.1±9μs	15.5±0.5μs	0.55	series_methods.NanOps.time_func('kurt', 1000, 'boolean')
-	4.62±0.02ms	2.07±0.05ms	0.45	stat_ops.FrameOps.time_op('kurt', 'Int64', None)
-	4.52±0.01ms	1.98±0.01ms	0.44	stat_ops.FrameOps.time_op('skew', 'Int64', None)
-	4.47±0.01ms	1.93±0.01ms	0.43	stat_ops.FrameOps.time_op('kurt', 'int', None)
-	4.37±0.04ms	1.82±0ms	0.42	stat_ops.FrameOps.time_op('skew', 'int', None)
-	4.31±0ms	1.70±0.02ms	0.39	stat_ops.FrameOps.time_op('kurt', 'Int64', 0)
-	9.72±0.01ms	3.70±0.02ms	0.38	series_methods.NanOps.time_func('kurt', `1000000`, 'Int64')
-	4.24±0.04ms	1.60±0.01ms	0.38	stat_ops.FrameOps.time_op('skew', 'Int64', 0)
-	9.68±0.03ms	3.47±0.04ms	0.36	series_methods.NanOps.time_func('kurt', `1000000`, 'int64')
-	9.44±0.01ms	3.44±0.01ms	0.36	series_methods.NanOps.time_func('skew', `1000000`, 'Int64')
-	9.39±0.06ms	3.35±0.03ms	0.36	series_methods.NanOps.time_func('skew', `1000000`, 'int64')
-	9.39±0.03ms	3.31±0.01ms	0.35	series_methods.NanOps.time_func('kurt', `1000000`, 'boolean')
-	4.08±0.01ms	1.41±0ms	0.35	stat_ops.FrameOps.time_op('kurt', 'int', 0)
-	3.95±0.01ms	1.39±0.01ms	0.35	stat_ops.FrameOps.time_op('skew', 'int', 0)
-	9.08±0.01ms	3.07±0.02ms	0.34	series_methods.NanOps.time_func('skew', `1000000`, 'boolean')
-	9.12±0.02ms	3.13±0.02ms	0.34	series_methods.NanOps.time_func('skew', `1000000`, 'int32')
-	9.70±0.3ms	3.21±0.03ms	0.33	series_methods.NanOps.time_func('kurt', `1000000`, 'int32')
-	9.32±0.01ms	3.11±0.01ms	0.33	series_methods.NanOps.time_func('kurt', `1000000`, 'int8')
-	9.01±0.01ms	2.98±0.02ms	0.33	series_methods.NanOps.time_func('skew', `1000000`, 'int8')
-	973±4μs	307±1μs	0.32	stat_ops.SeriesOps.time_op('kurt', 'int')
-	939±3μs	298±0.8μs	0.32	stat_ops.SeriesOps.time_op('skew', 'int')
-	3.87±0.02ms	1.17±0.03ms	0.3	stat_ops.FrameMultiIndexOps.time_op('kurt')
-	3.71±0.01ms	1.13±0ms	0.3	stat_ops.FrameMultiIndexOps.time_op('skew')
-	3.84±0.01ms	1.17±0.01ms	0.3	stat_ops.FrameOps.time_op('kurt', 'float', 0)
-	3.69±0.02ms	1.11±0.01ms	0.3	stat_ops.FrameOps.time_op('skew', 'float', 0)
-	8.74±0.01ms	2.44±0.01ms	0.28	series_methods.NanOps.time_func('kurt', `1000000`, 'float64')
-	8.40±0.01ms	2.34±0ms	0.28	series_methods.NanOps.time_func('skew', `1000000`, 'float64')
-	3.67±0ms	1.02±0ms	0.28	stat_ops.FrameOps.time_op('kurt', 'float', None)
-	911±5μs	251±0.7μs	0.28	stat_ops.SeriesMultiIndexOps.time_op('kurt')
-	878±7μs	242±1μs	0.28	stat_ops.SeriesOps.time_op('skew', 'float')
-	3.62±0.09ms	971±3μs	0.27	stat_ops.FrameOps.time_op('skew', 'float', None)
-	884±10μs	242±1μs	0.27	stat_ops.SeriesMultiIndexOps.time_op('skew')
-	910±3μs	249±1μs	0.27	stat_ops.SeriesOps.time_op('kurt', 'float')

AVX2 + OpenMP Benchmark

Change	Before [`c6ee315`]	After [36356568] <perf/skew-kurt-omp>	Ratio	Benchmark (Parameter)
-	81.8±0.4μs	72.1±1μs	0.88	groupby.GroupByMethods.time_dtype_as_field('int', 'skew', 'direct', 1, 'cython')
-	81.1±0.6μs	71.1±0.7μs	0.88	groupby.GroupByMethods.time_dtype_as_field('int16', 'skew', 'direct', 1, 'cython')
-	60.1±2μs	52.3±1μs	0.87	groupby.GroupByMethods.time_dtype_as_field('float', 'skew', 'direct', 1, 'cython')
-	83.3±2μs	71.6±1μs	0.86	groupby.GroupByMethods.time_dtype_as_field('uint', 'skew', 'direct', 1, 'cython')
-	82.5±0.6μs	71.0±0.3μs	0.86	groupby.GroupByMethods.time_dtype_as_group('uint', 'skew', 'direct', 1, 'cython')
-	84.8±2μs	71.7±0.2μs	0.85	groupby.GroupByMethods.time_dtype_as_group('float', 'skew', 'direct', 1, 'cython')
-	85.0±2μs	71.3±0.2μs	0.84	groupby.GroupByMethods.time_dtype_as_group('int', 'skew', 'direct', 1, 'cython')
-	88.3±1μs	71.4±0.4μs	0.81	groupby.GroupByMethods.time_dtype_as_group('int16', 'skew', 'direct', 1, 'cython')
-	21.8±0.3μs	17.4±0.08μs	0.8	series_methods.NanOps.time_func('kurt', 1000, 'Int64')
-	17.6±0.1μs	14.2±0.5μs	0.8	series_methods.NanOps.time_func('kurt', 1000, 'float64')
-	17.7±0.1μs	14.1±0.4μs	0.8	series_methods.NanOps.time_func('kurt', 1000, 'int8')
-	21.4±0.1μs	17.0±0.2μs	0.8	series_methods.NanOps.time_func('skew', 1000, 'Int64')
-	17.6±0.08μs	13.9±0.6μs	0.79	series_methods.NanOps.time_func('kurt', 1000, 'int32')
-	21.6±0.4μs	17.0±0.1μs	0.79	series_methods.NanOps.time_func('skew', 1000, 'boolean')
-	21.9±0.6μs	17.0±0.1μs	0.77	series_methods.NanOps.time_func('kurt', 1000, 'boolean')
-	17.5±0.1μs	13.5±0.3μs	0.77	series_methods.NanOps.time_func('skew', 1000, 'float64')
-	18.2±0.5μs	13.8±0.2μs	0.76	series_methods.NanOps.time_func('kurt', 1000, 'int64')
-	17.5±0.1μs	13.3±0.2μs	0.76	series_methods.NanOps.time_func('skew', 1000, 'int32')
-	17.7±0.08μs	13.5±0.2μs	0.76	series_methods.NanOps.time_func('skew', 1000, 'int64')
-	18.4±0.09μs	13.6±0.1μs	0.74	series_methods.NanOps.time_func('skew', 1000, 'int8')
-	4.50±0.03ms	1.50±0.02ms	0.33	stat_ops.FrameOps.time_op('kurt', 'int', None)
-	4.66±0.1ms	1.47±0.02ms	0.32	stat_ops.FrameOps.time_op('kurt', 'Int64', None)
-	4.37±0.02ms	1.38±0.01ms	0.32	stat_ops.FrameOps.time_op('skew', 'int', None)
-	4.50±0.01ms	1.36±0.01ms	0.3	stat_ops.FrameOps.time_op('skew', 'Int64', None)
-	4.31±0.01ms	1.21±0ms	0.28	stat_ops.FrameOps.time_op('kurt', 'Int64', 0)
-	4.24±0.02ms	1.10±0.01ms	0.26	stat_ops.FrameOps.time_op('skew', 'Int64', 0)
-	9.73±0.04ms	2.20±0ms	0.23	series_methods.NanOps.time_func('kurt', `1000000`, 'Int64')
-	9.42±0.01ms	2.05±0.7ms	0.22	series_methods.NanOps.time_func('skew', `1000000`, 'Int64')
-	4.10±0.02ms	921±3μs	0.22	stat_ops.FrameOps.time_op('kurt', 'int', 0)
-	9.69±0.02ms	2.08±0.02ms	0.21	series_methods.NanOps.time_func('kurt', `1000000`, 'int64')
-	3.98±0.01ms	817±3μs	0.21	stat_ops.FrameOps.time_op('skew', 'int', 0)
-	967±3μs	189±5μs	0.2	stat_ops.SeriesOps.time_op('kurt', 'int')
-	9.42±0.02ms	1.76±0.02ms	0.19	series_methods.NanOps.time_func('kurt', `1000000`, 'boolean')
-	9.44±0.01ms	1.77±0ms	0.19	series_methods.NanOps.time_func('kurt', `1000000`, 'int32')
-	9.37±0.03ms	1.80±0.01ms	0.19	series_methods.NanOps.time_func('skew', `1000000`, 'int64')
-	9.37±0.3ms	1.66±0.02ms	0.18	series_methods.NanOps.time_func('kurt', `1000000`, 'int8')
-	944±4μs	160±0.6μs	0.17	stat_ops.SeriesOps.time_op('skew', 'int')
-	9.11±0.03ms	1.49±0.02ms	0.16	series_methods.NanOps.time_func('skew', `1000000`, 'boolean')
-	9.12±0.03ms	1.49±0ms	0.16	series_methods.NanOps.time_func('skew', `1000000`, 'int32')
-	8.99±0.01ms	1.37±0.01ms	0.15	series_methods.NanOps.time_func('skew', `1000000`, 'int8')
-	3.86±0.03ms	553±7μs	0.14	stat_ops.FrameMultiIndexOps.time_op('kurt')
-	3.81±0ms	544±6μs	0.14	stat_ops.FrameOps.time_op('kurt', 'float', 0)
-	3.72±0.01ms	446±4μs	0.12	stat_ops.FrameMultiIndexOps.time_op('skew')
-	3.69±0.01ms	436±6μs	0.12	stat_ops.FrameOps.time_op('skew', 'float', 0)
-	913±4μs	111±1μs	0.12	stat_ops.SeriesMultiIndexOps.time_op('kurt')
-	911±2μs	110±0.9μs	0.12	stat_ops.SeriesOps.time_op('kurt', 'float')
-	8.73±0.01ms	970±10μs	0.11	series_methods.NanOps.time_func('kurt', `1000000`, 'float64')
-	3.67±0ms	404±6μs	0.11	stat_ops.FrameOps.time_op('kurt', 'float', None)
-	880±0.6μs	83.9±0.8μs	0.1	stat_ops.SeriesMultiIndexOps.time_op('skew')
-	875±1μs	84.5±4μs	0.1	stat_ops.SeriesOps.time_op('skew', 'float')
-	8.41±0.02ms	732±1000μs	0.09	series_methods.NanOps.time_func('skew', `1000000`, 'float64')
-	3.55±0.01ms	299±2μs	0.08	stat_ops.FrameOps.time_op('skew', 'float', None)

jbrockmendel · 2026-03-13T15:31:57Z

#64541 was motivated by something similar. Does doing this in C instead of python make that unnecessary?

Alvaro-Kothe · 2026-03-13T15:57:55Z

> #64541 was motivated by something similar. Does doing this in C instead of python make that unnecessary?

Doing this in C doesn't make #64541 unnecessary. From what I've glimpsed in mesonbuild/meson#13350, meson>=1.5 can use OpenMP installed through Homebrew when compiling with Apple's clang.

jbrockmendel · 2026-03-18T21:30:51Z

How much of the perf bump comes from SIMD vs openmp? I suspect this will be harder to get merged with multithreading than without (though I'm on record as in favor of it).

How much effort would it take to get the speedups on Mac?

Alvaro-Kothe · 2026-03-18T22:23:18Z

> How much of the perf bump comes from SIMD vs openmp?

I isolated each of them and ran the benchmarks (results in details). On my machine, vectorization with AVX2 provides the most performance benefit.

> How much effort would it take to get the speedups on Mac?

I don't have an ARM machine, but I will try to either ask an AI to translate the AVX2 code to ARM, or use a SIMD wrapper.

Benchmark SIMD x OpenMP

Change	Before [15f19a7f] <perf/simd-only>	After [4619ebbf] <perf/openmp-only>	Ratio	Benchmark (Parameter)
+	11.5±0.07μs	23.4±0.1μs	2.03	series_methods.NanOps.time_func('skew', 1000, 'float64')
+	11.6±0.08μs	23.1±0.1μs	2	series_methods.NanOps.time_func('kurt', 1000, 'int8')
+	11.6±0.07μs	23.1±0.06μs	1.99	series_methods.NanOps.time_func('kurt', 1000, 'int32')
+	12.1±0.3μs	23.5±0.2μs	1.94	series_methods.NanOps.time_func('kurt', 1000, 'int64')
+	11.5±0.05μs	22.3±0.2μs	1.94	series_methods.NanOps.time_func('skew', 1000, 'int64')
+	12.1±0.2μs	23.2±0.05μs	1.92	series_methods.NanOps.time_func('kurt', 1000, 'float64')
+	11.7±0.3μs	22.1±0.1μs	1.88	series_methods.NanOps.time_func('skew', 1000, 'int32')
+	12.0±0.08μs	22.6±0.6μs	1.87	series_methods.NanOps.time_func('skew', 1000, 'int8')
+	14.9±0.1μs	27.3±0.08μs	1.83	series_methods.NanOps.time_func('skew', 1000, 'Int64')
+	14.7±0.1μs	26.9±0.05μs	1.83	series_methods.NanOps.time_func('skew', 1000, 'boolean')
+	15.0±0.3μs	27.2±0.09μs	1.81	series_methods.NanOps.time_func('kurt', 1000, 'Int64')
+	15.1±0.2μs	27.0±0.2μs	1.79	series_methods.NanOps.time_func('kurt', 1000, 'boolean')
+	304±2μs	430±20μs	1.41	stat_ops.SeriesOps.time_op('kurt', 'int')
+	248±1μs	341±2μs	1.38	stat_ops.SeriesMultiIndexOps.time_op('kurt')
+	246±2μs	338±1μs	1.38	stat_ops.SeriesOps.time_op('kurt', 'float')
+	1.64±0.02ms	2.22±0.01ms	1.35	stat_ops.FrameOps.time_op('kurt', 'Int64', 0)
+	239±2μs	323±2μs	1.35	stat_ops.SeriesMultiIndexOps.time_op('skew')
+	294±0.9μs	398±2μs	1.35	stat_ops.SeriesOps.time_op('skew', 'int')
+	239±1μs	320±4μs	1.34	stat_ops.SeriesOps.time_op('skew', 'float')
+	3.16±0.03ms	4.21±0.1ms	1.33	series_methods.NanOps.time_func('kurt', `1000000`, 'boolean')
+	1.00±0ms	1.33±0ms	1.33	stat_ops.FrameOps.time_op('kurt', 'float', None)
+	1.40±0.01ms	1.87±0.01ms	1.33	stat_ops.FrameOps.time_op('kurt', 'int', 0)
+	2.42±0ms	3.17±0.01ms	1.31	series_methods.NanOps.time_func('kurt', `1000000`, 'float64')
+	1.14±0.01ms	1.49±0ms	1.31	stat_ops.FrameOps.time_op('kurt', 'float', 0)
+	3.57±0.06ms	4.63±0.01ms	1.3	series_methods.NanOps.time_func('kurt', `1000000`, 'Int64')
+	1.15±0ms	1.49±0.01ms	1.3	stat_ops.FrameMultiIndexOps.time_op('kurt')
+	1.37±0ms	1.78±0.01ms	1.3	stat_ops.FrameOps.time_op('skew', 'int', 0)
+	1.85±0.01ms	2.38±0.01ms	1.29	stat_ops.FrameOps.time_op('kurt', 'int', None)
+	1.61±0.02ms	2.08±0.01ms	1.29	stat_ops.FrameOps.time_op('skew', 'Int64', 0)
+	966±3μs	1.25±0ms	1.29	stat_ops.FrameOps.time_op('skew', 'float', None)
+	1.09±0ms	1.40±0.01ms	1.28	stat_ops.FrameOps.time_op('skew', 'float', 0)
+	2.29±0.01ms	2.92±0.01ms	1.27	series_methods.NanOps.time_func('skew', `1000000`, 'float64')
+	1.10±0ms	1.40±0ms	1.27	stat_ops.FrameMultiIndexOps.time_op('skew')
+	1.81±0ms	2.30±0.01ms	1.27	stat_ops.FrameOps.time_op('skew', 'int', None)
+	3.22±0.02ms	3.98±0.04ms	1.24	series_methods.NanOps.time_func('kurt', `1000000`, 'int32')
+	3.38±0.01ms	4.20±0ms	1.24	series_methods.NanOps.time_func('skew', `1000000`, 'Int64')
+	3.02±0.01ms	3.76±0.01ms	1.24	series_methods.NanOps.time_func('skew', `1000000`, 'boolean')
+	3.45±0.03ms	4.24±0.01ms	1.23	series_methods.NanOps.time_func('kurt', `1000000`, 'int64')
+	3.10±0.02ms	3.81±0.01ms	1.23	series_methods.NanOps.time_func('kurt', `1000000`, 'int8')
+	2.95±0.01ms	3.60±0.03ms	1.22	series_methods.NanOps.time_func('skew', `1000000`, 'int8')
+	3.06±0.02ms	3.71±0.01ms	1.21	series_methods.NanOps.time_func('skew', `1000000`, 'int32')
+	3.31±0.02ms	4.00±0.01ms	1.21	series_methods.NanOps.time_func('skew', `1000000`, 'int64')
+	2.04±0.01ms	2.45±0.1ms	1.2	stat_ops.FrameOps.time_op('kurt', 'Int64', None)
+	1.96±0.01ms	2.32±0.02ms	1.18	stat_ops.FrameOps.time_op('skew', 'Int64', None)

Alvaro-Kothe · 2026-03-19T10:55:17Z

@jbrockmendel It seems to be working on macOS.

jbrockmendel · 2026-03-22T20:11:11Z

There's been discussion of how to opt in/out of parallelism, and I think we're converging on default-on with opt-out via a max_workers option.

Alvaro-Kothe · 2026-03-22T20:25:05Z

> There's been discussion of how to opt in/out of parallelism, and I think we're converging on default-on with opt-out via a max_workers option.

This is great! But the way it currently is in this PR, OpenMP is a system dependency. If we compile with this dependency, the host must have OpenMP installed; otherwise there will be a runtime error.

jbrockmendel · 2026-03-22T20:31:15Z

#64541 is written with cython in mind, but the idea there is to always detect it if present rather than require the users to explicitly opt in

Alvaro-Kothe · 2026-03-22T21:48:44Z

I don't know if Cython provides a way to check at runtime whether OpenMP is available, but the following shows the problem of distributing pandas with OpenMP as a runtime dependency:

  ## Build wheel
$ CC=clang CXX=clang++ python -m build  --wheel -Cbuild-dir=build/clang -Csetup-args=-Duse_openmp=true .
  ## Create a container
$ podman run --rm -it -v $(pwd)/dist:/dist -w /dist docker.io/python:3.14 bash
  ## create venv
root@dc5d67e6a666:/dist# python -m venv venv
  ## Install the wheel
root@dc5d67e6a666:/dist# venv/bin/pip install pandas-3.1.0.dev0+422.g69e47c298f-cp314-cp314-linux_x86_64.whl
  ## print dependencies
root@dc5d67e6a666:/dist# ldd venv/lib64/python3.14/site-packages/pandas/_libs/algos.cpython-314-x86_64-linux-gnu.so
        linux-vdso.so.1 (0x00007f66a34cc000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f66a3233000)
        libomp.so => not found
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f66a303f000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f66a34ce000)
   ## Import pandas with missing libomp
root@dc5d67e6a666:/dist# venv/bin/python -c 'import pandas'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    import pandas
  File "/dist/venv/lib/python3.14/site-packages/pandas/__init__.py", line 44, in <module>
    import pandas.core.config_init  # pyright: ignore[reportUnusedImport] # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/dist/venv/lib/python3.14/site-packages/pandas/core/config_init.py", line 31, in <module>
    from pandas.errors import Pandas4Warning
  File "/dist/venv/lib/python3.14/site-packages/pandas/errors/__init__.py", line 12, in <module>
    from pandas._libs.tslibs import (
    ...<3 lines>...
    )
  File "/dist/venv/lib/python3.14/site-packages/pandas/_libs/__init__.py", line 18, in <module>
    from pandas._libs.interval import Interval
  File "pandas/_libs/intervaltree.pxi", line 7, in init pandas._libs.interval
ImportError: libomp.so: cannot open shared object file: No such file or directory

For me it's safer to distribute without OpenMP as a runtime dependency. Another option is to vendor OpenMP and either distribute the libomp shared object or statically link it.
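
For illustration only, a Python-level probe for an installed OpenMP runtime could look like the sketch below. It uses `ctypes.util.find_library` from the standard library; the library names `omp` (LLVM) and `gomp` (GCC) match the ones discussed here, while `available_openmp_runtimes` is a hypothetical helper, not something pandas ships.

```python
from ctypes.util import find_library

# Common OpenMP runtime names: libomp (LLVM/clang) and libgomp (GCC).
OPENMP_RUNTIMES = ("omp", "gomp")


def available_openmp_runtimes():
    """Map each runtime name to what the system library lookup finds, or None."""
    return {name: find_library(name) for name in OPENMP_RUNTIMES}


found = available_openmp_runtimes()
for name, path in found.items():
    print(f"lib{name}: {path or 'not found'}")
```

A probe like this cannot rescue an extension module that was already linked against a missing library: as the traceback shows, the dynamic loader fails while importing pandas, before any Python-level check could run.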

Alvaro-Kothe · 2026-03-22T22:00:55Z

In the example above I purposely built with clang because it links against libomp. If I had compiled with gcc, it would have linked against libgomp, which is available in the container:

root@dc5d67e6a666:/dist# find /usr -name 'libgomp.so*' -type f
/usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0

jbrockmendel · 2026-03-23T21:13:48Z

Essentially no one manually compiles. If that's the only way to enable openmp, it will go unused.

when openmp is unavailable cython.prange falls back to regular-range. I don't know the details of how/when that check occurs.

Alvaro-Kothe · 2026-03-23T21:36:26Z

> Essentially no one manually compiles. If that's the only way to enable OpenMP, it will go unused.

I've updated to automatically use OpenMP when available. However, we still need to decide how to handle requiring OpenMP. I came across a helpful discussion in LightGBM on this topic:
lightgbm-org/LightGBM#7141.

Several projects also use OpenMP; here’s how they distribute it:

- LightGBM: OpenMP is a required system dependency.
- XGBoost: Vendors the shared object in the wheel.
- PyTorch: Vendors the shared object in the wheel.
- scikit-learn: Vendors the shared object in the wheel.

Alvaro-Kothe · 2026-03-23T23:02:34Z

> when openmp is unavailable cython.prange falls back to regular-range. I don't know the details of how/when that check occurs.

@jbrockmendel I just verified with prange only. It chooses between range and prange at compile time, but OpenMP is still a runtime dependency, so the error that I've shown in #64582 (comment) is reproducible.

Alvaro-Kothe · 2026-03-23T23:12:12Z

I downloaded one of the Linux wheels, and libgomp is bundled in it.

$ unzip -l pandas-3.1.0.dev0+491.g0c9957a9bc-cp314-cp314-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl "**/libgo*"
Archive:  pandas-3.1.0.dev0+491.g0c9957a9bc-cp314-cp314-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
   253289  03-23-2026 22:24   pandas.libs/libgomp-e985bcbb.so.1.0.0
---------                     -------
   253289                     1 file

And the rpath is correctly set:

$ unzip pandas-3.1.0.dev0+491.g0c9957a9bc-cp314-cp314-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl "**/libgo*" "**/algos.cp*" -d pandas
$ ldd pandas/pandas/_libs/algos.cpython-314-x86_64-linux-gnu.so
        linux-vdso.so.1 (0x00007f43c2181000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f43c1e70000)
        libgomp-e985bcbb.so.1.0.0 => /home/alvaro/Downloads/pandas/pandas/_libs/../../pandas.libs/libgomp-e985bcbb.so.1.0.0 (0x00007f43c1c00000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f43c1a0d000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f43c2183000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f43c1e6c000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f43c1e68000)

Alvaro-Kothe · 2026-03-23T23:50:13Z

On Windows, it vendors vcomp140-*.dll.

Mac doesn't find OpenMP; it would probably have to be installed with Homebrew. xref: #64541

WillAyd

I would also advise dropping openmp, particularly if the most measurable speedup comes from SIMD

WillAyd · 2026-03-24T15:25:26Z

+#endif // x86_64 + glibc + target_clones
+
+/* --- SIMD Implementation --- */
+#if __has_attribute(ext_vector_type)


I would advise against having this detection logic built into pandas - I think support can be more complicated than this.

Meson has a built-in function for detecting SIMD support that we should use instead. See https://mesonbuild.com/Simd-module.html

Alvaro-Kothe · 2026-03-24

> I would advise against having this detection logic built into pandas - I think support can be more complicated than this.

I like the vector extensions provided by clang and gcc because they abstract away the architecture and the CPU capabilities (with runtime dispatch provided by target_clones). So with a single implementation, we have vectorized code for x86, x86_64, ARM and PowerPC. The drawback is that they aren't supported by MSVC.

> Meson has a built-in function for detecting SIMD support that we should use instead. See https://mesonbuild.com/Simd-module.html

Can you clarify what you have in mind? From what I've seen in the usage, we would have to provide an implementation for each target and also a function to check for CPU capability.

Alvaro-Kothe · 2026-03-24

> I would also advise dropping openmp, particularly if the most measurable speedup comes from SIMD

The performance increase from both together is considerable: SIMD alone gave a 2-4x speedup, while SIMD and OpenMP together gave a 3-12x speedup.

I think that a middle ground is to not build the wheel with OpenMP, or to leave OpenMP disabled by default.
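
For reference, the speedup ranges quoted in this reply follow from the asv ratios as before / after. The short sketch below recomputes a few of them; the timings are copied from the AVX2 and AVX2 + OpenMP tables in the PR description and converted to microseconds, and `speedup` is just an illustrative helper.

```python
def speedup(before, after):
    """Speedup factor for two timings expressed in the same unit."""
    return before / after


# (label, before in μs, after in μs), taken from the PR description tables
rows = [
    ("AVX2, NanOps skew 1000000 float64", 8400.0, 2340.0),
    ("AVX2, FrameOps kurt Int64 None", 4620.0, 2070.0),
    ("AVX2 + OpenMP, FrameOps skew float None", 3550.0, 299.0),
    ("AVX2 + OpenMP, FrameOps kurt int None", 4500.0, 1500.0),
]

for label, before, after in rows:
    ratio = after / before  # the "Ratio" column reported by asv
    print(f"{label}: ratio {ratio:.2f}, speedup {speedup(before, after):.1f}x")
```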

jbrockmendel · 2026-03-24

is there any reason why this has to be all at once?

ENH: Add OpenMP detection for cython.prange support #64541 is intended to make prange Just Work from cython. Would that also make it Just Work directly in C?

could we do the simd-part of the implementation in C and iterate over columns using prange in cython? To the extent we can keep logic in cython (without a perf hit), that makes maintenance easier.

Alvaro-Kothe · 2026-03-24

> is there any reason why this has to be all at once?

It doesn't have to be all at once. I can split this PR if necessary.

> ENH: Add OpenMP detection for cython.prange support #64541 is intended to make prange Just Work from cython. Would that also make it Just Work directly in C?

Yes. What matters is compiling with -fopenmp.

> could we do the simd-part of the implementation in C and iterate over columns using prange in cython? To the extent we can keep logic in cython (without a perf hit), that makes maintenance easier.

Yes, and it's also possible to choose which loop to parallelize. OpenMP, by default, uses a maximum parallelization level of 1. If the outer loop runs in parallel, calling this function from it won't spawn any more threads. If the outer loop doesn't run in parallel, this function can still run in parallel. It's possible to control this through the if clause.

Just to clarify, this PR only touches the scalar reduction. The performance increase in stat_ops.FrameOps.time_op('skew', 'Int64', 0) was because the compiler decided to call the scalar reduction. I didn't modify the logic from accumulate_moments_axis.
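
The nesting behaviour described in this reply can be mimicked with plain Python threads. The sketch below is only an analogy, not OpenMP: a thread-local flag plays the role of the "already inside a parallel region" check, and all function names are made up for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_region = threading.local()  # per-thread flag: already inside a parallel region?


def reduce_column(column, workers=2):
    """Inner reduction: runs in parallel only when not already nested."""
    if getattr(_region, "active", False):
        return sum(column)  # nested call: stay serial, like a second parallel level
    chunks = [column[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, chunks))


def _nested_task(column):
    """Run the inner reduction from within the outer parallel loop."""
    _region.active = True
    try:
        return reduce_column(column)
    finally:
        _region.active = False


def reduce_frame(columns, parallel_outer):
    """Outer loop over columns, optionally parallel."""
    if parallel_outer:
        with ThreadPoolExecutor(max_workers=len(columns)) as pool:
            return list(pool.map(_nested_task, columns))
    return [reduce_column(column) for column in columns]


columns = [[1, 2, 3, 4], [10, 20, 30, 40]]
outer = reduce_frame(columns, parallel_outer=True)   # columns in parallel, sums serial
inner = reduce_frame(columns, parallel_outer=False)  # columns serial, sums in parallel
```

Either way only one level does the threading, so the two modes give the same result while keeping the total number of worker threads bounded.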

WillAyd · 2026-03-24

> Can you clarify what you have in mind? From what I've seen in the usage, we would have to provide an implementation for each target and also a function to check for cpu capability.

Hmm OK - I'm not very familiar with that gcc intrinsic; I would check upstream in Meson to see if anyone has used that alongside the simd module. As a convention though, we have only ever really stuck to the C standard for writing C extensions in pandas. That could leave some performance on the table, but we don't have a good infrastructure for heavily customized C/C++ development.

The pattern described in the documentation is fairly common though; you can see that Arrow does something similar with their compute modules:

https://github.com/apache/arrow/tree/main/cpp/src/arrow/compute/kernels

Alvaro-Kothe · 2026-03-28

@WillAyd I did some experimentation with meson's simd module on #64905 and one of the problems that I found is that it doesn't detect neon support (mesonbuild/meson#11209) and it doesn't support AVX512 instructions (mesonbuild/meson#2085).

Alvaro-Kothe · 2026-04-03

> I'm not very familiar with that gcc intrinsic; I would check upstream in Meson to see if anyone has used that alongside the simd module.

I didn't manage to find any such usage. It also seems that the simd module cannot be used with the target_clones and target attributes, because all versions of the function must be in the same translation unit.

It may be possible to use the simd module with the vector extensions alone, but we would need to manage the naming, multiversioning, linking and runtime dispatch ourselves.
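
To give a feel for what managing the runtime dispatch by hand involves, here is a small sketch of the pattern: keep one implementation per CPU feature and pick one once, at startup. Everything in it is hypothetical (the function names, the feature table, and reading `/proc/cpuinfo`, which only exists on Linux); a C version would query the CPU directly, for example with GCC and clang's `__builtin_cpu_supports` on x86.

```python
def moments_default(values):
    """Stand-in for the baseline implementation."""
    return sum(values)


def moments_avx2(values):
    """Stand-in for an AVX2-specialised implementation (same result, other code path)."""
    return sum(values)


# Ordered from most to least specific; the last entry needs no CPU feature.
IMPLEMENTATIONS = [("avx2", moments_avx2), (None, moments_default)]


def read_cpu_flags(path="/proc/cpuinfo"):
    """Best-effort CPU feature set on Linux; empty set if it cannot be read."""
    try:
        with open(path) as handle:
            for line in handle:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()


def select_implementation(flags):
    """Return the first implementation whose required feature is available."""
    for feature, func in IMPLEMENTATIONS:
        if feature is None or feature in flags:
            return func
    raise RuntimeError("no implementation available")


# Resolve once; later calls go straight to the chosen function.
accumulate = select_implementation(read_cpu_flags())
```

With target_clones the compiler generates the equivalent of this table and selection step automatically, which is the convenience the vector-extension approach in this PR relied on.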

Alvaro-Kothe · 2026-04-29T21:46:13Z

Closing in favour of more portable approaches across different compilers.

Alvaro-Kothe mentioned this pull request Mar 13, 2026

BUG: combine all implementations of kurtosis and skewness computations and align results with SciPy #64366

Merged

10 tasks

Alvaro-Kothe force-pushed the perf/skew-kurt-omp branch 3 times, most recently from a0db203 to dbec8ab Compare March 18, 2026 01:30

Alvaro-Kothe mentioned this pull request Mar 18, 2026

Commit 027ea1edde4ee1de97c89cca4f782a00d1955375 pandas-dev/asv-runner#110

Closed

12 tasks

Alvaro-Kothe force-pushed the perf/skew-kurt-omp branch from 986c2da to e4a2b9f Compare March 20, 2026 18:01

Alvaro-Kothe marked this pull request as ready for review March 22, 2026 16:29

Alvaro-Kothe changed the title ~~PERF: [POC] compute skew and kurtosis with multithreading~~ PERF: compute skew and kurtosis with multithreading and SIMD Mar 22, 2026

Alvaro-Kothe added the Build Library building on various platforms label Mar 23, 2026

Alvaro-Kothe force-pushed the perf/skew-kurt-omp branch from b530236 to e6e10fc Compare March 23, 2026 22:18

WillAyd requested changes Mar 24, 2026

This was referenced Mar 28, 2026

PERF: Add SIMD instructions with xsimd to reduce moments #64905

Open

PERF: Use SIMD for read_csv C tokenizer #64515

Open

Alvaro-Kothe force-pushed the perf/skew-kurt-omp branch 2 times, most recently from 2459163 to fe10ef5 Compare April 3, 2026 02:18

Alvaro-Kothe added 27 commits April 29, 2026 18:41

fix: fix musl build

e35e337

fix: remove unnecessary undef

7c5b6e9

chore: remove unused include

39fcf5f

fix: check for vector attribute

0406565

fix: improve typedef for different platforms

bc951a4

chore: clarify vectorized in gcc and undefine v_select

c2c832a

perf: vectorize mask handling

1d7e108

chore: remove inlines in favor of compiler discretion

40fd153

chore: comments

7d3f286

fix: fix loop bounds and avoid unnecessary round

6f9224e

chore: add comments after endif

f3ba6ab

refactor: use size_t instead of signed integer for array size

18566d2

fix: cast type to size_t to avoid conversion warnings

9254c3c

add note about performance

933db46

feat: make OpenMP optional and disabled by default

5acaa4c

refactor: use feature instead of boolean for openmp

d62ac23

perf: use OpenMP automatically

bc43aaa

fix: use meson.options to fix the build error

baf66c4

docs(moments): remove outdated comment about MSVC

893796f

fix: don't force compiler for vector extension

57d3cb1

chore: better variable names in aggregations.pyx

10e14e5

refactor: simplify diff

1f0953c

perf: remove OpenMP

f6c2e2d

fix: use long instead of int64_t for apple clang

58ba947

fix: fix some -Wconversion warnings

7c2e83e

fix: fix types for 32 bit

6269bdc

revert test changes

2521665

Alvaro-Kothe force-pushed the perf/skew-kurt-omp branch from f381ccd to 2521665 Compare April 29, 2026 21:45

Alvaro-Kothe closed this Apr 29, 2026
