PERF: [POC] compute skew and kurtosis with SIMD Using Vector Extensions#64582

Closed
Alvaro-Kothe wants to merge 35 commits into pandas-dev:main from Alvaro-Kothe:perf/skew-kurt-omp

@Alvaro-Kothe
Member

@Alvaro-Kothe Alvaro-Kothe commented Mar 13, 2026

Continuation of #64366; Closes pandas-dev/asv-runner#110

This PR improves the performance of the moments accumulator through parallelization (with opt-in OpenMP) and SIMD (clang and gcc only).


  • OpenMP provides the parallelization. It is used if detected and can be disabled with -Dopenmp=disabled.
    • xgboost uses OpenMP, but vendors it in the wheel.
    • NumPy (or SciPy) does something similar and vendors OpenBLAS in the wheel.
  • The number of OpenMP threads can be controlled with threadpoolctl.
  • SIMD only works when compiling with clang or gcc, because it relies on the vector extensions provided by those compilers.
    • For x86_64, there are two versions of this function: one with AVX2 and the default.
      • The version is chosen at runtime depending on CPU capability.
      • I think that the default is similar to the x86_64 option:

      A generic CPU with 64-bit extensions, MMX, SSE, SSE2, and FXSR instruction set support.

    • The other targets get a single, but still vectorized, version: aarch64 uses NEON and x86 uses SSE2. For example, here is part of the assembly generated for x86 (podman run --rm -it -v $(pwd):/src:z -w /src quay.io/pypa/manylinux_2_28_i686 gcc -S -m32 -Ipandas/_libs/include pandas/_libs/src/moments.c -O2 -fverbose-asm -o moments_x86.s):
    # pandas/_libs/src/moments.c:89:     v_n += v_n_increment;
    movapd	480(%esp), %xmm1	#, tmp600
    addpd	768(%esp), %xmm1	# v_n, _506
    movapd	496(%esp), %xmm0	#, tmp601
    addpd	784(%esp), %xmm0	# v_n, _509
  • MSVC doesn't have vector extensions, so to provide a SIMD version there we would have to implement it with intrinsics (for a lot of different architectures and CPU capabilities), use a SIMD wrapper, or write a version that MSVC can hopefully auto-vectorize.
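
The vector-extension approach described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in pandas/_libs/src/moments.c: the `v4d` type, the `power_sums` helper, and the choice of accumulating raw power sums are assumptions made for the example. It uses the `vector_size` attribute, which both gcc and clang accept.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four doubles per vector. gcc and clang lower operations on this type to
   whatever the target offers (SSE2, AVX2, NEON, ...). */
typedef double v4d __attribute__((vector_size(32)));

/* Hypothetical helper: writes sum(x), sum(x^2), sum(x^3) into out[0..2]. */
static void power_sums(const double *x, size_t n, double out[3]) {
  assert(x != NULL || n == 0);
  v4d s1 = {0.0, 0.0, 0.0, 0.0};
  v4d s2 = s1;
  v4d s3 = s1;
  size_t i = 0;

  /* Main loop: four elements per iteration, one per lane. */
  for (; i + 4 <= n; i += 4) {
    v4d v;
    memcpy(&v, x + i, sizeof v); /* load without assuming alignment */
    v4d v2 = v * v;
    s1 += v;
    s2 += v2;
    s3 += v2 * v;
  }

  /* Horizontal reduction: add the four lanes of each accumulator. */
  out[0] = s1[0] + s1[1] + s1[2] + s1[3];
  out[1] = s2[0] + s2[1] + s2[2] + s2[3];
  out[2] = s3[0] + s3[1] + s3[2] + s3[3];

  /* Scalar tail for the remaining n % 4 elements. */
  for (; i < n; i++) {
    double v = x[i];
    out[0] += v;
    out[1] += v * v;
    out[2] += v * v * v;
  }
}
```

Because each lane keeps its own partial sum, the additions happen in a different order than in a sequential loop, so results may differ from a scalar version in the last bits.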

Benchmarks

AVX2 Benchmark

Change Before [c6ee315] After [69e47c2] <perf/skew-kurt-omp> Ratio Benchmark (Parameter)
- 80.3±0.3μs 71.6±1μs 0.89 groupby.GroupByMethods.time_dtype_as_field('int16', 'skew', 'direct', 1, 'cython')
- 81.8±0.3μs 72.7±2μs 0.89 groupby.GroupByMethods.time_dtype_as_group('int', 'skew', 'direct', 1, 'cython')
- 81.7±0.1μs 72.7±1μs 0.89 groupby.GroupByMethods.time_dtype_as_group('uint', 'skew', 'direct', 1, 'cython')
- 84.0±20μs 73.6±0.7μs 0.88 groupby.GroupByMethods.time_dtype_as_group('int16', 'skew', 'direct', 1, 'cython')
- 83.2±3μs 72.6±1μs 0.87 groupby.GroupByMethods.time_dtype_as_field('int', 'skew', 'direct', 1, 'cython')
- 85.6±2μs 72.9±1μs 0.85 groupby.GroupByMethods.time_dtype_as_group('float', 'skew', 'direct', 1, 'cython')
- 21.2±0.7μs 15.1±0.07μs 0.71 series_methods.NanOps.time_func('skew', 1000, 'boolean')
- 21.9±0.6μs 15.4±0.07μs 0.7 series_methods.NanOps.time_func('kurt', 1000, 'Int64')
- 17.6±0.07μs 12.4±0.2μs 0.7 series_methods.NanOps.time_func('kurt', 1000, 'int64')
- 17.8±0.06μs 12.3±0.4μs 0.69 series_methods.NanOps.time_func('kurt', 1000, 'float64')
- 17.6±0.07μs 12.0±0.5μs 0.68 series_methods.NanOps.time_func('kurt', 1000, 'int32')
- 17.4±0.04μs 11.8±0.3μs 0.68 series_methods.NanOps.time_func('skew', 1000, 'int32')
- 17.6±0.2μs 12.0±0.2μs 0.68 series_methods.NanOps.time_func('skew', 1000, 'int64')
- 17.4±0.03μs 11.8±0.1μs 0.68 series_methods.NanOps.time_func('skew', 1000, 'int8')
- 17.6±0.1μs 11.8±0.1μs 0.67 series_methods.NanOps.time_func('kurt', 1000, 'int8')
- 17.9±0.2μs 12.0±0.3μs 0.67 series_methods.NanOps.time_func('skew', 1000, 'float64')
- 28.1±9μs 15.5±0.5μs 0.55 series_methods.NanOps.time_func('kurt', 1000, 'boolean')
- 4.62±0.02ms 2.07±0.05ms 0.45 stat_ops.FrameOps.time_op('kurt', 'Int64', None)
- 4.52±0.01ms 1.98±0.01ms 0.44 stat_ops.FrameOps.time_op('skew', 'Int64', None)
- 4.47±0.01ms 1.93±0.01ms 0.43 stat_ops.FrameOps.time_op('kurt', 'int', None)
- 4.37±0.04ms 1.82±0ms 0.42 stat_ops.FrameOps.time_op('skew', 'int', None)
- 4.31±0ms 1.70±0.02ms 0.39 stat_ops.FrameOps.time_op('kurt', 'Int64', 0)
- 9.72±0.01ms 3.70±0.02ms 0.38 series_methods.NanOps.time_func('kurt', 1000000, 'Int64')
- 4.24±0.04ms 1.60±0.01ms 0.38 stat_ops.FrameOps.time_op('skew', 'Int64', 0)
- 9.68±0.03ms 3.47±0.04ms 0.36 series_methods.NanOps.time_func('kurt', 1000000, 'int64')
- 9.44±0.01ms 3.44±0.01ms 0.36 series_methods.NanOps.time_func('skew', 1000000, 'Int64')
- 9.39±0.06ms 3.35±0.03ms 0.36 series_methods.NanOps.time_func('skew', 1000000, 'int64')
- 9.39±0.03ms 3.31±0.01ms 0.35 series_methods.NanOps.time_func('kurt', 1000000, 'boolean')
- 4.08±0.01ms 1.41±0ms 0.35 stat_ops.FrameOps.time_op('kurt', 'int', 0)
- 3.95±0.01ms 1.39±0.01ms 0.35 stat_ops.FrameOps.time_op('skew', 'int', 0)
- 9.08±0.01ms 3.07±0.02ms 0.34 series_methods.NanOps.time_func('skew', 1000000, 'boolean')
- 9.12±0.02ms 3.13±0.02ms 0.34 series_methods.NanOps.time_func('skew', 1000000, 'int32')
- 9.70±0.3ms 3.21±0.03ms 0.33 series_methods.NanOps.time_func('kurt', 1000000, 'int32')
- 9.32±0.01ms 3.11±0.01ms 0.33 series_methods.NanOps.time_func('kurt', 1000000, 'int8')
- 9.01±0.01ms 2.98±0.02ms 0.33 series_methods.NanOps.time_func('skew', 1000000, 'int8')
- 973±4μs 307±1μs 0.32 stat_ops.SeriesOps.time_op('kurt', 'int')
- 939±3μs 298±0.8μs 0.32 stat_ops.SeriesOps.time_op('skew', 'int')
- 3.87±0.02ms 1.17±0.03ms 0.3 stat_ops.FrameMultiIndexOps.time_op('kurt')
- 3.71±0.01ms 1.13±0ms 0.3 stat_ops.FrameMultiIndexOps.time_op('skew')
- 3.84±0.01ms 1.17±0.01ms 0.3 stat_ops.FrameOps.time_op('kurt', 'float', 0)
- 3.69±0.02ms 1.11±0.01ms 0.3 stat_ops.FrameOps.time_op('skew', 'float', 0)
- 8.74±0.01ms 2.44±0.01ms 0.28 series_methods.NanOps.time_func('kurt', 1000000, 'float64')
- 8.40±0.01ms 2.34±0ms 0.28 series_methods.NanOps.time_func('skew', 1000000, 'float64')
- 3.67±0ms 1.02±0ms 0.28 stat_ops.FrameOps.time_op('kurt', 'float', None)
- 911±5μs 251±0.7μs 0.28 stat_ops.SeriesMultiIndexOps.time_op('kurt')
- 878±7μs 242±1μs 0.28 stat_ops.SeriesOps.time_op('skew', 'float')
- 3.62±0.09ms 971±3μs 0.27 stat_ops.FrameOps.time_op('skew', 'float', None)
- 884±10μs 242±1μs 0.27 stat_ops.SeriesMultiIndexOps.time_op('skew')
- 910±3μs 249±1μs 0.27 stat_ops.SeriesOps.time_op('kurt', 'float')

AVX2 + OpenMP Benchmark

Change Before [c6ee315] After [36356568] <perf/skew-kurt-omp> Ratio Benchmark (Parameter)
- 81.8±0.4μs 72.1±1μs 0.88 groupby.GroupByMethods.time_dtype_as_field('int', 'skew', 'direct', 1, 'cython')
- 81.1±0.6μs 71.1±0.7μs 0.88 groupby.GroupByMethods.time_dtype_as_field('int16', 'skew', 'direct', 1, 'cython')
- 60.1±2μs 52.3±1μs 0.87 groupby.GroupByMethods.time_dtype_as_field('float', 'skew', 'direct', 1, 'cython')
- 83.3±2μs 71.6±1μs 0.86 groupby.GroupByMethods.time_dtype_as_field('uint', 'skew', 'direct', 1, 'cython')
- 82.5±0.6μs 71.0±0.3μs 0.86 groupby.GroupByMethods.time_dtype_as_group('uint', 'skew', 'direct', 1, 'cython')
- 84.8±2μs 71.7±0.2μs 0.85 groupby.GroupByMethods.time_dtype_as_group('float', 'skew', 'direct', 1, 'cython')
- 85.0±2μs 71.3±0.2μs 0.84 groupby.GroupByMethods.time_dtype_as_group('int', 'skew', 'direct', 1, 'cython')
- 88.3±1μs 71.4±0.4μs 0.81 groupby.GroupByMethods.time_dtype_as_group('int16', 'skew', 'direct', 1, 'cython')
- 21.8±0.3μs 17.4±0.08μs 0.8 series_methods.NanOps.time_func('kurt', 1000, 'Int64')
- 17.6±0.1μs 14.2±0.5μs 0.8 series_methods.NanOps.time_func('kurt', 1000, 'float64')
- 17.7±0.1μs 14.1±0.4μs 0.8 series_methods.NanOps.time_func('kurt', 1000, 'int8')
- 21.4±0.1μs 17.0±0.2μs 0.8 series_methods.NanOps.time_func('skew', 1000, 'Int64')
- 17.6±0.08μs 13.9±0.6μs 0.79 series_methods.NanOps.time_func('kurt', 1000, 'int32')
- 21.6±0.4μs 17.0±0.1μs 0.79 series_methods.NanOps.time_func('skew', 1000, 'boolean')
- 21.9±0.6μs 17.0±0.1μs 0.77 series_methods.NanOps.time_func('kurt', 1000, 'boolean')
- 17.5±0.1μs 13.5±0.3μs 0.77 series_methods.NanOps.time_func('skew', 1000, 'float64')
- 18.2±0.5μs 13.8±0.2μs 0.76 series_methods.NanOps.time_func('kurt', 1000, 'int64')
- 17.5±0.1μs 13.3±0.2μs 0.76 series_methods.NanOps.time_func('skew', 1000, 'int32')
- 17.7±0.08μs 13.5±0.2μs 0.76 series_methods.NanOps.time_func('skew', 1000, 'int64')
- 18.4±0.09μs 13.6±0.1μs 0.74 series_methods.NanOps.time_func('skew', 1000, 'int8')
- 4.50±0.03ms 1.50±0.02ms 0.33 stat_ops.FrameOps.time_op('kurt', 'int', None)
- 4.66±0.1ms 1.47±0.02ms 0.32 stat_ops.FrameOps.time_op('kurt', 'Int64', None)
- 4.37±0.02ms 1.38±0.01ms 0.32 stat_ops.FrameOps.time_op('skew', 'int', None)
- 4.50±0.01ms 1.36±0.01ms 0.3 stat_ops.FrameOps.time_op('skew', 'Int64', None)
- 4.31±0.01ms 1.21±0ms 0.28 stat_ops.FrameOps.time_op('kurt', 'Int64', 0)
- 4.24±0.02ms 1.10±0.01ms 0.26 stat_ops.FrameOps.time_op('skew', 'Int64', 0)
- 9.73±0.04ms 2.20±0ms 0.23 series_methods.NanOps.time_func('kurt', 1000000, 'Int64')
- 9.42±0.01ms 2.05±0.7ms 0.22 series_methods.NanOps.time_func('skew', 1000000, 'Int64')
- 4.10±0.02ms 921±3μs 0.22 stat_ops.FrameOps.time_op('kurt', 'int', 0)
- 9.69±0.02ms 2.08±0.02ms 0.21 series_methods.NanOps.time_func('kurt', 1000000, 'int64')
- 3.98±0.01ms 817±3μs 0.21 stat_ops.FrameOps.time_op('skew', 'int', 0)
- 967±3μs 189±5μs 0.2 stat_ops.SeriesOps.time_op('kurt', 'int')
- 9.42±0.02ms 1.76±0.02ms 0.19 series_methods.NanOps.time_func('kurt', 1000000, 'boolean')
- 9.44±0.01ms 1.77±0ms 0.19 series_methods.NanOps.time_func('kurt', 1000000, 'int32')
- 9.37±0.03ms 1.80±0.01ms 0.19 series_methods.NanOps.time_func('skew', 1000000, 'int64')
- 9.37±0.3ms 1.66±0.02ms 0.18 series_methods.NanOps.time_func('kurt', 1000000, 'int8')
- 944±4μs 160±0.6μs 0.17 stat_ops.SeriesOps.time_op('skew', 'int')
- 9.11±0.03ms 1.49±0.02ms 0.16 series_methods.NanOps.time_func('skew', 1000000, 'boolean')
- 9.12±0.03ms 1.49±0ms 0.16 series_methods.NanOps.time_func('skew', 1000000, 'int32')
- 8.99±0.01ms 1.37±0.01ms 0.15 series_methods.NanOps.time_func('skew', 1000000, 'int8')
- 3.86±0.03ms 553±7μs 0.14 stat_ops.FrameMultiIndexOps.time_op('kurt')
- 3.81±0ms 544±6μs 0.14 stat_ops.FrameOps.time_op('kurt', 'float', 0)
- 3.72±0.01ms 446±4μs 0.12 stat_ops.FrameMultiIndexOps.time_op('skew')
- 3.69±0.01ms 436±6μs 0.12 stat_ops.FrameOps.time_op('skew', 'float', 0)
- 913±4μs 111±1μs 0.12 stat_ops.SeriesMultiIndexOps.time_op('kurt')
- 911±2μs 110±0.9μs 0.12 stat_ops.SeriesOps.time_op('kurt', 'float')
- 8.73±0.01ms 970±10μs 0.11 series_methods.NanOps.time_func('kurt', 1000000, 'float64')
- 3.67±0ms 404±6μs 0.11 stat_ops.FrameOps.time_op('kurt', 'float', None)
- 880±0.6μs 83.9±0.8μs 0.1 stat_ops.SeriesMultiIndexOps.time_op('skew')
- 875±1μs 84.5±4μs 0.1 stat_ops.SeriesOps.time_op('skew', 'float')
- 8.41±0.02ms 732±1000μs 0.09 series_methods.NanOps.time_func('skew', 1000000, 'float64')
- 3.55±0.01ms 299±2μs 0.08 stat_ops.FrameOps.time_op('skew', 'float', None)

@jbrockmendel
Member

#64541 was motivated by something similar. Does doing this in C instead of python make that unnecessary?

@Alvaro-Kothe
Member Author

#64541 was motivated by something similar. Does doing this in C instead of python make that unnecessary?

Doing this in C doesn't make #64541 unnecessary. From what I've glimpsed in mesonbuild/meson#13350, meson>=1.5 can use an OpenMP installed through Homebrew when compiling with Apple's clang.

@jbrockmendel
Member

How much of the perf bump comes from SIMD vs OpenMP? I suspect this will be harder to get merged with multithreading than without (though I'm on record as in favor of it).

How much effort would it take to get the speedups on Mac?

@Alvaro-Kothe
Member Author

How much of the perf bump comes from SIMD vs openmp?

I isolated each of them and ran the benchmarks (results in details). On my machine, vectorization with AVX2 provides the most performance benefit.

How much effort would it take to get the speedups on Mac?

I don't have an ARM machine, but I will try to either ask an AI to translate the AVX2 code to ARM, or use a SIMD wrapper.

Benchmark SIMD x OpenMP

Change Before [15f19a7f] <perf/simd-only> After [4619ebbf] <perf/openmp-only> Ratio Benchmark (Parameter)
+ 11.5±0.07μs 23.4±0.1μs 2.03 series_methods.NanOps.time_func('skew', 1000, 'float64')
+ 11.6±0.08μs 23.1±0.1μs 2 series_methods.NanOps.time_func('kurt', 1000, 'int8')
+ 11.6±0.07μs 23.1±0.06μs 1.99 series_methods.NanOps.time_func('kurt', 1000, 'int32')
+ 12.1±0.3μs 23.5±0.2μs 1.94 series_methods.NanOps.time_func('kurt', 1000, 'int64')
+ 11.5±0.05μs 22.3±0.2μs 1.94 series_methods.NanOps.time_func('skew', 1000, 'int64')
+ 12.1±0.2μs 23.2±0.05μs 1.92 series_methods.NanOps.time_func('kurt', 1000, 'float64')
+ 11.7±0.3μs 22.1±0.1μs 1.88 series_methods.NanOps.time_func('skew', 1000, 'int32')
+ 12.0±0.08μs 22.6±0.6μs 1.87 series_methods.NanOps.time_func('skew', 1000, 'int8')
+ 14.9±0.1μs 27.3±0.08μs 1.83 series_methods.NanOps.time_func('skew', 1000, 'Int64')
+ 14.7±0.1μs 26.9±0.05μs 1.83 series_methods.NanOps.time_func('skew', 1000, 'boolean')
+ 15.0±0.3μs 27.2±0.09μs 1.81 series_methods.NanOps.time_func('kurt', 1000, 'Int64')
+ 15.1±0.2μs 27.0±0.2μs 1.79 series_methods.NanOps.time_func('kurt', 1000, 'boolean')
+ 304±2μs 430±20μs 1.41 stat_ops.SeriesOps.time_op('kurt', 'int')
+ 248±1μs 341±2μs 1.38 stat_ops.SeriesMultiIndexOps.time_op('kurt')
+ 246±2μs 338±1μs 1.38 stat_ops.SeriesOps.time_op('kurt', 'float')
+ 1.64±0.02ms 2.22±0.01ms 1.35 stat_ops.FrameOps.time_op('kurt', 'Int64', 0)
+ 239±2μs 323±2μs 1.35 stat_ops.SeriesMultiIndexOps.time_op('skew')
+ 294±0.9μs 398±2μs 1.35 stat_ops.SeriesOps.time_op('skew', 'int')
+ 239±1μs 320±4μs 1.34 stat_ops.SeriesOps.time_op('skew', 'float')
+ 3.16±0.03ms 4.21±0.1ms 1.33 series_methods.NanOps.time_func('kurt', 1000000, 'boolean')
+ 1.00±0ms 1.33±0ms 1.33 stat_ops.FrameOps.time_op('kurt', 'float', None)
+ 1.40±0.01ms 1.87±0.01ms 1.33 stat_ops.FrameOps.time_op('kurt', 'int', 0)
+ 2.42±0ms 3.17±0.01ms 1.31 series_methods.NanOps.time_func('kurt', 1000000, 'float64')
+ 1.14±0.01ms 1.49±0ms 1.31 stat_ops.FrameOps.time_op('kurt', 'float', 0)
+ 3.57±0.06ms 4.63±0.01ms 1.3 series_methods.NanOps.time_func('kurt', 1000000, 'Int64')
+ 1.15±0ms 1.49±0.01ms 1.3 stat_ops.FrameMultiIndexOps.time_op('kurt')
+ 1.37±0ms 1.78±0.01ms 1.3 stat_ops.FrameOps.time_op('skew', 'int', 0)
+ 1.85±0.01ms 2.38±0.01ms 1.29 stat_ops.FrameOps.time_op('kurt', 'int', None)
+ 1.61±0.02ms 2.08±0.01ms 1.29 stat_ops.FrameOps.time_op('skew', 'Int64', 0)
+ 966±3μs 1.25±0ms 1.29 stat_ops.FrameOps.time_op('skew', 'float', None)
+ 1.09±0ms 1.40±0.01ms 1.28 stat_ops.FrameOps.time_op('skew', 'float', 0)
+ 2.29±0.01ms 2.92±0.01ms 1.27 series_methods.NanOps.time_func('skew', 1000000, 'float64')
+ 1.10±0ms 1.40±0ms 1.27 stat_ops.FrameMultiIndexOps.time_op('skew')
+ 1.81±0ms 2.30±0.01ms 1.27 stat_ops.FrameOps.time_op('skew', 'int', None)
+ 3.22±0.02ms 3.98±0.04ms 1.24 series_methods.NanOps.time_func('kurt', 1000000, 'int32')
+ 3.38±0.01ms 4.20±0ms 1.24 series_methods.NanOps.time_func('skew', 1000000, 'Int64')
+ 3.02±0.01ms 3.76±0.01ms 1.24 series_methods.NanOps.time_func('skew', 1000000, 'boolean')
+ 3.45±0.03ms 4.24±0.01ms 1.23 series_methods.NanOps.time_func('kurt', 1000000, 'int64')
+ 3.10±0.02ms 3.81±0.01ms 1.23 series_methods.NanOps.time_func('kurt', 1000000, 'int8')
+ 2.95±0.01ms 3.60±0.03ms 1.22 series_methods.NanOps.time_func('skew', 1000000, 'int8')
+ 3.06±0.02ms 3.71±0.01ms 1.21 series_methods.NanOps.time_func('skew', 1000000, 'int32')
+ 3.31±0.02ms 4.00±0.01ms 1.21 series_methods.NanOps.time_func('skew', 1000000, 'int64')
+ 2.04±0.01ms 2.45±0.1ms 1.2 stat_ops.FrameOps.time_op('kurt', 'Int64', None)
+ 1.96±0.01ms 2.32±0.02ms 1.18 stat_ops.FrameOps.time_op('skew', 'Int64', None)

@Alvaro-Kothe
Member Author

@jbrockmendel Seems to be working with macOS.

@Alvaro-Kothe Alvaro-Kothe marked this pull request as ready for review March 22, 2026 16:29
@Alvaro-Kothe Alvaro-Kothe changed the title from "PERF: [POC] compute skew and kurtosis with multithreading" to "PERF: compute skew and kurtosis with multithreading and SIMD" Mar 22, 2026
@jbrockmendel
Member

There's been discussion of how to opt in/out of parallelism, and I think we're converging on default-on with opt-out via a max_workers option.

@Alvaro-Kothe
Member Author

There's been discussion of how to opt in/out of parallelism, and I think we're converging on default-on with opt-out via a max_workers option.

This is great! But as it currently stands in this PR, OpenMP is a system dependency. If we compile with this dependency, the host must have OpenMP installed; otherwise there will be a runtime error.

@jbrockmendel
Member

#64541 is written with cython in mind, but the idea there is to always detect it if present rather than require the users to explicitly opt in

@Alvaro-Kothe
Member Author

I don't know if Cython provides a way to check at runtime whether OpenMP is available, but the following shows the problem with distributing pandas with OpenMP as a runtime dependency:

  ## Build wheel
$ CC=clang CXX=clang++ python -m build  --wheel -Cbuild-dir=build/clang -Csetup-args=-Duse_openmp=true .
  ## Create a container
$ podman run --rm -it -v $(pwd)/dist:/dist -w /dist docker.io/python:3.14 bash
  ## create venv
root@dc5d67e6a666:/dist# python -m venv venv
  ## Install the wheel
root@dc5d67e6a666:/dist# venv/bin/pip install pandas-3.1.0.dev0+422.g69e47c298f-cp314-cp314-linux_x86_64.whl
  ## print dependencies
root@dc5d67e6a666:/dist# ldd venv/lib64/python3.14/site-packages/pandas/_libs/algos.cpython-314-x86_64-linux-gnu.so
        linux-vdso.so.1 (0x00007f66a34cc000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f66a3233000)
        libomp.so => not found
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f66a303f000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f66a34ce000)
   ## Import pandas with missing libomp
root@dc5d67e6a666:/dist# venv/bin/python -c 'import pandas'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    import pandas
  File "/dist/venv/lib/python3.14/site-packages/pandas/__init__.py", line 44, in <module>
    import pandas.core.config_init  # pyright: ignore[reportUnusedImport] # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/dist/venv/lib/python3.14/site-packages/pandas/core/config_init.py", line 31, in <module>
    from pandas.errors import Pandas4Warning
  File "/dist/venv/lib/python3.14/site-packages/pandas/errors/__init__.py", line 12, in <module>
    from pandas._libs.tslibs import (
    ...<3 lines>...
    )
  File "/dist/venv/lib/python3.14/site-packages/pandas/_libs/__init__.py", line 18, in <module>
    from pandas._libs.interval import Interval
  File "pandas/_libs/intervaltree.pxi", line 7, in init pandas._libs.interval
ImportError: libomp.so: cannot open shared object file: No such file or directory

To me, it's safer to distribute without OpenMP as a runtime dependency. Another option is to vendor OpenMP and either distribute the libomp shared object or statically link it.

@Alvaro-Kothe
Member Author

In the example above I purposely built with clang because it links against libomp. If I had compiled with gcc, it would have linked against libgomp, which is available in the container:

root@dc5d67e6a666:/dist# find /usr -name 'libgomp.so*' -type f
/usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0

@jbrockmendel
Member

Essentially no one manually compiles. If that's the only way to enable OpenMP, it will go unused.

When OpenMP is unavailable, cython.prange falls back to a regular range. I don't know the details of how/when that check occurs.

@Alvaro-Kothe Alvaro-Kothe added the Build Library building on various platforms label Mar 23, 2026
@Alvaro-Kothe
Member Author

Essentially no one manually compiles. If that's the only way to enable OpenMP, it will go unused.

I've updated the PR to automatically use OpenMP when available. However, we still need to decide how to handle requiring OpenMP. I came across a helpful discussion in LightGBM on this topic: lightgbm-org/LightGBM#7141.

Several projects also use OpenMP; here’s how they distribute it:

  • LightGBM: OpenMP is a required system dependency.
  • XGBoost: Vendors the shared object in the wheel.
  • PyTorch: Vendors the shared object in the wheel.
  • scikit-learn: Vendors the shared object in the wheel.

@Alvaro-Kothe
Member Author

Alvaro-Kothe commented Mar 23, 2026

When OpenMP is unavailable, cython.prange falls back to a regular range. I don't know the details of how/when that check occurs.

@jbrockmendel I just verified with prange only. It chooses between range and prange at compile time, but OpenMP is still a runtime dependency, so the error that I've shown in #64582 (comment) is reproducible.
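
The same behaviour can be sketched in plain C. This is an illustration under assumptions, not the code Cython generates: the function names are hypothetical. `_OPENMP` is the macro that compilers define when OpenMP is enabled (for example with -fopenmp), so the choice between the serial and the parallel loop is fixed when the file is compiled, and a binary built with OpenMP enabled still needs the OpenMP runtime library when it is loaded.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if this translation unit was compiled with OpenMP enabled. */
static int built_with_openmp(void) {
#ifdef _OPENMP
  return 1;
#else
  return 0;
#endif
}

/* Hypothetical reduction: parallel when built with OpenMP, serial otherwise.
   There is no runtime probe for the library: the decision was already made
   by the preprocessor. */
static double sum_of_squares(const double *x, size_t n) {
  assert(x != NULL || n == 0);
  double total = 0.0;
  ptrdiff_t len = (ptrdiff_t)n;
#ifdef _OPENMP
#pragma omp parallel for reduction(+ : total)
#endif
  for (ptrdiff_t i = 0; i < len; i++) {
    total += x[i] * x[i];
  }
  return total;
}
```

Compiled without OpenMP, the pragma is skipped and the object file has no reference to libgomp or libomp. Compiled with it, the reference is there whether or not the parallel path is ever taken.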

@Alvaro-Kothe
Member Author

I downloaded one of the Linux wheels; libgomp is being bundled in the wheel.

$ unzip -l pandas-3.1.0.dev0+491.g0c9957a9bc-cp314-cp314-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl "**/libgo*"
Archive:  pandas-3.1.0.dev0+491.g0c9957a9bc-cp314-cp314-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
   253289  03-23-2026 22:24   pandas.libs/libgomp-e985bcbb.so.1.0.0
---------                     -------
   253289                     1 file

And the rpath is correctly set:

$ unzip pandas-3.1.0.dev0+491.g0c9957a9bc-cp314-cp314-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl "**/libgo*" "**/algos.cp*" -d pandas
$ ldd pandas/pandas/_libs/algos.cpython-314-x86_64-linux-gnu.so
        linux-vdso.so.1 (0x00007f43c2181000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f43c1e70000)
        libgomp-e985bcbb.so.1.0.0 => /home/alvaro/Downloads/pandas/pandas/_libs/../../pandas.libs/libgomp-e985bcbb.so.1.0.0 (0x00007f43c1c00000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f43c1a0d000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f43c2183000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f43c1e6c000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f43c1e68000)

@Alvaro-Kothe
Member Author

Alvaro-Kothe commented Mar 23, 2026

On Windows, it vendors vcomp140-*.dll.

macOS doesn't find OpenMP. It would probably have to be installed with Homebrew. xref: #64541

Member

@WillAyd WillAyd left a comment

I would also advise dropping openmp, particularly if the most measurable speedup comes from SIMD

#endif // x86_64 + glibc + target_clones

/* --- SIMD Implementation --- */
#if __has_attribute(ext_vector_type)

Member

I would advise against having this detection logic built into pandas - I think support can be more complicated than this.

Meson has a built-in function for detecting SIMD support that we should use instead. See https://mesonbuild.com/Simd-module.html

Member Author

I would advise against having this detection logic built into pandas - I think support can be more complicated than this.

I like the vector extensions provided by clang and gcc because they abstract away the architecture and the CPU capabilities (with runtime dispatch provided by target_clones). So with a single implementation, we have vectorized code for x86, x86_64, ARM and PowerPC. The drawback is that they aren't supported by MSVC.
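
A rough sketch of that combination, assuming a hypothetical `sum_of_cubes` function and a hypothetical `MULTIVERSIONED` macro (neither is taken from the PR): the body is written once with vector types, and `target_clones` asks the compiler to emit an AVX2 clone and a default clone plus a resolver that picks one when the program starts. The guard mirrors the x86_64 + glibc + target_clones condition quoted in the diff above; elsewhere the macro expands to nothing and a single version is built.

```c
#include <assert.h> /* on glibc this also makes __GLIBC__ visible */
#include <stddef.h>
#include <string.h>

#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

#if defined(__x86_64__) && defined(__GLIBC__) && __has_attribute(target_clones)
/* The compiler emits one clone per listed target and dispatches at load time. */
#define MULTIVERSIONED __attribute__((target_clones("avx2", "default")))
#else
#define MULTIVERSIONED
#endif

typedef double v4d __attribute__((vector_size(32)));

/* Hypothetical example function: one source body, several machine-code clones. */
MULTIVERSIONED
double sum_of_cubes(const double *x, size_t n) {
  assert(x != NULL || n == 0);
  v4d acc = {0.0, 0.0, 0.0, 0.0};
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    v4d v;
    memcpy(&v, x + i, sizeof v); /* load without assuming alignment */
    acc += v * v * v;
  }
  double total = acc[0] + acc[1] + acc[2] + acc[3];
  for (; i < n; i++) { /* scalar tail */
    total += x[i] * x[i] * x[i];
  }
  return total;
}
```

Callers just call `sum_of_cubes`; no explicit CPU check appears in the source.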

Meson has a built-in function for detecting SIMD support that we should use instead. See https://mesonbuild.com/Simd-module.html

Can you clarify what you have in mind? From what I've seen in the usage, we would have to provide an implementation for each target and also a function to check for cpu capability.

Member Author

I would also advise dropping openmp, particularly if the most measurable speedup comes from SIMD

The performance increase from both together is considerable: SIMD alone gave a 2-4x speedup, while SIMD and OpenMP together gave a 3-12x speedup.

I think that a middle ground is to not build the wheel with OpenMP, or to leave OpenMP disabled by default.

Member

  1. is there any reason why this has to be all at once?
  2. ENH: Add OpenMP detection for cython.prange support #64541 is intended to make prange Just Work from cython. Would that also make it Just Work directly in C?
  3. could we do the simd-part of the implementation in C and iterate over columns using prange in cython? To the extent we can keep logic in cython (without a perf hit), that makes maintenance easier.

Member Author

  1. is there any reason why this has to be all at once?

It doesn't have to be all at once. I can split this PR if necessary.

  2. ENH: Add OpenMP detection for cython.prange support #64541 is intended to make prange Just Work from cython. Would that also make it Just Work directly in C?

Yes. What matters is compiling with -fopenmp.

  3. could we do the simd-part of the implementation in C and iterate over columns using prange in cython? To the extent we can keep logic in cython (without a perf hit), that makes maintenance easier.

Yes, it's also possible to choose which one to parallelize. By default, OpenMP allows a maximum of one active level of parallelism. If the outer loop runs in parallel, calling this function won't spawn any more threads. If the outer loop doesn't run in parallel, this function can still run in parallel. This can be controlled through the if clause.
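
A minimal sketch of the `if` clause, assuming a hypothetical `sum_if_large` helper and a hypothetical `min_parallel_n` threshold (neither is from the PR). The clause is evaluated at run time: when it is false, the region runs on the calling thread only, which is one way to avoid paying the thread start-up cost on small arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums x[0..n), going parallel only when n is large. */
static double sum_if_large(const double *x, size_t n, size_t min_parallel_n) {
  assert(x != NULL || n == 0);
  (void)min_parallel_n; /* unused when built without OpenMP */
  double total = 0.0;
  ptrdiff_t len = (ptrdiff_t)n;
#ifdef _OPENMP
  /* When the condition is false the loop runs serially on the caller's thread.
     As described in the comment above, a call made from inside an already
     active parallel region is also not expected to spawn more threads. */
#pragma omp parallel for reduction(+ : total) if (n >= min_parallel_n)
#endif
  for (ptrdiff_t i = 0; i < len; i++) {
    total += x[i];
  }
  return total;
}
```

A caller might pass something like `sum_if_large(values, n, 100000)`; the right threshold is machine dependent and would have to be measured.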

Just to clarify, this PR only touches the scalar reduction. The performance increase in stat_ops.FrameOps.time_op('skew', 'Int64', 0) was because the compiler decided to call the scalar reduction. I didn't modify the logic from accumulate_moments_axis.

Member

Can you clarify what you have in mind? From what I've seen in the usage, we would have to provide an implementation for each target and also a function to check for cpu capability.

Hmm OK - I'm not very familiar with that gcc intrinsic; I would check upstream in Meson to see if anyone has used that alongside the simd module. As a convention though, we have only ever really stuck to the C standard for writing C extensions in pandas. That could leave some performance on the table, but we don't have a good infrastructure for heavily customized C/C++ development.

The pattern described in the documentation is fairly common though; you can see that Arrow does something similar with their compute modules:

https://github.com/apache/arrow/tree/main/cpp/src/arrow/compute/kernels

Member Author

@WillAyd I did some experimentation with meson's simd module on #64905 and one of the problems that I found is that it doesn't detect neon support (mesonbuild/meson#11209) and it doesn't support AVX512 instructions (mesonbuild/meson#2085).

Member Author

@Alvaro-Kothe Alvaro-Kothe Apr 3, 2026

I'm not very familiar with that gcc intrinsic; I would check upstream in Meson to see if anyone has used that alongside the simd module.

I didn't manage to find it. It also seems that it cannot be used with the target_clones and target attributes, because all versions of the function must be in the same translation unit.

It may be possible to use the simd module with the vector extensions alone, but we would need to manage the naming, multiversioning, linking and runtime dispatch ourselves.
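
To give an idea of what managing it by hand could involve, here is a small sketch with hypothetical names (`sum_sq_generic`, `sum_sq_avx2`, `select_sum_sq`). For brevity both variants sit in one file and use the `target` attribute; in a layout built around Meson's simd module, each variant would more likely live in its own source file compiled with its own flags. The CPU check uses `__builtin_cpu_supports`, a gcc/clang builtin available on x86.

```c
#include <assert.h>
#include <stddef.h>

typedef double (*sum_sq_fn)(const double *, size_t);

/* Baseline variant, compiled with the default target flags. */
static double sum_sq_generic(const double *x, size_t n) {
  assert(x != NULL || n == 0);
  double total = 0.0;
  for (size_t i = 0; i < n; i++) {
    total += x[i] * x[i];
  }
  return total;
}

#if defined(__x86_64__) && defined(__GNUC__)
#define HAVE_AVX2_VARIANT 1
/* Same source, but the compiler may use AVX2 instructions in this function. */
static __attribute__((target("avx2"))) double sum_sq_avx2(const double *x,
                                                          size_t n) {
  double total = 0.0;
  for (size_t i = 0; i < n; i++) {
    total += x[i] * x[i];
  }
  return total;
}
#endif

/* Manual runtime dispatch: return a variant the running CPU can execute. */
static sum_sq_fn select_sum_sq(void) {
#ifdef HAVE_AVX2_VARIANT
  if (__builtin_cpu_supports("avx2")) {
    return sum_sq_avx2;
  }
#endif
  return sum_sq_generic;
}
```

A caller would typically resolve the pointer once, for example `sum_sq_fn f = select_sum_sq();`, and reuse it for later calls.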

@Alvaro-Kothe
Member Author

Closing in favour of more portable approaches across different compilers.

Labels

Build Library building on various platforms

Development

Successfully merging this pull request may close these issues.

Commit 027ea1edde4ee1de97c89cca4f782a00d1955375

3 participants