Skip to content

PERF: Add SIMD instructions with xsimd to reduce moments#64905

Open
Alvaro-Kothe wants to merge 11 commits into
pandas-dev:mainfrom
Alvaro-Kothe:perf/skew-kurt-omp-xsimd
Open

PERF: Add SIMD instructions with xsimd to reduce moments#64905
Alvaro-Kothe wants to merge 11 commits into
pandas-dev:mainfrom
Alvaro-Kothe:perf/skew-kurt-omp-xsimd

Conversation

@Alvaro-Kothe
Copy link
Copy Markdown
Member

@Alvaro-Kothe Alvaro-Kothe commented Mar 28, 2026


Overview

This PR introduces SIMD for moment accumulation for skewness and kurtosis, using the xsimd library to wrap SIMD instructions and runtime dispatch. A meson option is added to disable SIMD, and CI verifies both the SIMD and scalar-only paths.

The algorithm is basically the Welford algorithm, computing central moments in a single pass.

The test test_stat_method had to be modified, because the order that the values are accumulated changed and due to float point arithmetic the test can't assert strict equality.

Benchmark

SSE2 Benchmark

Change Before [14dc0c6] After [32321c6] <perf/skew-kurt-omp-xsimd> Ratio Benchmark (Parameter)
- 15.9±0.3μs 13.0±0.5μs 0.82 series_methods.NanOps.time_func('kurt', 1000, 'int64')
- 15.3±0.4μs 12.5±0.2μs 0.82 series_methods.NanOps.time_func('skew', 1000, 'float64')
- 18.4±1μs 14.9±0.1μs 0.81 series_methods.NanOps.time_func('skew', 1000, 'Int64')
- 19.8±0.5μs 15.0±0.6μs 0.76 series_methods.NanOps.time_func('kurt', 1000, 'boolean')
- 15.4±0.2μs 11.7±0.4μs 0.76 series_methods.NanOps.time_func('skew', 1000, 'int64')
- 20.5±1μs 14.6±0.5μs 0.71 series_methods.NanOps.time_func('kurt', 1000, 'Int64')
- 16.4±0.3μs 11.6±0.3μs 0.71 series_methods.NanOps.time_func('skew', 1000, 'int8')
- 16.2±2μs 11.4±0.2μs 0.7 series_methods.NanOps.time_func('skew', 1000, 'int32')
- 17.4±1μs 12.0±0.3μs 0.69 series_methods.NanOps.time_func('kurt', 1000, 'float64')
- 4.38±0.01ms 2.80±0.05ms 0.64 stat_ops.FrameOps.time_op('kurt', 'Int64', None)
- 4.26±0.01ms 2.74±0.03ms 0.64 stat_ops.FrameOps.time_op('skew', 'Int64', None)
- 4.07±0.01ms 2.59±0.01ms 0.64 stat_ops.FrameOps.time_op('skew', 'int', None)
- 4.18±0.02ms 2.60±0.01ms 0.62 stat_ops.FrameOps.time_op('kurt', 'int', None)
- 8.85±0.02ms 5.28±0.02ms 0.6 series_methods.NanOps.time_func('skew', 1000000, 'Int64')
- 3.97±0.04ms 2.40±0.01ms 0.6 stat_ops.FrameOps.time_op('skew', 'Int64', 0)
- 9.21±0.01ms 5.43±0.03ms 0.59 series_methods.NanOps.time_func('kurt', 1000000, 'Int64')
- 9.08±0.04ms 5.35±0.04ms 0.59 series_methods.NanOps.time_func('kurt', 1000000, 'int64')
- 8.80±0.04ms 5.20±0.03ms 0.59 series_methods.NanOps.time_func('skew', 1000000, 'int64')
- 8.86±0.04ms 5.13±0.02ms 0.58 series_methods.NanOps.time_func('kurt', 1000000, 'boolean')
- 8.86±0.05ms 5.13±0.03ms 0.58 series_methods.NanOps.time_func('kurt', 1000000, 'int32')
- 8.53±0.01ms 4.94±0.01ms 0.58 series_methods.NanOps.time_func('skew', 1000000, 'boolean')
- 8.57±0.01ms 4.93±0.02ms 0.58 series_methods.NanOps.time_func('skew', 1000000, 'int32')
- 8.57±0.1ms 4.89±0.03ms 0.57 series_methods.NanOps.time_func('kurt', 1000000, 'int8')
- 8.48±0.03ms 4.82±0.04ms 0.57 series_methods.NanOps.time_func('skew', 1000000, 'int8')
- 4.29±0.1ms 2.45±0.03ms 0.57 stat_ops.FrameOps.time_op('kurt', 'Int64', 0)
- 3.83±0.01ms 2.19±0.01ms 0.57 stat_ops.FrameOps.time_op('kurt', 'int', 0)
- 3.74±0.01ms 2.15±0.02ms 0.57 stat_ops.FrameOps.time_op('skew', 'int', 0)
- 874±20μs 487±20μs 0.56 stat_ops.SeriesOps.time_op('skew', 'int')
- 3.46±0.02ms 1.89±0.02ms 0.55 stat_ops.FrameMultiIndexOps.time_op('skew')
- 911±6μs 504±10μs 0.55 stat_ops.SeriesOps.time_op('kurt', 'int')
- 7.87±0.01ms 4.25±0.05ms 0.54 series_methods.NanOps.time_func('kurt', 1000000, 'float64')
- 3.61±0.03ms 1.96±0.02ms 0.54 stat_ops.FrameMultiIndexOps.time_op('kurt')
- 3.57±0.02ms 1.92±0.01ms 0.54 stat_ops.FrameOps.time_op('kurt', 'float', 0)
- 3.43±0.02ms 1.87±0.01ms 0.54 stat_ops.FrameOps.time_op('skew', 'float', 0)
- 7.84±0.02ms 4.18±0.01ms 0.53 series_methods.NanOps.time_func('skew', 1000000, 'float64')
- 3.40±0.05ms 1.80±0.01ms 0.53 stat_ops.FrameOps.time_op('kurt', 'float', None)
- 3.33±0.02ms 1.76±0ms 0.53 stat_ops.FrameOps.time_op('skew', 'float', None)
- 765±1μs 402±3μs 0.53 stat_ops.SeriesOps.time_op('skew', 'float')
- 789±20μs 407±3μs 0.52 stat_ops.SeriesMultiIndexOps.time_op('skew')
- 824±10μs 417±1μs 0.51 stat_ops.SeriesOps.time_op('kurt', 'float')
- 851±10μs 412±7μs 0.48 stat_ops.SeriesMultiIndexOps.time_op('kurt')

@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch from 7c00f3e to e78fb18 Compare March 28, 2026 13:37
@jbrockmendel
Copy link
Copy Markdown
Member

What does this do to wheel size and import-time memory footprint?

@Alvaro-Kothe
Copy link
Copy Markdown
Member Author

Alvaro-Kothe commented Mar 28, 2026

I didn't see any changes on the wheel size and the memory footprint at import type increase by 4 KiB.

Edit: I was looking at the wrong file. The wheel size decreased by 20 KiB.

@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch 3 times, most recently from a6fad44 to 05dbcbe Compare March 31, 2026 00:34
@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch from 95bbc20 to b0966e8 Compare April 3, 2026 13:43
@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label Apr 6, 2026
@jbrockmendel
Copy link
Copy Markdown
Member

@Alvaro-Kothe can you attend the dev call on the 22nd? I think that's the time+place to convince the rest of the team to move forward with SIMD.

@Alvaro-Kothe
Copy link
Copy Markdown
Member Author

@Alvaro-Kothe can you attend the dev call on the 22nd?

Yeah, I can probably attend.

@jorisvandenbossche
Copy link
Copy Markdown
Member

@Alvaro-Kothe on #64582 (comment) you said:

@WillAyd I did some experimentation with meson's simd module on #64905 and one of the problems that I found is that it doesn't detect neon support (mesonbuild/meson#11209) and it doesn't support AVX512 instructions (mesonbuild/meson#2085).

That makes it essentially a non-starter to use meson-simd?

BTW, as far as I understand, numpy essentially implemented their own simd meson module (presumably because the upstream one wasn't sufficient or marked as experimental). No idea how reusable that would be or if there are plans to separate it from numpy.

@jorisvandenbossche
Copy link
Copy Markdown
Member

@Alvaro-Kothe can you attend the dev call on the 22nd? I think that's the time+place to convince the rest of the team to move forward with SIMD.

@jbrockmendel please also make the case for what you want to do on the issue you opened. It will definitely be useful to talk about the topic tomorrow, but for such an important topic it is also important to have a written account of that and async discussion.

@Alvaro-Kothe
Copy link
Copy Markdown
Member Author

That makes it essentially a non-starter to use meson-simd?

I found meson's unstable-simd module limited -in terms of supported architectures- and risky due to lack of backward compatibility, it even warns:

../../pandas/_libs/meson.build:59: WARNING: Module SIMD has no backwards or forwards compatibility and might not exist in future releases.

BTW, as far as I understand, numpy essentially implemented their own simd meson module (presumably because the upstream one wasn't sufficient or marked as experimental). No idea how reusable that would be or if there are plans to separate it from numpy.

In the NumPy fork of meson, they created a features module to handle their multiple SIMD architectures necessities.

In here, at least for x86 and ARM, it seems possible to replace the built-in meson module with a dictionary and a foreach loop. This gives us full control over compiler flags and necessary meta-programming, similar to the approach used by Krita to handle their xsimd targets.

SIMD Build Sketch

# SIMD architecture configuration: Map name to compiler flags
simd_arch_config = {}

if host_machine.cpu_family() in ['x86', 'x86_64']
    simd_arch_config += {
        'sse2':    {'flags': cxx.get_id() == 'msvc' ? ['/arch:SSE2'] : ['-msse2']},
        'avx2':    {'flags': cxx.get_id() == 'msvc' ? ['/arch:AVX2'] : ['-mavx2']},
        'avx512f': {'flags': cxx.get_id() == 'msvc' ? ['/arch:AVX512'] : ['-mavx512f']},
    }
elif host_machine.cpu_family() == 'aarch64'
    simd_arch_config += {
        'neon64':  {'flags': []} # Baseline for aarch64
    }
endif

simd_libs = []
simd_config = configuration_data()

foreach name, cfg : simd_arch_config
    # Create a unique source file for each instruction set to avoid linking errors
    src = fs.copyfile(
        'my_simd_module_instantiation_.cpp',
        'my_simd_module_@0@.cpp'.format(name),
    )

    # Compile a static library for this specific architecture
    lib = static_library(
        'my_simd_module_@0@'.format(name),
        src,
        include_directories: [templates_includes],
        dependencies: [xsimd_dep],
        cpp_args: cfg['flags'],
    )
    simd_libs += lib

    # Define preprocessor macro (e.g., PANDAS_HAVE_AVX2)
    simd_config.set('PANDAS_HAVE_@0@'.format(name.to_upper()), 1)
endforeach

configure_file(
    output: 'my_simdconfig.h',
    configuration: simd_config,
)

my_simd_libraries_dep = declare_dependency(
    link_with: simd_libs,
    include_directories: include_directories('.'),
    dependencies: [xsimd_dep],
)

@jorisvandenbossche
Copy link
Copy Markdown
Member

BTW, it seems that xsimd is also currently adding more CPU feature detection (xtensor-stack/xsimd#1245, and a bunch of recent PRs that were merged)

@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch from d680ec2 to 4389094 Compare May 2, 2026 18:33
@Alvaro-Kothe Alvaro-Kothe changed the title PERF: [POC] use xsimd with meson simd module to reduce moments PERF: [POC] Add SIMD instructions with xsimd to reduce moments May 2, 2026
@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch 2 times, most recently from 2dbb13e to 3a1e345 Compare May 4, 2026 22:08
@Alvaro-Kothe Alvaro-Kothe changed the title PERF: [POC] Add SIMD instructions with xsimd to reduce moments PERF: Add SIMD instructions with xsimd to reduce moments May 4, 2026
@Alvaro-Kothe Alvaro-Kothe marked this pull request as ready for review May 4, 2026 23:01
@Alvaro-Kothe Alvaro-Kothe requested a review from mroeschke as a code owner May 4, 2026 23:01
@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch 2 times, most recently from ab16e69 to 1d7d462 Compare May 5, 2026 03:17
Copy link
Copy Markdown
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Alvaro-Kothe is there a way we can split this into smaller chunks? As is, this is a huge change and will need a good deal of revision

Maybe its best to get SIMD detection / xsimd integration done as precursors and then come back to the Moments impl?

Comment thread pandas/_libs/src/moments.cpp Outdated
/// https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
void moments_merge(Moments *acc, const Moments *src, int max_moment) {
if (acc->n == 0) {
acc->n = src->n;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a memcpy not a bunch of individual assignments

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

Comment thread pandas/_libs/src/moments.cpp Outdated
return;
}

double n_a = (double)acc->n;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
double n_a = (double)acc->n;
double n_a = acc->n;

These are C-style casts that should not be used in cpp, but also are unnecessary

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This give warnings when compiling with -Wconversion, changed to static_cast

Comment thread pandas/_libs/src/moments.cpp Outdated
#include <math.h>
#include <stdint.h>

extern "C" {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
extern "C" {

The extern block belongs in the header

Comment thread pandas/_libs/meson.build Outdated
supported_simd_archs = {}
if get_option('simd').allowed()
foreach name, flags : simd_arch_flags
if host_machine.cpu_family() in ['x86', 'x86_64']
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this is correct? I am wary of maintaining our own logic for which systems support which instruction sets

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this initial PR I was trying to limit the architectures to the ones we have wheels for.

For the instructions sets, SSE2 is always available on 64-bit, but may not be available on 32-bit, so this is basically creating a minimum CPU requirement in pandas of SSE2 for x86.

Comment thread pandas/_libs/meson.build Outdated
moments_config.set('PANDAS_HAVE_SCALAR', 1)
endif

configure_file(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you will want to install this file. I'd also suggest using a prefix of pandas/ - not that its going to be installed onto a real system but probably best to follow that convention so meson-python intercepts it consistently

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you will want to install this file.

From what I've seen, this would only be included in the wheel, and since we are not installing header files, installing this config file doesn't bring any benefit IMO.

Here are the headers included in the wheel if I were to add this change:

$ unzip -l dist/pandas-3.1.0.dev0+840.g9b7a79ac4b.dirty-cp314-cp314-linux_x86_64.whl "*.h"
Archive:  dist/pandas-3.1.0.dev0+840.g9b7a79ac4b.dirty-cp314-cp314-linux_x86_64.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
      197  05-05-2026 13:25   pandas/_libs/moments_simdconfig.h
---------                     -------
      197                     1 file

I'd also suggest using a prefix of pandas/

Specifying directories in configure_file is only available on meson 1.10.0, For prior versions, I have to move this configuration logic to another meson.build file.

I do agree it's better to have the configuration in another directory, so I will make this change.

Comment thread pandas/_libs/meson.build Outdated
m_dep = cc.find_library('m', required: false)
fast_float = subproject('fast_float')
fast_float_dep = fast_float.get_variable('fast_float_dep')
xsimd = subproject('xsimd')
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still prefer the dependency syntax we have used on other PRs. If we wanted it tied to a particular version we could do that

return moments_acc;
}

#define MOMENTS_EXTERN_TEMPLATE(ARCH) \
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow these patterns - using a macro to then instantiate a template seems to really mix up C / C++ patterns. Can't the template take care of this by itself?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea let's get rid of these extern declarations and move what's in the module to the header. That's less performant from a compilation perspective, but I don't think that's a huge deal. This feels like a premature optimization.

Regarding the mix of macros/templates, let's prefer C++17 features. I think something like this in the config file:

constexpr bool kEnableAVX512CD = @PANDAS_HAVE_AVX512CD@;
constexpr bool kEnableAVX2 = @PANDAS_HAVE_AVX2@;
...

Could simplify the header a lot. To the effect of

constexpr bool kUseSimd = std::enable_if_t<kEnableAVX512CD> || std::enable_if_t<kEnableAVX2>;

struct accumulate_moments_simd {
  template <class Arch>
  Moments operator()([[maybe_unused]] Arch, const double *values, std::size_t n,
                     int skipna, const uint8_t *mask, int max_moment) {
    if constexpr(kUseSimd) {
      // SIMD implementation that forwards the ARCH type to xsimd
    } else {
      // Non-SIMD implementation (uses xsimd::common?)
    }
  }
};

Sorry if misreading - the layers of indirection here are hard to follow, so I think there's a more concise way of expressing

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++20 might make the intent even clearer if we wanted separate methods:

struct accumulate_moments_simd {
  template <class Arch> requires (kUseSimd)
  Moments operator()(Arch, const double *values, std::size_t n,
                     int skipna, const uint8_t *mask, int max_moment) {
      ...
  }

  Moments operator()(xsimd::common, const double *values, std::size_t n,
                     int skipna, const uint8_t *mask, int max_moment) {
      ...
  }
};

There might even be some constexpr vector / enum tricks to be had instead of having to send the Arch type and maintain a constant for allowed SIMD types separately

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having a hard time to understand this one. How would the template be instantiated?

Just to clarify the purpose of MOMENTS_EXTERN_TEMPLATE:

The dispatch method tries to implicitly instantiate all architectures in pandas::moments::arch_list, but it won't compile them without the proper flags. So I created this macro to define external linkage and prevent implicit instantiation.

For MOMENTS_INSTANTIATE:

It's exactly as the name says, it instantiate the method for the target architecture. moments_simd.cpp is compiled multiple times with different compiler flags and selecting the proper instantiation with -DPANDAS_SIMD_<ARCH>, here is the relevant meson instruction:

 moments_libs += static_library(
        'moments_simd_@0@'.format(arch_name),
        'src/moments_simd.cpp',
        include_directories: [inc_pd],
        dependencies: [xsimd_dep],
        cpp_args: arch_flags + [
            '-DPANDAS_SIMD_@0@'.format(arch_name.to_upper()),
        ],

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dispatch method tries to implicitly instantiate all architectures in pandas::moments::arch_list, but it won't compile them without the proper flags. So I created this macro to define external linkage and prevent implicit instantiation.

It could be my lack of understanding but I don't think this is a very common pattern - is that documented by xsimd? I think changing the linkage is more of a compile-time performance optimization, which would barely register in our code base but adds a lot of indirection. In the general case of C++ I would expect the header file to contain the template declaration / instantiations

I am having a hard time to understand this one. How would the template be instantiated?

I don't think the pattern would be any different. If you wanted to lean into the constexpr pattern, you could replace:

#if PANDAS_HAVE_AVX512CD
    ::add<xsimd::avx512cd>
#endif

with

if constexpr (kEnableAVX512CD) {
    ::add<xsimd::avx512cd>
}

But I don't think that matters much for the rest of the implementation

Copy link
Copy Markdown
Member

@WillAyd WillAyd May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not familiar with Krita but on a quick scan I see that the xsimd implementation makes heavy use of macros in the header file to control the implementation:

https://invent.kde.org/kenoi/krita/-/blob/master/libs/pigment/KoOptimizedPixelDataScalerU8ToU16.h

That is not what's going on this PR, which I think is in a middle state. I would prefer the Arrow approach as its a closer project to pandas

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer the Arrow approach as its a closer project to pandas

Fair point. Changed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool thanks. FWIW its also the documented approach for xsimd (I'm learning about this as we go, so appreciate your patience)

https://xsimd.readthedocs.io/en/latest/api/dispatching.html#arch-dispatching

I'd also suggest we get rid of the MOMENTS_INSTANTIATE macro in each file - you can just define the template with the appropriate architecture set

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I remove the MOMENTS_EXTERN_TEMPLATE macro too?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea - let's try to limit our macro use.

Comment thread pandas/_libs/src/moments_simd.cpp Outdated

namespace pandas::moments {

#if defined(PANDAS_SIMD_AVX512CD)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we get rid of the macro wrapping a template as described in another comment then this file seems unnecessary - I think this should all be instatiated in the header

}

template <class Arch>
Moments accumulate_moments_simd::operator()(Arch /*unused*/,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Moments accumulate_moments_simd::operator()(Arch /*unused*/,
Moments accumulate_moments_simd::operator()([[maybe_unused]] Arch,

Is this throwing a warning now?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added /*unused*/ was to make clang-tidy happy.

It isn't possible to use [[maybe_unused]] there, AFAIK, this attribute is for functions and class members.

../../pandas/_libs/include/pandas/moments_simd.hpp|157 col 27| warning: attribute ignored [-Wattributes]
||   157 |   Moments operator()(Arch [[maybe_unused]], const double *values, std::size_t n,
||       |                           ^

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's because you have it backwards - should be [[maybe_unused]] Arch not Arch [[maybe_unused]]

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it was swapped, but it didn't fix the clang-tidy warning

All parameters should be named in a function [readability-named-parameter]

Anyway, guess it can be ignored.


struct accumulate_moments_simd {
template <class Arch>
Moments operator()(Arch /*unused*/, const double *values, std::size_t n,
Copy link
Copy Markdown
Member

@WillAyd WillAyd May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Moments operator()(Arch /*unused*/, const double *values, std::size_t n,
Moments operator()(Arch /*unused*/, std::vector<double> values, std::size_t n,

Unless there's a need for extern linkage (which isn't possible with a templated function anyway) we should be using thre standard C++ types)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dispatch function (reduce_moments) needs linkage with "C" to be used in Cython. AFAIK, there isn't a vector constructor from a pointer and I don't think it's worth copying the data in values into a container.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is reduce_moments defined?

Generally the strategy should be to reduce the size of the C interface when dealing with C++. So I'd be interested to see the call site in Cython and see what we can do to elide the extern C requirement, since AFAIU Cython can call invoke C++ templates directly

If it helps, we could also consider bumping to C++20 and taking advantage of std::span, which essentially declares this as requiring a non-owning view into a buffer of some size

Copy link
Copy Markdown
Member

@WillAyd WillAyd May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see it in algos.pyx - would it make more sense to have an algos_cpp.pyx file that can be more C++ native?

Just curious as I have limited experience with Cython's C++ wrapper. If that's infeasible then I think we just need to re-evaluate the size of the extern declaration, and perhaps create a shim that will take something like the raw C storage and size and convert it to a std::span for the rest of the implementation to use

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is reduce_moments defined?

It's defined in moments.cpp.

Cython can call invoke C++

This is correct, but I think it requires transpiling algos.pyx to C++ instead of C.

Copy link
Copy Markdown
Member Author

@Alvaro-Kothe Alvaro-Kothe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way we can split this into smaller chunks?

Sure, will move SIMD detection to another PR.


struct accumulate_moments_simd {
template <class Arch>
Moments operator()(Arch /*unused*/, const double *values, std::size_t n,
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dispatch function (reduce_moments) needs linkage with "C" to be used in Cython. AFAIK, there isn't a vector constructor from a pointer and I don't think it's worth copying the data in values into a container.

}

template <class Arch>
Moments accumulate_moments_simd::operator()(Arch /*unused*/,
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added /*unused*/ was to make clang-tidy happy.

It isn't possible to use [[maybe_unused]] there, AFAIK, this attribute is for functions and class members.

../../pandas/_libs/include/pandas/moments_simd.hpp|157 col 27| warning: attribute ignored [-Wattributes]
||   157 |   Moments operator()(Arch [[maybe_unused]], const double *values, std::size_t n,
||       |                           ^

Comment thread pandas/_libs/src/moments.cpp Outdated
/// https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
void moments_merge(Moments *acc, const Moments *src, int max_moment) {
if (acc->n == 0) {
acc->n = src->n;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

Comment thread pandas/_libs/src/moments.cpp Outdated
return;
}

double n_a = (double)acc->n;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This give warnings when compiling with -Wconversion, changed to static_cast

Comment thread pandas/_libs/meson.build Outdated
supported_simd_archs = {}
if get_option('simd').allowed()
foreach name, flags : simd_arch_flags
if host_machine.cpu_family() in ['x86', 'x86_64']
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this initial PR I was trying to limit the architectures to the ones we have wheels for.

For the instructions sets, SSE2 is always available on 64-bit, but may not be available on 32-bit, so this is basically creating a minimum CPU requirement in pandas of SSE2 for x86.

Comment thread pandas/_libs/meson.build Outdated
moments_config.set('PANDAS_HAVE_SCALAR', 1)
endif

configure_file(
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you will want to install this file.

From what I've seen, this would only be included in the wheel, and since we are not installing header files, installing this config file doesn't bring any benefit IMO.

Here are the headers included in the wheel if I were to add this change:

$ unzip -l dist/pandas-3.1.0.dev0+840.g9b7a79ac4b.dirty-cp314-cp314-linux_x86_64.whl "*.h"
Archive:  dist/pandas-3.1.0.dev0+840.g9b7a79ac4b.dirty-cp314-cp314-linux_x86_64.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
      197  05-05-2026 13:25   pandas/_libs/moments_simdconfig.h
---------                     -------
      197                     1 file

I'd also suggest using a prefix of pandas/

Specifying directories in configure_file is only available on meson 1.10.0, For prior versions, I have to move this configuration logic to another meson.build file.

I do agree it's better to have the configuration in another directory, so I will make this change.

@@ -0,0 +1,11 @@
project(
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I submitted it a few weeks ago: mesonbuild/wrapdb#2705

)

xsimd_inc = include_directories('include')

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for now, since we are not adding xtl to deal with complex numbers.

version: '14.1.0',
)

xsimd_inc = include_directories('include')
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original goal of this meson.build file was to basically vendor xsimd, if I understand correctly, you are saying to add install_headers there, but IMO it will complicate installing pandas, requiring to add -Cinstall-args=--skip-subprojects to be able to build the wheel and to install from source with pip install

Comment thread meson.options Outdated
@@ -0,0 +1,5 @@
option(
'simd',
type: 'feature',
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto_features is auto by default

@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch 2 times, most recently from 56e25a5 to 1e6cef8 Compare May 6, 2026 18:12
const batch_type mean_v = mean + delta_n;
result.mean = mean_v.first();

assert(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think assert is great to use here - can Cython intercept a C++ exception instead?

Copy link
Copy Markdown
Member Author

@Alvaro-Kothe Alvaro-Kothe May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that assert is the right call here, because I don't want this check in release builds.

This is mainly checking if my math is correct.

$$ \bar x = \frac{\sum \bar x_i n_i}{n} = \bar x_j + \frac{\sum n_i (\bar x_i - \bar x_j)}{n} = \bar x_j + \frac{\delta_j}{n} $$

On debug builds it crashes the python interpreter, e.g.

.python: ../../pandas/_libs/include/pandas/simd/moments_simd.hpp:105: Moments pandas::moments::detail::merge_batches(batch_type &, batch_type &, batch_type &, batch_type &, batch_type &, int)
 [batch_type = xsimd::batch<double>, step = 4UL]: Assertion `n < 0' failed.
Fatal Python error: Aborted

std::size_t vec_size = n - (n % step);
for (std::size_t i = 0; i < vec_size; i += step) {
batch_type val = xsimd::load_unaligned<Arch>(values + i);
auto nan_mask = xsimd::isnan(val);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to compute this when mask is present?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use it inside the if (mask) to abort further computation if there is a NaN value that won't be skipped.

      auto nan_value_not_masked = xsimd::bitwise_andnot(nan_mask, mmask);

batch_type val = xsimd::load_unaligned<Arch>(values + i);
auto nan_mask = xsimd::isnan(val);

if (mask) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing Cython guarantees to send nullptr when the mask is none in Python right? Are there any safeguards against empty or uninitialized masks?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see - you have mask ? mask + vec_size : nullptr inline as an argument.

I really advise that we use C++ constructs like std::optional. The C-style checks are overloaded with many different intents that are difficult to decipher.

I think a std::optional<std::span> argument is much better than a raw pointer. It would require C++20

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the idea of abstracting data access through a span, mainly because on simd you have to pass the raw memory address to load the data, while putting in a span feels like an unnecessary abstraction that will only get in the way.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The xsimd examples in the documentation all use C++ containers. These library features exist for a reason versus writing plain C

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I missed it. Then no objection. I will put the data in a container.


if (xsimd::any(nan_value_not_masked)) {
// values contains a NaN entry that won't be skipped.
return {static_cast<uint64_t>(n), NAN, NAN, NAN, NAN};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we declare the n member to be uint64_t? Can it just be size_t to fit more naturally with the code?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify intent IMO, it represents sample size instead of array size.

Also, on 32-bit size_t may be 32 bits, while with the merge function you can combine multiple chunks together, exceeding this limit.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, on 32-bit size_t may be 32 bits, while with the merge function you can combine multiple chunks together, exceeding this limit.

Really? How is that data addressable if its outside the limits of the system's size_t?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't have to be addressable, you can compute separatelly in Chunks and combine them, e.g., Compute data on a 32 bit array -> store in moments1 -> Compute on another 32 bit array -> store in moments2 -> combine moments1 with moments 2

The result is 33 bits.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm your example is talking about storing data in n not using n as an indexer - are you sure that's the right model?

I'm unclear how you could store results somewhere that you can't access - does xsimd have documentation or examples for what you have in mind?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the n member in the Moments struct isn't an indexer, it represents sample size.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the raw text from Gemini when I asked it to clarify my intent:

"The n member in the Moments struct represents the total sample size, which is distinct from the addressable array size represented by size_t.

While moments_reduce uses size_t to index into memory, the resulting Moments struct is designed to be mergeable via moments_merge. This allows us to compute moments for datasets that are
processed in chunks (e.g., streaming from a file or merging results from multiple threads). On a 32-bit system, the combined n from multiple chunks can easily exceed the 4GB limit of
size_t, even if no single chunk does.


Also, the Moments struct represents a summary object. moments_add_valuem is designed to update it in a streaming fashion and moments_merge in a chunked fashion.

Hope it clarifies. Although, it's probably a little bit premature to prepare these methods for chunked computations and size_t does make sense for current usage.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I don't think what Gemini is telling you is correct. Let's just stick with size_t - it can be changed in the future if needed (this isn't a public API)

Comment thread pandas/_libs/include/pandas/moments.h Outdated
double m4;
} Moments;

void moments_merge(Moments *acc, const Moments *src, int max_moment);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void moments_merge(Moments *acc, const Moments *src, int max_moment);
void moments_merge(Moments& acc, const Moments& src, int max_moment);

I don't think a nullptr is an acceptable input here, right?

See also https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#f60-prefer-t-over-t-when-no-argument-is-a-valid-option

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function was meant to be used in C. But since it isn't used anywhere besides merge to the tail, probably will be better to put it in the c++ header.

@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch 2 times, most recently from fa7e920 to 02063f7 Compare May 6, 2026 22:27
Copy link
Copy Markdown
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making progress - thanks for updating the C++ function signatures

}

/// https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
inline void moments_merge(Moments &acc, const Moments &src, int max_moment) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
inline void moments_merge(Moments &acc, const Moments &src, int max_moment) {
void moments_merge(Moments &acc, const Moments &src, int max_moment) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also should src be implemented as an rvalue reference or is there a reason for it to have a lifetime beyond this function call?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inline is required for linkage.

I changed to rvalue and clang-tidy complained

Std::move of the variable 'tail' of the trivially-copyable type 'Moments' has no effect [hicpp-move-const-arg]
   moments_simd.hpp:156:51: Consider changing the 2nd parameter of 'moments_merge' from 'Moments &&' to 'const Moments &'

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh OK cool - that's a good warning. Makes sense

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inline is required for linkage.

I think you want static then?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Until now I don't know when should static inline be preferred over just inline. AFAIK, inline ensures a single version of this function on the final executable (libalgos) that the linker will choose, probably will reduce binary size, whereas static inline will contain every version of this function in the final binary.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is necessary to use static inline to prevent illegal instruction from occurring.


static inline void moments_add_valuem(Moments &moments, double val,
int max_moment) {
double delta = val - moments.mean;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to replace all of the double types with auto - I think this still looks like a C holdover

See https://herbsutter.com/2013/08/12/gotw-94-solution-aaa-style-almost-always-auto/


namespace detail {

static inline void moments_add_valuem(Moments &moments, double val,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static inline void moments_add_valuem(Moments &moments, double val,
static void moments_add_valuem(Moments &moments, double val,

Or do we need inline here for any reason?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inline is necessary for linkage, but I think that static can be removed.


inline uint64_t get_mask_bits(std::span<const uint8_t> mask) {
uint64_t mask_bits = 0;
for (std::size_t j = 0; j < mask.size(); j++) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this only work if the mask has 64 entries or less?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. With the current implementation, mask.size() is 2, 4 or 8. I always call this function with mask->subspan(i, step)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see - that invariant isn't super clear.

If this is in a critical path then looping like this is a pretty slow way of packing these bits. You should look at the nanoarrow implementation (would also be OK to include here)

https://github.com/apache/arrow-nanoarrow/blob/241764644f15f9d9a94754b9d28b556666385bd1/src/nanoarrow/common/inline_buffer.h#L356

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the current generated assembly for AVX2 and it seems fine. The compiler have the information that the size is 4 and it unrolled the loop.

.L3:
# /usr/lib/gcc/x86_64-redhat-linux/16/include/avxintrin.h:837:   return *(__m256d_u *)__P;
	vmovupd	(%rdx,%rax,8), %ymm1	# MEM[(__m256d_u * {ref-all})values$_M_ptr_1 + i_153 * 8], _31
# ../../pandas/_libs/include/pandas/simd/moments_simd.hpp:50:     if (mask[j] != 0) {
	xorl	%edi, %edi	# mask_bits
	cmpb	$0, (%rsi,%rax)	#, MEM[(const element_type &)mask$_M_ptr_87 + i_153 * 1]
	setne	%dil	#, mask_bits
# /usr/lib/gcc/x86_64-redhat-linux/16/include/avxintrin.h:371:   return (__m256d) __builtin_ia32_cmppd256 ((__v4df)__X, (__v4df)__Y,
	vcmppd	$3, %ymm1, %ymm1, %ymm6	#, _31, _31, _32
# ../../pandas/_libs/include/pandas/simd/moments_simd.hpp:50:     if (mask[j] != 0) {
	cmpb	$0, 1(%rsi,%rax)	#, MEM[(const element_type &)mask$_M_ptr_87 + 1 + i_153 * 1]
	je	.L16	#,
# ../../pandas/_libs/include/pandas/simd/moments_simd.hpp:51:       mask_bits |= (1ULL << j);
	orq	$2, %rdi	#, mask_bits
.L16:
# ../../pandas/_libs/include/pandas/simd/moments_simd.hpp:50:     if (mask[j] != 0) {
	cmpb	$0, 2(%rsi,%rax)	#, MEM[(const element_type &)mask$_M_ptr_87 + 2 + i_153 * 1]
	je	.L17	#,
# ../../pandas/_libs/include/pandas/simd/moments_simd.hpp:51:       mask_bits |= (1ULL << j);
	orq	$4, %rdi	#, mask_bits
.L17:
# ../../pandas/_libs/include/pandas/simd/moments_simd.hpp:50:     if (mask[j] != 0) {
	cmpb	$0, 3(%rsi,%rax)	#, MEM[(const element_type &)mask$_M_ptr_87 + 3 + i_153 * 1]
	je	.L18	#,
# ../../pandas/_libs/include/pandas/simd/moments_simd.hpp:51:       mask_bits |= (1ULL << j);
	orq	$8, %rdi	#, mask_bits
.L18:
# /usr/lib/gcc/x86_64-redhat-linux/16/include/avxintrin.h:861:   return *__P;
	movq	_ZZN5xsimd6kernel9from_maskINS_4avx2EEENS_10batch_boolIdT_EERKS5_mRKNS_3avxEE5lut64@GOTPCREL(%rip), %r14	#, tmp1100
	salq	$5, %rdi	#, tmp622
	vmovdqa	(%rdi,%r14), %ymm3	# MEM[(const __m256i * {ref-all})_56], _57
# /usr/lib/gcc/x86_64-redhat-linux/16/include/avxintrin.h:164:   return (__m256d) __builtin_ia32_andnpd256 ((__v4df)__A, (__v4df)__B);
	vandnpd	%ymm6, %ymm3, %ymm6	# _32, _57, tmp624
# ../../pandas/_libs/include/pandas/simd/moments_simd.hpp:236:       if (xsimd::any(nan_value_not_masked)) [[unlikely]] {
	vtestpd	%ymm6, %ymm6	# tmp624, tmp624
	jne	.L250	#,
# ../../pandas/_libs/include/pandas/simd/moments_simd.hpp:244:     if (!skipna && xsimd::any(nan_mask)) [[unlikely]] {
	testb	%r8b, %r8b	# skipna
	jne	.L21	#,
# ../../pandas/_libs/include/pandas/simd/moments_simd.hpp:244:     if (!skipna && xsimd::any(nan_mask)) [[unlikely]] {
	vtestpd	%ymm3, %ymm3	# _57, _57
	jne	.L251	#,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not just about the loop unrolling but also the individual instructions. Particularly when dealing with bits in high performance code, the smallest instruction change may have a large impact (see apache/arrow-nanoarrow#280)

Its pretty tough to microbenchmark those things in pandas, which is why I think relying on a third party library that has done it and is better suited to maintain is the better approach.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it can be optimized. I didn't create a micro benchmark for this, but created a simple benchmark to evaluate several approaches

Benchmark code and result

import time
import numpy as np
from pandas.core.nanops import nanskew

rng = np.random.default_rng(505)

size = 10_000_000
mask_prob = 0.2
nbench = 10

bench_times = np.zeros(nbench)

values = rng.standard_normal(size=size)
mask = rng.binomial(n=1, p=mask_prob, size=size).astype(np.bool_)

nanskew(values, axis=None, skipna=True, mask=mask)

for i in range(nbench):
    t0 = time.perf_counter()
    nanskew(values, axis=None, skipna=True, mask=mask)
    bench_times[i] = time.perf_counter() - t0

print(f"AVG: {bench_times.mean():.5f}", end=" ")
print(f"RNG: {bench_times.max() - bench_times.min():.5f}", end=" ")
print(f"STD: {bench_times.std():.5f}")

# Benchmark Results (Lower is better)
#
# BaseLine (branch):    AVG: 0.02517 RNG: 0.00066 STD: 0.00017
# Arrow Adaption:       AVG: 0.01952 RNG: 0.00039 STD: 0.00011
# neq 0 lshif template: AVG: 0.01951 RNG: 0.00042 STD: 0.00011
# neq 0 lshif:          AVG: 0.01934 RNG: 0.00053 STD: 0.00016
# assume mask 0/1:      AVG: 0.01889 RNG: 0.00056 STD: 0.00015

neq 0 lshif

--- a/pandas/_libs/include/pandas/simd/moments_simd.hpp
+++ b/pandas/_libs/include/pandas/simd/moments_simd.hpp
@@ -47,9 +47,7 @@ inline void moments_add_valuem(Moments &moments, double val, int max_moment) {
 inline uint64_t get_mask_bits(std::span<const uint8_t> mask) {
   uint64_t mask_bits = 0;
   for (std::size_t j = 0; j < mask.size(); j++) {
-    if (mask[j] != 0) {
-      mask_bits |= (1ULL << j);
-    }
+    mask_bits |= static_cast<uint64_t>(mask[j] != 0) << j;
   }
   return mask_bits;
 }

Arrow adaptation

--- a/pandas/_libs/include/pandas/simd/moments_simd.hpp
+++ b/pandas/_libs/include/pandas/simd/moments_simd.hpp
@@ -44,12 +44,27 @@ inline void moments_add_valuem(Moments &moments, double val, int max_moment) {
   moments.mean += delta_n;
 }
 
+/// Pack bits from boolean mask.
+///
+/// Adapted from nanoarrow.
+/// <https://github.com/apache/arrow-nanoarrow/blob/241764644f15f9d9a94754b9d28b556666385bd1/src/nanoarrow/common/inline_buffer.h#L356-L361>
+template <std::size_t N>
 inline uint64_t get_mask_bits(std::span<const uint8_t> mask) {
+  static_assert(N == 2 || N == 4 || N == 8);
   uint64_t mask_bits = 0;
-  for (std::size_t j = 0; j < mask.size(); j++) {
-    if (mask[j] != 0) {
-      mask_bits |= (1ULL << j);
-    }
+  if constexpr (N >= 2) {
+    mask_bits |= mask[0] != 0;
+    mask_bits |= (mask[1] + 0x1) & 0x2;
+  }
+  if constexpr (N >= 4) {
+    mask_bits |= (mask[2] + 0x3) & 0x4;
+    mask_bits |= (mask[3] + 0x7) & 0x8;
+  }
+  if constexpr (N >= 8) {
+    mask_bits |= (mask[4] + 0xF) & 0x10;
+    mask_bits |= (mask[5] + 0x1F) & 0x20;
+    mask_bits |= (mask[6] + 0x3F) & 0x40;
+    mask_bits |= (mask[7] + 0x7F) & 0x80;
   }
   return mask_bits;
 }
@@ -229,7 +244,7 @@ Moments accumulate_moments_simd::operator()(
     auto nan_mask = xsimd::isnan(val);
 
     if (mask.has_value()) {
-      uint64_t mask_bits = detail::get_mask_bits(mask->subspan(i, step));
+      uint64_t mask_bits = detail::get_mask_bits<step>(mask->subspan(i, step));
       auto mmask = xsimd::batch_bool<double, Arch>::from_mask(mask_bits);
       auto nan_value_not_masked = xsimd::bitwise_andnot(nan_mask, mmask);

neq 0 lshift template

--- a/pandas/_libs/include/pandas/simd/moments_simd.hpp
+++ b/pandas/_libs/include/pandas/simd/moments_simd.hpp
@@ -44,12 +44,27 @@ inline void moments_add_valuem(Moments &moments, double val, int max_moment) {
   moments.mean += delta_n;
 }
 
+/// Pack bits from boolean mask.
+///
+/// Adapted from nanoarrow.
+/// <https://github.com/apache/arrow-nanoarrow/blob/241764644f15f9d9a94754b9d28b556666385bd1/src/nanoarrow/common/inline_buffer.h#L356-L361>
+template <std::size_t N>
 inline uint64_t get_mask_bits(std::span<const uint8_t> mask) {
+  static_assert(N == 2 || N == 4 || N == 8);
   uint64_t mask_bits = 0;
-  for (std::size_t j = 0; j < mask.size(); j++) {
-    if (mask[j] != 0) {
-      mask_bits |= (1ULL << j);
-    }
+  if constexpr (N >= 2) {
+    mask_bits |= mask[0] != 0;
+    mask_bits |= (mask[1] != 0) << 1;
+  }
+  if constexpr (N >= 4) {
+    mask_bits |= (mask[2] != 0) << 2;
+    mask_bits |= (mask[3] != 0) << 3;
+  }
+  if constexpr (N >= 8) {
+    mask_bits |= (mask[4] != 0) << 4;
+    mask_bits |= (mask[5] != 0) << 5;
+    mask_bits |= (mask[6] != 0) << 6;
+    mask_bits |= (mask[7] != 0) << 7;
   }
   return mask_bits;
 }
@@ -229,7 +244,7 @@ Moments accumulate_moments_simd::operator()(
     auto nan_mask = xsimd::isnan(val);
 
     if (mask.has_value()) {
-      uint64_t mask_bits = detail::get_mask_bits(mask->subspan(i, step));
+      uint64_t mask_bits = detail::get_mask_bits<step>(mask->subspan(i, step));
       auto mmask = xsimd::batch_bool<double, Arch>::from_mask(mask_bits);
       auto nan_value_not_masked = xsimd::bitwise_andnot(nan_mask, mmask);

assume mask 0/1

--- a/pandas/_libs/include/pandas/simd/moments_simd.hpp
+++ b/pandas/_libs/include/pandas/simd/moments_simd.hpp
@@ -47,9 +47,8 @@ inline void moments_add_valuem(Moments &moments, double val, int max_moment) {
 inline uint64_t get_mask_bits(std::span<const uint8_t> mask) {
   uint64_t mask_bits = 0;
   for (std::size_t j = 0; j < mask.size(); j++) {
-    if (mask[j] != 0) {
-      mask_bits |= (1ULL << j);
-    }
+    assert(mask[j] == 0 || mask[j] == 1);
+    mask_bits |= (mask[j] << j);
   }
   return mask_bits;
 }

Let me know which patch you prefer.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just add nanoarrow and use it - no reason for us to hold this as an optimization

double n = xsimd::reduce_add(nobs);
assert(n >= 0);
result.n = static_cast<std::uint64_t>(n);
if (n == 0) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have an assert that n >= 0 two lines above this but then branch on the n == 0 case? I think one of those is off

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should be using result.n in the if branch. But I don't see anything problematic with the checks. The assertion shows that n should be non-negative. In case n is 0, there is nothing to do and simply return the Moments struct zero initialized.

batch_type delta2_n = delta_n * delta_n;

if (max_moment >= 4) {
const batch_type m3_term = batch_type(-4.0) * m3;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These would also be good places to use auto instead of repeating the type specifier

nobs, mean, m2, m3, m4, max_moment);
Moments tail = accumulate_moments_simd{}(
xsimd::common{}, values.subspan(vec_size, tail_size), skipna,
mask ? std::optional(mask->subspan(vec_size, tail_size)) : std::nullopt,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this expression out of the function call? It would help with readability

std::optional<std::span<const uint8_t>>, int) noexcept;
#endif

using arch_list = xsimd::arch_list<>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not possible to fold these macros into the ones above so they don't need to be repeated? I'm not sure of the syntax but it feels like it should be possible, possibly with constexpr

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to reduce the #if macro usage by creating a list incrementally with using l{x+1} = lx, but personally, I don't like it. IMO, an X macro is more appropriate here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at xsimd I'm not sure we even need these in the first place (?). Doesn't ::add internally take care of checking if an instruction set is supported?

https://github.com/xtensor-stack/xsimd/blob/a9039449fdfd3cb4816c6c33c45deebf7183af29/include/xsimd/config/xsimd_arch.hpp#L136

If not, then we should at the very least use the constexpr system ala xsimd to add these

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at xsimd I'm not sure we even need these in the first place (?). Doesn't ::add internally take care of checking if an instruction set is supported?

I got a runtime error when I removed the macros.

ImportError: /home/alvaro/oss/pandas/build/cp314/pandas/_libs/algos.cpython-314-x86_64-linux-gnu.so: undefined symbol: _ZN6pandas7moments23accumulate_moments_simdclIN5xsimd6neon64
EEE7MomentsT_St4spanIKdLm18446744073709551615EEbSt8optionalIS7_IKhLm18446744073709551615EEEi

If not, then we should at the very least use the constexpr system ala xsimd to add these

It's still not clear to me how to remove these configuration macros. arch_list is a type and should be constructed with a single statement.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see where that is confusing - it looks like xsimd offers a xsimd::detail::supported type which can be used to filter an arch_list from a set of types to those that are actually supported. I'm not sure why that is in the detail namespace, but it appears to have the missing functionality.

You may want to ask upstream why that isn't public, and if there's a desire to make it so in the future. In the meantime you could still just use it, as its available in a header file.

Keep in mind that if you stick to that pattern, you can leverage other parts of the static type system for xsimd, like the xsimd::arch_list::best. Things like that are very difficult to represent with macros, so we want to implement design patterns that we will use consistently

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an entire section in the core C++ guidelines about avoiding the use of macros.

https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#es33-if-you-must-use-macros-give-them-unique-names

They show some lenience in using macros for configuration control (this case) in ES.30

Note This rule does not ban the use of macros for “configuration control” use in #ifdefs, etc.

In the future, modules are likely to eliminate the need for macros in configuration control.

Here's some example code you can reference

Thanks for the example, I thought in using something like this, but again, it would rely on macros to define these constexpr variables, since meson configuration is designed to define macros.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the example, I thought in using something like this, but again, it would rely on macros to define these constexpr variables, since meson configuration is designed to define macros.

You are correct that the config file is where this needs to be done, but I think are missing the way to do it. Essentially you would make the constexpr items live in an input template, which here can be pandas_simd_config.h.in. The contents would look something like:

#pragma once

static constexpr bool kHasAVX2 = @PANDAS_HAVE_AVX2@;
static constexpr bool kHasSSE2 = @PANDAS_HAVE_SSE2@;
...

and then provide that input template as an argument to configure_file

configure_file(
  input: 'pandas_simd_config.h.in',
  output: 'pandas_simd_config.h',
  ...
)

Copy link
Copy Markdown
Member Author

@Alvaro-Kothe Alvaro-Kothe May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially you would make the constexpr items live in an input template

I'm not sure about using an input template for constexpr here. Meson's configure_file (like CMake's) is designed, mainly, to generate macro headers where keys can be left undefined.

Moving to an .in template with constexpr booleans adds another file to keep track of supported micro architectures. It also adds boilerplate, forcing us to track unsupported architectures in the build script, whereas the current setup is much simpler: it only cares about what is actually supported by the compiler.

The std::conditional_t approach in #64905 (comment) also adds complexity. Using void as a placeholder for unsupported architectures breaks xsimd::dispatch:

#include <iostream>
#include <xsimd/xsimd.hpp>

constexpr bool hasAvx2 = true;
constexpr bool hasSse2 = false;

struct S {
    template <class Arch>
    void operator()(Arch arch) {
        std::cout << "Dispatched to " << arch.name() << std::endl;
    }
};

int main() {
  using arch_list = xsimd::arch_list<>::extend < std::conditional_t<hasAvx2, xsimd::avx2, void>,
                                                 std::conditional_t<hasSse2, xsimd::sse2, void>
                                                 >;

    auto f = xsimd::dispatch<arch_list>(S{});
    f();

    return 0;
}
subprojects/xsimd-14.2.0/include/xsimd/memory/../config/xsimd_arch.hpp: In instantiation of ‘auto xsimd::detail::dispatcher<F, ArchList>::walk_archs(xsimd::arch_list<Arch>, Tys&& ...) [with Arch = void; Tys = {}; F = S; ArchList = xsimd::arch_list<xsimd::avx2, void>]’:
subprojects/xsimd-14.2.0/include/xsimd/memory/../config/xsimd_arch.hpp:215:38:   required from ‘auto xsimd::detail::dispatcher<F, ArchList>::walk_archs(xsimd::arch_list<Arch, ArchNext, Archs...>, Tys&& ...) [with Arch = xsimd::avx2; ArchNext = void; Archs = {}; Tys = {}; F = S; ArchList = xsimd::arch_list<xsimd::avx2, void>]’
  215 |                     return walk_archs(arch_list<ArchNext, Archs...> {}, std::forward<Tys>(args)...);
      |                            ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
subprojects/xsimd-14.2.0/include/xsimd/memory/../config/xsimd_arch.hpp:228:34:   required from ‘auto xsimd::detail::dispatcher<F, ArchList>::operator()(Tys&& ...) [with Tys = {}; F = S; ArchList = xsimd::arch_list<xsimd::avx2, void>]’
  228 |                 return walk_archs(ArchList {}, std::forward<Tys>(args)...);
      |                        ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
t.cpp:20:6:   required from here
   20 |     f();
      |     ~^~
subprojects/xsimd-14.2.0/include/xsimd/memory/../config/xsimd_arch.hpp:205:39: error: incomplete type ‘void’ used in nested name specifier
  205 |                 assert(Arch::available() && "At least one arch must be supported during dispatch");
      |                        ~~~~~~~~~~~~~~~^~
subprojects/xsimd-14.2.0/include/xsimd/memory/../config/xsimd_arch.hpp:206:31: error: invalid use of void expression
  206 |                 return functor(Arch {}, std::forward<Tys>(args)...);
      |                        ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Also, macros keep this configuration accessible to C and Cython, if necessary.

Even the CppCoreGuidelines allow for this:

Enforcement: Scream when you see a macro that isn’t just used for source control (e.g., #ifdef)

IMO, using template headers with constexpr for the sake of avoiding macros feels like sacrificing simplicity and portability without a concrete benefit.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meson's configure_file (like CMake's) is designed, mainly, to generate macro headers where keys can be left undefined.

This is a very common pattern with both Meson and CMake. The latter always requires you to specify a file like this. We just generating a header.

The std::conditional_t approach in #64905 (comment) also adds complexity. Using void as a placeholder for unsupported architectures breaks xsimd::dispatch:

It looks like xsimd offers an xsimd::unavailable sentinel - if you use that in place of void in your example it will compile and run.

Also, macros keep this configuration accessible to C and Cython, if necessary.

If you wanted to you could name this output .hpp to clarify that it is only used for C++ projects. I see what you are saying, but C / C++ are separate languages; discarding functionality of the latter for compatibility with the former is not worthwhile

Even the CppCoreGuidelines allow for this:

Enforcement: Scream when you see a macro that isn’t just used for source control (e.g., #ifdef)

This is exactly the problem. You are using macros when you are trying to control a type. Both xsimd and C++ offer robust compile time type manipulations that you are forgoing.

IMO, using template headers with constexpr for the sake of avoiding macros feels like sacrificing simplicity and portability without a concrete benefit.

How does this sacrifice portability? This is all part of the C++ standard.

I will never claim that C++ is a simple language. However, its features do exist for good reason

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very common pattern with both Meson and CMake. The latter always requires you to specify a file like this. We just generating a header.

What I meant is that it's possible to don't define some keys if you are using #mesondefine or #cmakedefine, here is from meson docs

/* undef TOKEN */ // If TOKEN has not been set to any value.

Using constexpr forces to define every architecture that the compiler support and doesn't support. For me, it's better to don't fight the build system on this and define this constexpr variables using the macros.

It looks like xsimd offers an xsimd::unavailable sentinel - if you use that in place of void in your example it will compile and run.

It seems that xsimd::unavailable can't be a placeholder for unavailable architectures: https://godbolt.org/z/PExG4raPf

The alternative is that stair of type definitions.

You are using macros when you are trying to control a type

Fair point, but the end goal is to define dispatch logic.

It seems that a function pointer like pattern may fit better with what you want, since, apparently, it doesn't implicitly instantiate inside a false if constexpr block (haven't checked MSVC): https://godbolt.org/z/ejbcTn8rq

How does this sacrifice portability? This is all part of the C++ standard.

I meant it by portability to be used in C and Cython.


template <class Arch>
Moments accumulate_moments_simd::operator()(
Arch, std::span<const double> values, int skipna,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Arch, std::span<const double> values, int skipna,
Arch, std::span<const double> values, bool skipna,

nobs = xsimd::rotate_left<1>(nobs);
delta += nobs * (mean - mean_copy);
}
// rotate once more to go back to original position
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be best to only have this template accept batch_type and detect the step size from it again; otherwise we are introducing an invariant that the caller always provides both, and that isn't clear/documented

@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch from 3975564 to 4236d9f Compare May 7, 2026 19:04

inline uint64_t get_mask_bits(std::span<const uint8_t> mask) {
uint64_t mask_bits = 0;
for (std::size_t j = 0; j < mask.size(); j++) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see - that invariant isn't super clear.

If this is in a critical path then looping like this is a pretty slow way of packing these bits. You should look at the nanoarrow implementation (would also be OK to include here)

https://github.com/apache/arrow-nanoarrow/blob/241764644f15f9d9a94754b9d28b556666385bd1/src/nanoarrow/common/inline_buffer.h#L356

}

/// https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
inline void moments_merge(Moments &acc, const Moments &src, int max_moment) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inline is required for linkage.

I think you want static then?

}

using l0 = xsimd::arch_list<>;
#if PANDAS_HAVE_AVX512CD
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm so nothing with constexpr helped here at all huh? The macro usage is a bit heavy-handed

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to do the simplest thing... I reverted back to the way that I was building the arch_list, but removed the if directives around the external linkage commands.

@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch 2 times, most recently from 3256fd3 to 9cda6c0 Compare May 8, 2026 19:05

/// Pack bits from boolean mask.
///
/// Adapted from nanoarrow.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't asking to vendor this but rather just add nanoarrow and call the function. There's likely many more uses for nanoarrow in the future anyway, if you plan on doing more high performance calls like this in C++

///
/// Adapted from nanoarrow.
/// <https://github.com/apache/arrow-nanoarrow/blob/241764644f15f9d9a94754b9d28b556666385bd1/src/nanoarrow/common/inline_buffer.h#L356-L361>
template <std::size_t N>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proper way to call this would be to unpack 64 bits at a time and adjust the calling logic appropriately. Only unpacking 2 bits at a time is extremely slow

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was simpler to reason about processing the mask at the same rate as I process the values.

The proper way to call this would be to unpack 64 bits

IMO, it'll be more natural (and probably more performant) to use SIMD instructions to process the mask instead of unpacking it into a 64 bit bitmask. I'll check if it's feasible.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little bit rough on the edges, but seems to be working. The core idea is to use xsimd::widen to distribute the batches of uint8_t into batches of uint64_t for then to be converted into packed doubles.

Copy link
Copy Markdown
Member

@WillAyd WillAyd May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm why is a double required? That's a huge amount of memory overhead, particularly if starting with a bitmask

If xsimd can't support a bit/byte mask conversion then just stick with nanoarrow - it seems unlikely that widening further is worth it

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm why is a double required? That's a huge amount of memory overhead, particularly if starting with a bitmask

In SSE2, there aren't instructions that uses a bitmask. Instead, masking is performed using vector masks, where each lane must match the width of the data being processed. Since we are calculating on double values, any mask must be widened to 64 bits per element.

If xsimd can't support a bit/byte mask conversion then just stick with nanoarrow - it seems unlikely that widening further is worth it

If we were to use a bitmask and use xsimd::from_mask, xsimd's current implementation for AVX2 constructs an array and performs a load
(https://github.com/xtensor-stack/xsimd/blob/a9039449fdfd3cb4816c6c33c45deebf7183af29/include/xsimd/arch/xsimd_avx.hpp#L616-L639).

Which seems worse than widening IMO, since it has to perform a lot more loads and comparisons to build the bitmask, then for each block it must perform a bitshift + land to generate the correct bitmask for the batch and finally will end up constructing the mask manually.

That's a huge amount of memory overhead

If memory pressure is a concern, probably it's better to write this function directly in accumulate_moments_simd_masked_impl to reduce register pressure, since apparently this function wasn't inlined:

generated assembly

_ZN6pandas7moments6detail17convert_u8_to_u64IN5xsimd4avx2EEESt5arrayINS3_5batchImT_EELm8EENS6_IhS7_EE:
.LFB16169:
	.cfi_startproc
	vmovdqa	%xmm0, %xmm1
	vextracti128	$0x1, %ymm0, %xmm0
	movq	%rdi, %rax
	vpmovzxbw	%xmm1, %ymm1
	vpmovzxbw	%xmm0, %ymm0
	vmovdqa	%xmm1, %xmm3
	vmovdqa	%xmm0, %xmm2
	vextracti128	$0x1, %ymm1, %xmm1
	vextracti128	$0x1, %ymm0, %xmm0
	vpmovzxwd	%xmm3, %ymm3
	vpmovzxwd	%xmm1, %ymm1
	vpmovzxwd	%xmm2, %ymm2
	vpmovzxwd	%xmm0, %ymm0
	vpmovzxdq	%xmm3, %ymm6
	vpmovzxdq	%xmm1, %ymm7
	vpmovzxdq	%xmm2, %ymm5
	vpmovzxdq	%xmm0, %ymm4
	vmovdqa	%ymm6, (%rdi)
	vextracti128	$0x1, %ymm3, %xmm3
	vextracti128	$0x1, %ymm1, %xmm1
	vextracti128	$0x1, %ymm2, %xmm2
	vmovdqa	%ymm7, 64(%rdi)
	vextracti128	$0x1, %ymm0, %xmm0
	vpmovzxdq	%xmm3, %ymm3
	vpmovzxdq	%xmm1, %ymm1
	vmovdqa	%ymm5, 128(%rdi)
	vpmovzxdq	%xmm2, %ymm2
	vpmovzxdq	%xmm0, %ymm0
	vmovdqa	%ymm3, 32(%rdi)
	vmovdqa	%ymm1, 96(%rdi)
	vmovdqa	%ymm2, 160(%rdi)
	vmovdqa	%ymm4, 192(%rdi)
	vmovdqa	%ymm0, 224(%rdi)
	ret

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are making this very complicated; nanoarrow has a function for packing/unpacking as needed that solves the problem. Let's just start with that and leave it to a future enhancement to do something else

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using xsimd function batch_bool.mask from xsimd to pack the bits.

nanoarrow has a function for packing/unpacking as needed that solves the problem. Let's just start with that and leave it to a future enhancement to do something else

The main problem is that this PR is already introducing a lot of changes.

  1. Adding a new C++ dependency
  2. Adding SIMD
  3. Increasing the standard to C++ 20

Adding arrow/nanoarrow as a C++ dependency will ramp up complexity more than it should IMO for a PR whose goal is to start introducing SIMD in the codebase.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also am trying to minimize the amount of change. Nanoarrow is installable via:

meson wrap install nanoarrow

can be added as a dependency via:

nanoarrow_dep = dependency('nanoarrow')

and can replace all of the bit packing/unpacking code that you have here.

Overall I feel like that is a much simpler solution, and it prevents us from maintaining bit fiddling code ourselves.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can replace all of the bit packing/unpacking code that you have here.

The bit unpacking in here seems minimal and uses SIMD instructions specific for the architecture in use.

Load and compare several values in the mask at once here:

    const mask_batch_type mask8 = xsimd::load_unaligned<A>(&mask[right]);
    xsimd::batch_bool<uint8_t, A> isna_mask = mask8 != mask_batch_type(0U);
    if (!xsimd::any(isna_mask)) {
      continue;
    }

Pack it with std::uint64_t isna_bitmask = isna_mask.mask();

And create a mask with

      auto isna_pd = xsimd::batch_bool<double, A>::from_mask(
          (isna_bitmask >> i) & ((step * 2) - 1));

@jbrockmendel
Copy link
Copy Markdown
Member

I haven't followed too closely and am happy to defer to you two on most of this. Just want to comment that our Bus Factor is higher in c than cpp (and higher yet in cython). Shouldn't be the main factor, but if there's need of a tie-breaking factor.

@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch from 8659a91 to 8ca9a2c Compare May 12, 2026 02:01
std::optional<std::span<const uint8_t>>, int) noexcept;
#endif

using arch_list = xsimd::arch_list<>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see where that is confusing - it looks like xsimd offers a xsimd::detail::supported type which can be used to filter an arch_list from a set of types to those that are actually supported. I'm not sure why that is in the detail namespace, but it appears to have the missing functionality.

You may want to ask upstream why that isn't public, and if there's a desire to make it so in the future. In the meantime you could still just use it, as its available in a header file.

Keep in mind that if you stick to that pattern, you can leverage other parts of the static type system for xsimd, like the xsimd::arch_list::best. Things like that are very difficult to represent with macros, so we want to implement design patterns that we will use consistently

@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch from 3510676 to 381be60 Compare May 13, 2026 00:24
@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch 2 times, most recently from 2d45dd6 to f847dbc Compare May 19, 2026 22:10
perf: use simd instructions directly instead of intermediate array

build: remove some meson dependencies

build: retry not copying sources

fix: fix type mismatch for moments

fix: include cmath for nan

fix: make conversion explicit

refactor: remove implicitly included directory

fix: remove extern C in cpp file

refactor: remove unused headers

fix: address review comments in moments.cpp

build: add minimum meson version for xsimd

fix: add maybe_unused directive

fix: add missing files

refactor: remove maybe_unused directive

refactor: use separate files for instantiation

refactor: remove MOMENTS_* macros

fix: bump to c++20 and use optional, span and unlikely

fix: use relative tolerance for assertion

refactor: move moments_merge to c++

refactor: change `n` member to `size_t` in `Moments` struct

refactor: organize math expressions

refactor: remove static from moments_add_valuem

fix: use result.n in if branch

refactor: use auto

refactor: move subspan outside of function call

refactor: remove step outside of template

refactor: use bool for skipna

refactor: deduplicate #if macro usage

fix: try to fix maybe_unitialized warning

refactor: mark as external and don't use it

fix: ensure internal linkage in simd module

perf: use nanoarrow method to pack bits

refactor: improve naming

refactor: increase similarity between moments update functions

perf: use simd instructions to handle mask directly

use bitmask

chore: improve readability

- Reduce code duplication
- Improve whitespace distribution
- Improve naming

refactor: use struct to store batches accumulators

perf: fix Wnrvo warning
@Alvaro-Kothe Alvaro-Kothe force-pushed the perf/skew-kurt-omp-xsimd branch from f847dbc to eca6f31 Compare May 20, 2026 00:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Build Library building on various platforms Performance Memory or execution speed performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants