Skip to content

export Spopcount and use extrinsics when possible#1037

Open
burgerrg wants to merge 5 commits intomainfrom
bburger/popcount
Open

export Spopcount and use extrinsics when possible#1037
burgerrg wants to merge 5 commits intomainfrom
bburger/popcount

Conversation

@burgerrg
Copy link
Copy Markdown
Contributor

@burgerrg burgerrg commented Apr 16, 2026

Addresses #1032

@burgerrg burgerrg requested a review from mflatt April 16, 2026 21:43
@mflatt
Copy link
Copy Markdown
Contributor

mflatt commented Apr 16, 2026

The reason to have popcount implemented in a header file is so uses can be inlined by the C compiler. How about taking the implementation in "popcount.h" and moving it to "mkheader.ss" to that it can be part of a generated "scheme.h"?

Your improvements to use __builtin_popcountll and such seem good to include in an inlined variant of "popcount.h". I don't think it's worth trying to use MSVC's __popcnt64 and having to guard against whether a machine supports it at run time, though; an inlined Spopcount using Spopcount_32 should be pretty fast.

@burgerrg
Copy link
Copy Markdown
Contributor Author

burgerrg commented Apr 17, 2026

Wow, the clang, gcc, and Microsoft Visual Studio 2026 compilers all recognize the "slow" code in popcount.h that computes popcount and emit the instruction for it on x64, x86, and arm64! As a result, we don't need to bother with the builtins at all if we don't want to!

@burgerrg
Copy link
Copy Markdown
Contributor Author

Ugh, if I define static int Spopcount(uptr x) ... in scheme.h, I get compiler warnings on every file that does not use the Spopcount function.

@burgerrg
Copy link
Copy Markdown
Contributor Author

Calling Spopcount as a function does increase the overhead, but only by a few clock cycles. Since it makes scheme.h much simpler, I'm going to use that approach. I will eliminate the __popcnt intrinsics in Windows because they're not easy to use correctly, and the latest Microsoft compiler converts our C code into the popcnt instruction.

@burgerrg
Copy link
Copy Markdown
Contributor Author

I can use #define to the intrinsic when we're compiling with clang/gcc, and otherwise I can call the extern function. Here's what it looks like in a 64-bit machine:

#if defined(__clang__) || defined(__GNUC__)
#define Spopcount(x) __builtin_popcountll((unsigned long long)(x))
#else
EXPORT int S_popcount(uptr);
#define Spopcount S_popcount
#endif

@mflatt
Copy link
Copy Markdown
Contributor

mflatt commented Apr 17, 2026

Ugh, if I define static int Spopcount(uptr x) ... in scheme.h, I get compiler warnings on every file that does not use the Spopcount function.

I think the way to suppress those warnings would be to use static inline instead of just static.

@burgerrg
Copy link
Copy Markdown
Contributor Author

burgerrg commented Apr 17, 2026

Thank you, that worked! What do you think of this in scheme.h for 64-bit machines?

static inline int Spopcount(uptr x) {
#if defined(__clang__) || defined(__GNUC__)
  return __builtin_popcountll((unsigned long)x);
#else
  /* count bits of each 2-bit chunk */
  x = x - ((x >> 1) & 0x5555555555555555ULL);
  /* count bits of each 4-bit chunk */
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
  /* count bits of each 8-bit chunk */
  x = x + (x >> 4);
  /* mask out junk */
  x &= 0x0F0F0F0F0F0F0F0FULL;
  /* add all 8-bit chunks */
  return (x * 0x0101010101010101ULL) >> 56;
#endif
}

@mflatt
Copy link
Copy Markdown
Contributor

mflatt commented Apr 17, 2026

I'm suspicious of the cast to unsigned long there. If GCC or Clang is used to compile for 64-bit windows, wouldn't unsigned long be too few bits? Maybe unsigned long long was intended?

I lose track of what unsigned long long means on various platforms. The old "popcount.h" based everything on U32 because that's the same everywhere and consistent with int for __builtin_popcount everywhere — basically, what I was able to reason about confidently. But I'm happy with other approaches if we're sure we have the integer sizes right.

I can believe that the non-intrinsic implementation could work for 64 bits. :) If compilers recognize it and convert it to a popcount instruction, then I'm completely convinced, of course.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants