Add SSE2 implementations of to_chars and from_chars #190
Merged
pdimov merged 6 commits into boostorg:develop on Jan 12, 2026
4a1133b to 665de40
Thanks very much for all the work you've been doing.
... but please let's not do that.
665de40 to bdde014
Ok, I've removed the changes to CMakeLists.txt.
This adds SSE2 code paths to to_chars_x86.hpp. The performance effect on Intel Golden Cove (Core i7-12700K), gcc 13.3, in millions of to_chars() calls per second with a 16-byte aligned output buffer:

Char     | Generic | SSE2            | SSE4.1           | AVX2             | AVX10.1
=========+=========+=================+==================+==================+=================
char     | 202.314 | 564.857 (2.79x) | 1194.772 (5.91x) | 1192.094 (5.89x) | 1191.838 (5.89x)
char16_t | 188.532 | 457.281 (2.43x) |  795.798 (4.22x) |  935.016 (4.96x) |  938.368 (4.98x)
char32_t | 193.151 | 345.612 (1.79x) |  489.620 (2.53x) |  688.829 (3.57x) |  689.617 (3.57x)

Here, Generic column was generated with BOOST_UUID_NO_SIMD defined and SSE2 with -march=x86-64.

SSE2 support can be useful in cases when users need to be compatible with the base x86-64 ISA.
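To give a feel for what an SSE2-only formatting path can look like, here is a minimal sketch that converts 16 bytes to lowercase hex using only baseline x86-64 instructions (no pshufb, which needs SSSE3). This is not the code from to_chars_x86.hpp; the names demo_bytes_to_hex32, demo_format_uuid and DEMO_HAS_SSE2 are hypothetical, and dash insertion is done with plain scalar copies for brevity. A scalar fallback is included so the sketch also builds on non-x86 targets.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define DEMO_HAS_SSE2 1
#else
#define DEMO_HAS_SSE2 0
#endif

// Hypothetical helper: writes 32 lowercase hex characters for 16 input bytes
// (no dashes, no terminating zero).
inline void demo_bytes_to_hex32(const unsigned char* in, char* out)
{
    assert(in != nullptr && out != nullptr);
#if DEMO_HAS_SSE2
    const __m128i bytes = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    const __m128i nibble_mask = _mm_set1_epi8(0x0F);

    // Split every byte into its high and low 4-bit halves. The 16-bit shift
    // leaks bits across byte boundaries, but the mask discards them.
    const __m128i hi = _mm_and_si128(_mm_srli_epi16(bytes, 4), nibble_mask);
    const __m128i lo = _mm_and_si128(bytes, nibble_mask);

    // Interleave so each byte's high nibble precedes its low nibble.
    __m128i first = _mm_unpacklo_epi8(hi, lo);   // nibbles of bytes 0..7
    __m128i second = _mm_unpackhi_epi8(hi, lo);  // nibbles of bytes 8..15

    // Nibble to ASCII without a table lookup: add '0', then add
    // 'a' - '0' - 10 (= 39) in the lanes where the nibble is greater than 9.
    const __m128i nine = _mm_set1_epi8(9);
    const __m128i zero_char = _mm_set1_epi8('0');
    const __m128i alpha_adjust = _mm_set1_epi8('a' - '0' - 10);
    first = _mm_add_epi8(_mm_add_epi8(first, zero_char),
                         _mm_and_si128(_mm_cmpgt_epi8(first, nine), alpha_adjust));
    second = _mm_add_epi8(_mm_add_epi8(second, zero_char),
                          _mm_and_si128(_mm_cmpgt_epi8(second, nine), alpha_adjust));

    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), first);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 16), second);
#else
    // Portable fallback for targets without SSE2.
    static const char digits[] = "0123456789abcdef";
    for (std::size_t i = 0; i < 16; ++i)
    {
        out[2 * i] = digits[in[i] >> 4];
        out[2 * i + 1] = digits[in[i] & 0x0F];
    }
#endif
}

// Hypothetical wrapper: produces the canonical 8-4-4-4-12 text form.
inline std::string demo_format_uuid(const unsigned char (&bytes)[16])
{
    char hex[32];
    demo_bytes_to_hex32(bytes, hex);

    std::string s;
    s.reserve(36);
    s.append(hex, 8);       s += '-';
    s.append(hex + 8, 4);   s += '-';
    s.append(hex + 12, 4);  s += '-';
    s.append(hex + 16, 4);  s += '-';
    s.append(hex + 20, 12);
    return s;
}
```

Calling demo_format_uuid on a 16-byte array returns a 36-character std::string. A production implementation would presumably also place the dashes with vector operations and write straight into the caller's buffer, which this sketch does not attempt.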
This adds SSE2 and SSSE3 code paths to from_chars_x86.hpp. The performance effect on Intel Golden Cove (Core i7-12700K), gcc 13.3, in millions of successful from_chars() calls per second:

Char     | Generic | SSE2            | SSSE3           | SSE4.1          | AVX2            | AVX512v1
=========+=========+=================+=================+=================+=================+================
char     |  40.475 | 327.791 (8.10x) | 465.857 (11.5x) | 555.346 (13.7x) | 504.648 (12.5x) | 539.700 (13.3x)
char16_t |  38.757 | 292.048 (7.54x) | 401.117 (10.3x) | 478.574 (12.3x) | 426.188 (11.0x) | 416.205 (10.7x)
char32_t |  50.200 | 150.900 (3.01x) | 204.588 (4.08x) | 389.882 (7.77x) | 359.591 (7.16x) | 349.663 (6.97x)

In addition, the workarounds to avoid (v)pblendvb instructions have been extended to Intel Haswell and Broadwell, as these microarchitectures have poor performance with these instructions (including the SSE4.1 pblendvb).

Two new experimental control macros added: BOOST_UUID_FROM_CHARS_X86_SLOW_PBLENDVB and BOOST_UUID_FROM_CHARS_X86_USE_PBLENDVB. The former indicates that (v)pblendvb instructions are slow and should be avoided on the target microarchitectures. The latter indicates that (v)pblendvb should be used by the implementation. The latter macro is derived from the former and takes precedence. As before, these macros can be used for experimenting and fine tuning performance for specific target CPUs. By default, BOOST_UUID_FROM_CHARS_X86_SLOW_PBLENDVB is defined for Haswell/Broadwell or if AVX is detected.

Lastly, made selection between blend-based and shuffle-based character code conversion in various places unified, controlled by a single internal macro BOOST_UUID_DETAIL_FROM_CHARS_X86_USE_BLENDS.
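For readers unfamiliar with what avoiding pblendvb means in practice: a per-byte blend can be replaced by an and/andnot/or sequence, which is available in plain SSE2. The sketch below uses that substitution to convert 16 ASCII hex digits to their numeric values. It is an illustration under assumptions, not the code from from_chars_x86.hpp; demo_select, demo_hex16_to_nibbles and DEMO_PARSE_HAS_SSE2 are hypothetical names, and a scalar fallback keeps it buildable on non-x86 targets.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define DEMO_PARSE_HAS_SSE2 1

// Bitwise select: takes bytes from `a` where `mask` is all-ones and from `b`
// where it is all-zeros. With such masks this matches what SSE4.1
// _mm_blendv_epi8(b, a, mask) would return.
inline __m128i demo_select(__m128i mask, __m128i a, __m128i b)
{
    return _mm_or_si128(_mm_and_si128(mask, a), _mm_andnot_si128(mask, b));
}
#else
#define DEMO_PARSE_HAS_SSE2 0
#endif

// Hypothetical helper: converts 16 ASCII hex digits to 16 values in [0, 15].
// Returns false if any character is not a hex digit; the contents of `out`
// are unspecified in that case.
inline bool demo_hex16_to_nibbles(const char* in, unsigned char* out)
{
    assert(in != nullptr && out != nullptr);
#if DEMO_PARSE_HAS_SSE2
    const __m128i chars = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));

    // Fold 'A'..'F' onto 'a'..'f' by setting bit 0x20; digits are unchanged.
    const __m128i lower = _mm_or_si128(chars, _mm_set1_epi8(0x20));

    // Range checks with signed compares. Bytes >= 0x80 are negative as signed
    // values, so they fail both checks and are reported as invalid.
    const __m128i is_digit = _mm_and_si128(
        _mm_cmpgt_epi8(chars, _mm_set1_epi8('0' - 1)),
        _mm_cmplt_epi8(chars, _mm_set1_epi8('9' + 1)));
    const __m128i is_alpha = _mm_and_si128(
        _mm_cmpgt_epi8(lower, _mm_set1_epi8('a' - 1)),
        _mm_cmplt_epi8(lower, _mm_set1_epi8('f' + 1)));

    // Compute both candidate values, then pick per byte without pblendvb.
    const __m128i digit_val = _mm_sub_epi8(chars, _mm_set1_epi8('0'));
    const __m128i alpha_val = _mm_sub_epi8(lower, _mm_set1_epi8('a' - 10));
    const __m128i values = demo_select(is_digit, digit_val, alpha_val);

    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), values);

    // All 16 bytes must be either a digit or a hex letter.
    const __m128i valid = _mm_or_si128(is_digit, is_alpha);
    return _mm_movemask_epi8(valid) == 0xFFFF;
#else
    // Portable fallback for targets without SSE2.
    bool ok = true;
    for (std::size_t i = 0; i < 16; ++i)
    {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        const unsigned char l = static_cast<unsigned char>(c | 0x20);
        if (c >= '0' && c <= '9')
            out[i] = static_cast<unsigned char>(c - '0');
        else if (l >= 'a' && l <= 'f')
            out[i] = static_cast<unsigned char>(l - 'a' + 10);
        else
        {
            out[i] = 0;
            ok = false;
        }
    }
    return ok;
#endif
}
```

The select costs three simple bitwise instructions instead of one blend, which is the kind of trade-off the SLOW_PBLENDVB/USE_PBLENDVB macros are described as controlling. Which variant wins on a given CPU is something to measure rather than assume.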
This allows for testing that the ISA-specific code at least compiles, even if running the tests isn't possible. The support is only added to b2, CMake still always compiles and runs the tests to keep using boost_test_jamfile for easier maintenance. In the future, similar support can be added to CMake as well.
The targets verify the respective code paths in SIMD algorithms. The recently added SSE2 paths are already tested in the other, unspecialized jobs. Also added jobs to compile tests with BOOST_UUID_TO_FROM_CHARS_X86_USE_ZMM and BOOST_UUID_FROM_CHARS_X86_USE_VPERMI2B experimental macros defined.
bdde014 to 3af17a4
Test source code: uuid_to_chars_perftest.cpp
Compile with:
Full results: uuid_to_chars_perftest.txt
Test source code: uuid_from_chars_perftest.cpp
Compile with:
Full results: uuid_from_chars_perftest.txt
In addition, the workarounds to avoid (v)pblendvb instructions have been extended to Intel Haswell and Broadwell, as these microarchitectures have poor performance with these instructions (including the SSE4.1 pblendvb).
Two new experimental control macros added: BOOST_UUID_FROM_CHARS_X86_SLOW_PBLENDVB and BOOST_UUID_FROM_CHARS_X86_USE_PBLENDVB. The former indicates that (v)pblendvb instructions are slow and should be avoided on the target microarchitectures. The latter indicates that (v)pblendvb should be used by the implementation. The latter macro is derived from the former and takes precedence. As before, these macros can be used for experimenting and fine tuning performance for specific target CPUs. By default, BOOST_UUID_FROM_CHARS_X86_SLOW_PBLENDVB is defined for Haswell/Broadwell or if AVX is detected.
Also, made selection between blend-based and shuffle-based character code conversion in various places unified, controlled by a single internal macro BOOST_UUID_DETAIL_FROM_CHARS_X86_USE_BLENDS.
NOTE: The tests used in this PR were modified, so the performance numbers presented here may not be comparable with the previous PRs.
As a side effect of this, CMakeLists.txt no longer uses boost_test_jamfile as it doesn't support custom logic for test type selection.
Also added jobs to compile tests with BOOST_UUID_TO_FROM_CHARS_X86_USE_ZMM and BOOST_UUID_FROM_CHARS_X86_USE_VPERMI2B experimental macros defined.