⚡ Thunderbolt: Softmax — Combine FMA for exp range reduction by bugparty · Pull Request #50 · bugparty/cpu_math_kernels_pri

bugparty · 2026-06-07T20:23:31Z

💡 What:
Combined the two _mm256_fnmadd_ps instructions handling range reduction (r = x - n * ln(2)) in the AVX2 exp function approximation into a single _mm256_fnmadd_ps using the combined ln(2) float constant (0.6931471805599453f). This is implemented in softmax_v6.

🎯 Why:
The previous implementation mathematically split ln(2) into two FMA operations to retain 53-bit precision and avoid catastrophic cancellation. For typical ML softmax kernels where the inputs are already pre-shifted by max_val ($x \leq 0$), this extreme precision splitting is unneeded and unnecessarily lengthens the critical FMA execution chain latency. Removing it improves overall throughput without sacrificing numerical tolerances (1e-4).

🏗️ How:

Copied exp256_ps_v2 to exp256_ps_v3.
Replaced the two-step FMA sequence with: __m256 r = _mm256_fnmadd_ps(n, _mm256_set1_ps(0.6931471805599453f), x);
Added softmax_v6 in softmax.h referencing the new exp logic.
Updated kernel_bench.cpp to register the new variant and test_naive_ops.cpp with a new, full unroll boundary correctness test suite for softmax_v6.

📊 Impact:
Measured throughput improvements on N=1048576 arrays via microbenchmarks (Fixed Memory):

Before (softmax_v5): 4.21 GFLOP/s
After (softmax_v6): 4.46 GFLOP/s
~6% Speedup in GFLOPs.

🖥️ Tested on:
x86-64 AVX2 target, compiled with GCC 13.

🔬 How to reproduce:

cd build && make -j$(nproc) ml_kernel_bench
DISABLE_CPU_BINDING=1 ./ml_kernels/ml_kernel_bench --iters 1000 --sizes 1048576

PR created automatically by Jules for task 17176520017703929985 started by @bugparty

Summary by CodeRabbit

New Features
- Implemented optimized softmax computation for enhanced performance.
Tests
- Added comprehensive tests for softmax validation with improved coverage.
Chores
- Added benchmarking infrastructure for softmax performance measurement.

Combined `ln(2)` splitting into a single FMA instruction in AVX2 Softmax `exp` range reduction to improve throughput by ~6% while retaining necessary precision. Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>

google-labs-jules · 2026-06-07T20:23:32Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

coderabbitai · 2026-06-07T20:23:43Z

📝 Walkthrough

Walkthrough

This PR introduces a new softmax kernel variant (softmax_v6) with an optimized AVX2 exp approximation (exp256_ps_v3) that uses single-FMA range reduction to improve instruction count and latency. The implementation includes vectorized max/sum reduction, 32-wide and 8-wide processing loops, test validation against the naive baseline, and performance benchmarking infrastructure.

Changes

Softmax V6 Kernel and Optimization

Layer / File(s)	Summary
exp256_ps_v3 optimization and documentation `.jules/thunderbolt.md`, `ml_kernels/include/ml_kernels/softmax.h`	Adds optimization notes documenting single-FMA range reduction for `exp256`, then implements `exp256_ps_v3` with clamping, log2(e) scaling, integer exponent conversion, combined constant FMA for `r = x - n*ln(2)`, polynomial evaluation, and `2^n` reconstruction via bit manipulation.
softmax_v6 vectorized implementation `ml_kernels/include/ml_kernels/softmax.h`	Introduces `softmax_v6` that performs vectorized max reduction (32-wide + scalar tail), computes `exp(input - max)` via `exp256_ps_v3` in 32-wide and 8-wide loops into output buffer, reduces vectorized sums to scalar, and normalizes across vector widths plus scalar remainder with early returns for `n==0` and zero-sum cases.
Test and benchmark integration `ml_kernels/src/kernel_bench.cpp`, `ml_kernels/src/test_naive_ops.cpp`	Adds `SoftmaxV6Benchmark` subclass to register performance benchmarking; defines `test_softmax_v6()` to validate softmax_v6 output against softmax_naive on 100-element inputs with `1e-4` closeness tolerance and unit-sum verification; wires test into `main()` execution flow.

Sequence Diagram

sequenceDiagram
  participant Caller
  participant softmax_v6
  participant ReduceMax as Reduce Max
  participant exp256_ps_v3
  participant ReduceSum as Reduce Sum
  participant Output
  Caller->>softmax_v6: call softmax_v6(input, output, n)
  softmax_v6->>ReduceMax: Vectorized loop (32-wide + 8-wide + scalar tail)
  ReduceMax->>softmax_v6: max value
  softmax_v6->>exp256_ps_v3: For each element: exp(x - max) vectorized
  exp256_ps_v3->>exp256_ps_v3: Range reduction, polynomial, scaling
  exp256_ps_v3->>Output: Write unnormalized exp(x - max)
  softmax_v6->>ReduceSum: Vectorized sum reduction (32-wide + 8-wide + scalar)
  ReduceSum->>softmax_v6: sum value
  softmax_v6->>Output: Normalize: y[i] *= 1.0/sum (vectorized + scalar)
  Output->>Caller: Softmax output

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

bugparty/cpu_math_kernels_pri#31: Continues the same AVX2 exp approximation and softmax-kernel progression from exp256_ps_v2/softmax_v5 to the new exp256_ps_v3/softmax_v6 by modifying the shared ml_kernels/softmax.h exp core and adding a corresponding new softmax variant, benchmark, and test.
bugparty/cpu_math_kernels_pri#7: Directly connected through test expectations, as the main PR's test_softmax_v6() validates against softmax_naive, the same baseline reference introduced in the retrieved PR's test coverage.

Poem

🐰 A faster softmax now soars,
With fused constants at its core,
V6 brings the AVX dance—
Exp speeds up at every glance,
Normalized dreams in SIMD's door! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Thunderbolt: Softmax — Combine FMA for exp range reduction' clearly and specifically describes the main change: combining two FMA operations into a single FMA for softmax exp range reduction, which matches the core optimization in the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch thunderbolt-softmax-fma-17176520017703929985

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Infer (1.2.0)

ml_kernels/src/test_naive_ops.cpp

ml_kernels/src/test_naive_ops.cpp:6:10: fatal error: 'ml_kernels/naive_ops.h' file not found
6 | #include "ml_kernels/naive_ops.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/518a4fcbf155c4571662f1b96671a65e01c65396-d463b5e2e9e167cc/tmp/clang_command_.tmp.af2c32.txt
++Contents of '/tmp/coderabbit-infer/518a4fcbf155c4571662f1b96671a65e01c65396-d463b5e2e9e167cc/tmp/clang_command_.tmp.af2c32.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-triple"
"x86_64-unknown-linux-gnu" "-emit

... [truncated 1112 characters] ...

l/lib/clang/18/include"
"-internal-isystem" "/usr/local/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-fdeprecated-macro" "-ferror-limit" "19" "-fgnuc-version=4.2.1"
"-fskip-odr-check-in-gmf" "-fcxx-exceptions" "-fexceptions"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/d463b5e2e9e167cc/file.o" "-x" "c++"
"ml_kernels/src/test_naive_ops.cpp" "-O0" "-fno-builtin" "-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"

ml_kernels/src/kernel_bench.cpp

ml_kernels/src/kernel_bench.cpp:14:10: fatal error: 'aligned_buffer.h' file not found
14 | #include "aligned_buffer.h"
| ^~~~~~~~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/518a4fcbf155c4571662f1b96671a65e01c65396-7c4391b9a04596fa/tmp/clang_command_.tmp.414186.txt
++Contents of '/tmp/coderabbit-infer/518a4fcbf155c4571662f1b96671a65e01c65396-7c4391b9a04596fa/tmp/clang_command_.tmp.414186.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-triple"
"x86_64-unknown-linux-gnu" "-emit-obj" "-mrelax-all"

... [truncated 1089 characters] ...

all/lib/clang/18/include"
"-internal-isystem" "/usr/local/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-fdeprecated-macro" "-ferror-limit" "19" "-fgnuc-version=4.2.1"
"-fskip-odr-check-in-gmf" "-fcxx-exceptions" "-fexceptions"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/7c4391b9a04596fa/file.o" "-x" "c++"
"ml_kernels/src/kernel_bench.cpp" "-O0" "-fno-builtin" "-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (3)

ml_kernels/src/test_naive_ops.cpp (1)
184-184: ⚡ Quick win

Use newline opening brace for test_softmax_v6 function body.

Please move the opening { to its own line for this newly added function.
Proposed formatting patch
-void test_softmax_v6() {
+void test_softmax_v6()
+{
As per coding guidelines, **/*.{c,cpp,cc,h,hpp} must “Keep braces on their own lines for function bodies.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` at line 184, The function declaration for
test_softmax_v6 places the opening brace on the same line; update the function
definition for test_softmax_v6 so the `{` is moved to its own line (i.e., use a
newline before the function body brace) to comply with the project's brace style
for function bodies.
Source: Coding guidelines
ml_kernels/src/kernel_bench.cpp (1)
335-343: ⚡ Quick win

Apply function brace placement rule in SoftmaxV6Benchmark methods.

The new method definitions use same-line {. Please switch to next-line opening braces to match project C/C++ style.
Proposed formatting patch
-    const char *name() const override { return "softmax_v6"; }
+    const char *name() const override
+    {
+        return "softmax_v6";
+    }
@@
-    void run() override {
+    void run() override
+    {
         ml_kernels::softmax_v6(inputs_[current_idx_].data(), outputs_[current_idx_].data(), inputs_[0].size());
         current_idx_ = (current_idx_ + 1) % pool_size_;
     }
As per coding guidelines, **/*.{c,cpp,cc,h,hpp} must “Keep braces on their own lines for function bodies.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/kernel_bench.cpp` around lines 335 - 343, The brace placement
for SoftmaxV6Benchmark's method definitions violates project style: move the
opening braces for name() and run() to their own lines; update the class
SoftmaxV6Benchmark so name() const override and run() override have the function
body opening brace on the next line (i.e., change "const char *name() const
override { ... }" and "void run() override { ... }" to use newline before the
"{"), preserving the existing bodies and current_idx_/pool_size_ logic.
Source: Coding guidelines
ml_kernels/include/ml_kernels/softmax.h (1)
504-530: ⚡ Quick win

Move function opening braces to their own lines in new softmax_v6 additions.

exp256_ps_v3 and softmax_v6 currently keep { on the declaration line. Please align these new function bodies with the repo’s brace rule.
Proposed formatting patch
-inline __m256 exp256_ps_v3(__m256 x) {
+inline __m256 exp256_ps_v3(__m256 x)
+{
@@
-inline void softmax_v6(const float *input, float *output, std::size_t n) {
+inline void softmax_v6(const float *input, float *output, std::size_t n)
+{
As per coding guidelines, **/*.{c,cpp,cc,h,hpp} must “Keep braces on their own lines for function bodies.”

Also applies to: 536-632
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/include/ml_kernels/softmax.h` around lines 504 - 530, The function
definitions exp256_ps_v3 and softmax_v6 violate the repo brace style by keeping
the opening brace on the same line as the declaration; update each function
declaration so the `{` is moved to its own line (i.e., change "inline __m256
exp256_ps_v3(__m256 x) {" to have the brace on the following line, and do the
same for softmax_v6) to comply with the "Keep braces on their own lines for
function bodies" rule; ensure indentation for the brace and body matches
surrounding code style.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Around line 504-530: The function definitions exp256_ps_v3 and softmax_v6
violate the repo brace style by keeping the opening brace on the same line as
the declaration; update each function declaration so the `{` is moved to its own
line (i.e., change "inline __m256 exp256_ps_v3(__m256 x) {" to have the brace on
the following line, and do the same for softmax_v6) to comply with the "Keep
braces on their own lines for function bodies" rule; ensure indentation for the
brace and body matches surrounding code style.

In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 335-343: The brace placement for SoftmaxV6Benchmark's method
definitions violates project style: move the opening braces for name() and run()
to their own lines; update the class SoftmaxV6Benchmark so name() const override
and run() override have the function body opening brace on the next line (i.e.,
change "const char *name() const override { ... }" and "void run() override {
... }" to use newline before the "{"), preserving the existing bodies and
current_idx_/pool_size_ logic.

In `@ml_kernels/src/test_naive_ops.cpp`:
- Line 184: The function declaration for test_softmax_v6 places the opening
brace on the same line; update the function definition for test_softmax_v6 so
the `{` is moved to its own line (i.e., use a newline before the function body
brace) to comply with the project's brace style for function bodies.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 95645985-e70d-4f6e-b021-8ca69687bcd8

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and 518a4fc.

📒 Files selected for processing (4)

.jules/thunderbolt.md
ml_kernels/include/ml_kernels/softmax.h
ml_kernels/src/kernel_bench.cpp
ml_kernels/src/test_naive_ops.cpp

⚡ Thunderbolt: AVX2 Softmax FMA Optimization

518a4fc

Combined `ln(2)` splitting into a single FMA instruction in AVX2 Softmax `exp` range reduction to improve throughput by ~6% while retaining necessary precision. Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>

coderabbitai Bot reviewed Jun 7, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

⚡ Thunderbolt: Softmax — Combine FMA for exp range reduction#50

⚡ Thunderbolt: Softmax — Combine FMA for exp range reduction#50
bugparty wants to merge 1 commit into
mainfrom
thunderbolt-softmax-fma-17176520017703929985

bugparty commented Jun 7, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

google-labs-jules Bot commented Jun 7, 2026

Uh oh!

coderabbitai Bot commented Jun 7, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bugparty commented Jun 7, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

google-labs-jules Bot commented Jun 7, 2026

Uh oh!

coderabbitai Bot commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bugparty commented Jun 7, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 7, 2026 •

edited

Loading