
linalg/x86_64: AVX-512_FP16 native f16 hardswish kernel (stacked on #2310) #2313

Open
czoli1976 wants to merge 3 commits into sonos:main from czoli1976:feat/avx512fp16-native-f16


@czoli1976
Contributor

Summary

Stacked on #2310. Adds a native AVX-512_FP16 path for the f16 hardswish kernel on Sapphire Rapids / Granite Rapids / later Intel parts. Computes f16 directly in zmm registers (32 lanes per zmm) using vaddph / vminph / vmaxph / vmulph — no vcvtph2ps / vcvtps2ph round-trip, no f32 scratch buffer.
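
As a scalar reference for what each lane computes, here is a minimal sketch of the standard hardswish definition, x * clamp(x + 3, 0, 6) / 6, with each step annotated with the instruction it would plausibly map to. It uses f32 for portability, since stable Rust has no f16 type. The function name `hardswish_ref` and the op ordering are assumptions for illustration, not the kernel's actual code.

```rust
/// Scalar hardswish: x * clamp(x + 3, 0, 6) / 6.
/// Hypothetical helper for illustration; the PR's kernel works on 32 f16 lanes per zmm.
fn hardswish_ref(x: f32) -> f32 {
    let t = x + 3.0;    // vaddph: shift by 3
    let t = t.max(0.0); // vmaxph: clamp below at 0
    let t = t.min(6.0); // vminph: clamp above at 6
    x * t * (1.0 / 6.0) // vmulph, twice: multiply by x, then by the constant 1/6
}
```

Calling `hardswish_ref` on a handful of values on either side of -3 and +3 is a quick way to see the two clamps take effect.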

Plugged via a new plug_avx512fp16(ops) step that runs after plug_avx512f on hosts where is_x86_feature_detected!("avx512fp16") is true. Pre-FP16 AVX-512 hosts (Skylake-X, Cascade Lake, Ice Lake server prior to the fp16 extension) keep using the existing f32-roundtrip hardswish_f16 kernel from #2310's act_f16.rs unchanged.
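
The plug order described above can be sketched as follows. This is a reduced, hypothetical model: the `Kernel` enum, the two-slot `Ops` struct and the `CpuFeatures` struct are stand-ins invented for illustration, and nesting the fp16 check inside the avx512f check is an assumption. In real code the flags would come from `is_x86_feature_detected!` on x86_64.

```rust
/// Which implementation currently occupies a kernel slot (labels are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kernel {
    Generic,
    Avx512F32Roundtrip,
    Avx512Fp16Native,
}

/// Hypothetical two-slot stand-in for the kernel table; the real `Ops` has many more slots.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Ops {
    hardswish_f16: Kernel,
    leaky_relu_f16: Kernel,
}

/// CPU capabilities, passed in explicitly so the sketch runs on any host.
#[derive(Clone, Copy)]
struct CpuFeatures {
    avx512f: bool,
    avx512fp16: bool,
}

fn plug_avx512f(ops: &mut Ops) {
    ops.hardswish_f16 = Kernel::Avx512F32Roundtrip;
    ops.leaky_relu_f16 = Kernel::Avx512F32Roundtrip;
}

fn plug_avx512fp16(ops: &mut Ops) {
    // Only hardswish is overridden; leaky_relu keeps the f32-roundtrip kernel.
    ops.hardswish_f16 = Kernel::Avx512Fp16Native;
}

fn plug(cpu: CpuFeatures) -> Ops {
    let mut ops = Ops { hardswish_f16: Kernel::Generic, leaky_relu_f16: Kernel::Generic };
    if cpu.avx512f {
        plug_avx512f(&mut ops);
        // Runs after plug_avx512f so it can overwrite the slot set just above.
        if cpu.avx512fp16 {
            plug_avx512fp16(&mut ops);
        }
    }
    ops
}
```

Because the later step only overwrites one slot, a host without avx512fp16 ends up with exactly the table that `plug_avx512f` produced.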

What's in this PR

hardswish_f16_128n is the only kernel wired. A native leaky_relu_f16_128n is also included but NOT wired — on Sapphire Rapids it benched 38% SLOWER than #2310's f32-roundtrip version (5.85 vs 9.44 Gelem/s, n=1024, single thread). The two-op-per-element compute path (vmulph + vmaxph) appears not to saturate Sapphire Rapids' FP16 execution port the same way the equivalent f32 ops saturate the FP32 ports. The kernel is correct (4/4 frame tests pass, including proptest against the f16 reference); kept in source for future revisit on a different fp16 µarch (Granite Rapids etc.) where the comparison might flip.
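
For reference, the two-op compute path mentioned above corresponds to the usual multiply-then-max formulation of leaky ReLU. The sketch below is a scalar f32 illustration under the assumption that 0 <= alpha <= 1; the name `leaky_relu_ref` is hypothetical, and how the kernel broadcasts alpha is not shown in this PR text.

```rust
/// Leaky ReLU as multiply + max. Valid for 0 <= alpha <= 1, where
/// max(x, alpha * x) selects x for x >= 0 and alpha * x for x < 0.
fn leaky_relu_ref(x: f32, alpha: f32) -> f32 {
    let scaled = alpha * x; // vmulph
    x.max(scaled)           // vmaxph
}
```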

Other f16 activations (sigmoid, tanh, silu, gelu) are not ported — they require polynomial approximations whose precision in native f16 (11-bit mantissa vs 24-bit f32) needs separate validation. Out of scope for this PR.
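
To give a feel for the precision gap, the sketch below rounds an f32 value to an 11-bit significand with round-to-nearest-even, which is the rounding each native f16 operation applies to its result. It is an illustration only: it ignores the f16 exponent range, subnormals, NaN and infinity, and both function names are hypothetical.

```rust
/// Round an f32 to an 11-bit significand (10 explicit mantissa bits, like f16),
/// using round-to-nearest-even. Exponent range and special values are ignored.
fn round_to_f16_precision(x: f32) -> f32 {
    let bits = x.to_bits();
    let lsb = (bits >> 13) & 1; // lowest mantissa bit that survives
    // Adding 0x0FFF + lsb makes exact ties round towards the even neighbour.
    let rounded = bits.wrapping_add(0x0FFF + lsb);
    f32::from_bits(rounded & !0x1FFF) // drop the 13 low mantissa bits
}

/// Relative error introduced by the rounding above (x must be non-zero).
fn relative_error(x: f32) -> f32 {
    ((round_to_f16_precision(x) - x) / x).abs()
}
```

The relative error of one such rounding is bounded by 2^-11 (about 4.9e-4), against 2^-24 (about 6e-8) for f32, so errors in polynomial coefficients and intermediate terms are several thousand times larger per operation.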

Bench (Sapphire Rapids, n=1024, single thread, Criterion)

| op | generic | avx512_f32roundtrip (#2310) | avx512fp16_native (this PR) | vs #2310 |
|---|---|---|---|---|
| hardswish_f16 | 52.3 Melem/s | 8.71 Gelem/s | 31.6 Gelem/s | 3.62× |
| leaky_relu_f16 (NOT wired) | 778 Melem/s | 9.44 Gelem/s | 5.85 Gelem/s | 0.62× (regression) |

Test plan

  • cargo test --release -p tract-linalg --lib act_f16_fp16 — 4 passed, 0 failed (2 ops × 2 cases: trivial + proptest against scalar f16 reference)
  • cargo test --release -p tract-linalg — 2845 passed, 0 failed
  • cargo bench --bench activations_avx512_fp16 — numbers above
  • Non-FP16 AVX-512 hosts unchanged (plug_avx512fp16 only overrides one slot; the f32-roundtrip path from #2310 runs unchanged on Skylake-X / Cascade Lake / Ice Lake)
  • Cross-arch builds clean (aarch64-unknown-linux-gnu, wasm32-unknown-unknown): the new act_f16_fp16.rs and plug_avx512fp16 are walled off by #[cfg(target_arch = "x86_64")] on the x86_64_fma module in linalg/src/lib.rs.

Dependencies

Stacked on #2310 (AVX-512 f16 element-wise activations), itself stacked on #2304 (AVX-512 f32 element-wise activations). The native fp16 kernel only fires on top of the f32-roundtrip path from #2310: plug_avx512fp16 overrides the one hardswish_f16 slot that plug_avx512f set in #2310. Once #2304 and #2310 merge, GitHub auto-collapses this PR's view to show only this commit.

Validation environment

Sapphire Rapids cloud container (Intel Xeon Scalable, AVX-512_FP16 capable; same Ubuntu 24.04 / rustc 1.94.1 toolchain as #2303-#2311's Cascade Lake host).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Generated by Claude Code

@czoli1976
Contributor Author

Hi @kali — first of all, apologies for the tsunami of AVX-512 PRs. I know seven PRs landing more or less at once is a lot to put on your desk, and the last thing I want is to make the review experience painful. The document below is meant to make picking through them as low-friction as possible: it has the per-PR details, a model-coverage table, end-to-end + concurrency benchmarks, and a suggested landing order with honest caveats.

The short version: validated all PRs end-to-end — they all land on a working build, pass the existing test suites, are runtime-gated on is_x86_feature_detected!("avx512f") so non-AVX-512 hosts (ARM, WASM, older x86) are bit-for-bit unchanged, and they all provide measurable wins at the kernel level. End-to-end, the picture is uneven: some PRs (the f32 activations and softmax/max) carry most of the wall-clock benefit on the ASR + LLM models we tested, while others (erf, RmsNorm, f16 softmax, f16 activations) are kernel-correct + zero-risk but exercise paths that aren't hot at the model sizes we have access to in tract-ci-builds. The detailed table makes it easy to pick which subset is worth landing now vs. holding for a workload that exercises them more.

There is no rush on any of these. Take whichever you like, or none at all if you think AVX is not useful here (or if some of them touch areas too small to be worth speeding up).

Happy to iterate on any specific PR or reorganise the set if any of this doesn't fit the project's review style.

— Ckristian Zoli (@czoli1976) - Perf Maniac

PS: thanks for letting me contribute to TRACT, it has been a lot of fun!

avx512-prs-review.html

czoli1976 and others added 3 commits May 29, 2026 09:12
De-orphan and fix the latent zmm sigmoid/tanh kernels (tail-loop stride bugs
causing OOB stores for lengths not a multiple of 64), and add AVX-512
hardswish, leaky_relu, plus silu/gelu as compositions over the AVX-512
sigmoid/tanh. Runtime-gated on avx512f; non-AVX512 x86 keeps the FMA/generic
path.
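
A minimal sketch of the safe shape for this kind of loop, assuming a main loop that consumes fixed-size blocks and a scalar tail for the remainder. The block size of 64 follows the multiple quoted above; the function name and the closure-based op are hypothetical, and the real kernels are written with zmm instructions rather than slice iterators.

```rust
/// Elements handled per main-loop iteration (hypothetical, after the multiple of 64 above).
const BLOCK: usize = 64;

/// Apply `op` to every element: full blocks first, then the tail one element at a time.
fn apply_blocked(data: &mut [f32], op: impl Fn(f32) -> f32) {
    let mut blocks = data.chunks_exact_mut(BLOCK);
    for block in &mut blocks {
        // A SIMD kernel would process the whole block here.
        for v in block.iter_mut() {
            *v = op(*v);
        }
    }
    // The remainder has fewer than BLOCK elements and its own length bounds
    // the loop, so no store can land past the end of `data`.
    for v in blocks.into_remainder() {
        *v = op(*v);
    }
}
```

Testing such a routine at lengths like 0, 1, 63, 64, 65 and 130 exercises the empty, tail-only, exact-block and block-plus-tail cases.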

Measured on Cascade Lake (single-thread): sigmoid 1.24x and tanh 1.29x over
the existing FMA paths; hardswish/leaky_relu/silu/gelu 5-21x over the generic
scalar paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add six f16 element-wise activations on x86 AVX-512: sigmoid_f16, tanh_f16,
hardswish_f16, leaky_relu_f16, silu_f16, gelu_f16. Each kernel chunks the
input through a 64-byte-aligned f32 scratch (CHUNK=256), dispatches to the
matching f32 AVX-512 kernel (the avx512_sigmoid_f32 / avx512_tanh_f32
wrappers, or the act:: hardswish / leaky_relu / silu / gelu kernels), and
converts back to f16. silu and gelu compose sigmoid_f32 / tanh_f32 with the
final combine done in f32.
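
The chunked round-trip described above can be sketched generically. This is an illustration under stated assumptions: the storage type `T` and the `widen` / `narrow` closures stand in for f16 and the conversion helpers, the function name is hypothetical, and the 64-byte alignment of the scratch buffer is not reproduced.

```rust
/// Scratch size in elements, matching the CHUNK=256 quoted above.
const CHUNK: usize = 256;

/// Run an f32 kernel over narrower storage by staging each chunk in an f32 scratch buffer.
/// `widen` / `narrow` stand in for the f16 -> f32 and f32 -> f16 conversion helpers.
fn via_f32_scratch<T: Copy>(
    data: &mut [T],
    widen: impl Fn(T) -> f32,
    narrow: impl Fn(f32) -> T,
    kernel_f32: impl Fn(&mut [f32]),
) {
    let mut scratch = [0.0f32; CHUNK];
    for chunk in data.chunks_mut(CHUNK) {
        // The last chunk may be shorter than CHUNK; only use that much scratch.
        let buf = &mut scratch[..chunk.len()];
        for (dst, src) in buf.iter_mut().zip(chunk.iter()) {
            *dst = widen(*src); // storage -> f32
        }
        kernel_f32(&mut *buf); // existing f32 kernel runs in place
        for (dst, src) in chunk.iter_mut().zip(buf.iter()) {
            *dst = narrow(*src); // f32 -> storage
        }
    }
}
```

The scratch buffer has a fixed size, so the wrapper needs no allocation regardless of the input length.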

The f16 <-> f32 conversion is driven by vcvtph2ps / vcvtps2ph via std::arch
intrinsics (cvt_f16_to_f32 / cvt_f32_to_f16 helpers); rustc + LLVM do not
autovectorize the scalar f16::to_f32 / f16::from_f32 loops, which is why a
naive port leaves AVX-512 stuck at ~7 Melem/s.

Wires into Ops::{sigmoid,tanh,hardswish,leaky_relu,silu,gelu}_f16 from
plug_avx512f; non-AVX512 x86 keeps the generic scalar f16 kernels. Validated
against the generic H<Op>8 reference via the existing *_frame_tests! macros
at SuperApproximate tolerance, which covers the precision delta between
scalar f16 arithmetic and f32-internal computation.

Measured on Cascade Lake (single-thread, throughput Gelem/s):
  - sigmoid_f16:    0.016 -> 1.54   (96x)
  - tanh_f16:       0.018 -> 1.61   (92x)
  - hardswish_f16:  0.051 -> 9.46   (186x)
  - leaky_relu_f16: 0.96  -> 10.4   (11x; generic baseline is unexpectedly fast)
  - silu_f16:       0.20  -> 0.93   (4.6x)
  - gelu_f16:       0.11  -> 0.75   (6.7x)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a native f16 hardswish kernel using avx512fp16 ISA (Sapphire Rapids /
Granite Rapids / later Intel). 128 f16 lanes per iteration via 4 zmm of 32 f16
each, processed with vaddph / vminph / vmaxph / vmulph — no f32 round-trip,
no vcvtph2ps/vcvtps2ph at the IO boundary.

Wired through a new `plug_avx512fp16` step that runs after `plug_avx512f` on
hosts where `is_x86_feature_detected!("avx512fp16")` is true. The f32-roundtrip
hardswish_f16 kernel from `act_f16.rs` remains in place as the avx512f-only
fallback (Skylake-X, Cascade Lake, Ice Lake server prior to fp16 extension).

Bench on Sapphire Rapids (n=1024, single thread, Criterion):
  hardswish_f16:
    generic              52.3 Melem/s
    avx512_f32roundtrip   8.71 Gelem/s   (current #2310 path)
    avx512fp16_native    31.6 Gelem/s   (this PR, 3.62× over the roundtrip)

A native leaky_relu_f16 kernel is also included but NOT wired — on Sapphire
Rapids it benched 38% slower than the f32-roundtrip version (5.85 vs 9.44
Gelem/s). The two-op-per-element compute path (vmulph + vmaxph) does not
saturate the FP16 execution port the same way the equivalent f32 ops saturate
the FP32 ports. Kernel is correct (4/4 frame tests pass, including proptest
against the f16 reference); kept in the source for future revisit on different
fp16 uarchs where the comparison might flip.

Tests: linalg 2845 passed, 0 failed (+4 new frame tests). Cross-arch
`cargo check` clean on aarch64-unknown-linux-gnu and wasm32-unknown-unknown
(plug_avx512fp16 is x86_64-only and feature-gated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
czoli1976 force-pushed the feat/avx512fp16-native-f16 branch from c1bd5bc to d40fd79 on May 29, 2026 08:13