Channel-parallel float dot for high channel counts (perf C6) #19
Merged
Frame-major history + register-blocked 8/4/2/1 channel tiles when channels >= 4, floating-point samples, host targets only (compile-time gate; every embedded ratchet scenario verified 0.00%). Bit-exact vs planar (per-channel tap order unchanged; hash-verified over 30k blocks x 4 configs). Same-minute A/B: float 8/12/16ch -38/-38/-42% wall-clock with AVX2+FMA, -4 to -5% on baseline SSE2. Fixed-point measured ~1.5x slower channel-parallel and keeps planar (taps-axis auto-vectorization already optimal); negative result recorded in PERFORMANCE.md along with the two implementation traps (memory accumulators; runtime gate in hot loops). https://claude.ai/code/session_01HuAFfoeD5a5Xe5aGNA16M9
Roadmap item C6 — channel-parallel dot products for the high-channel-count deployments (12ch 7.1.4, 16ch AVB+mics), profile-driven and ending with one big win, one negative result, and two recorded traps.
Profile first (callgrind, 12ch Q15 host)
Per-channel dot MACs ≈ 85% of the instruction stream; deinterleave/transport ~2% (hypothesis "scatter cost matters" falsified). The dots were the target.
The change
History stores frame-major (interleaved) instead of planar when channels ≥ 4 (SRT_CP_MIN_CHANNELS), samples are floating-point, and the target is a host. Each output frame then computes all channels' dots in register-blocked tiles of 8/4/2/1 channels: one accumulator lane per channel, coefficient broadcast, contiguous loads. Bit-exact by construction: each channel's accumulation order over taps is unchanged (lanes are channels, not taps), and the output is hash-verified identical to planar over 30k blocks × {float, Q15} × {12, 16}ch.
This is the first optimization that lets the float path vectorize at all: tap-axis SIMD is forbidden by the strict double-accumulation contract (deferred hypothesis 5), but the channel axis was always free.
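The tiling described above can be sketched as follows. This is a minimal illustration, not the project's code: the function names, signatures, and the exact history indexing (`hist[t * channels + c]`) are assumptions, and it assumes float samples accumulated in double. A planar reference is included so the two layouts can be compared.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative only: names, signatures and indexing are assumptions,
// not the project's actual API.

// Dot products for `Tile` adjacent channels starting at channel c0.
// hist is frame-major: channel c at tap t lives at hist[t * channels + c].
template <std::size_t Tile>
inline void dot_tile(const float* hist, const float* coeff, std::size_t taps,
                     std::size_t channels, std::size_t c0, float* out) {
    double acc[Tile] = {};  // one accumulator lane per channel
    for (std::size_t t = 0; t < taps; ++t) {
        const double k = static_cast<double>(coeff[t]);  // broadcast to every lane
        const float* frame = hist + t * channels + c0;   // contiguous samples
        for (std::size_t j = 0; j < Tile; ++j)
            acc[j] += k * static_cast<double>(frame[j]);
    }
    for (std::size_t j = 0; j < Tile; ++j)
        out[c0 + j] = static_cast<float>(acc[j]);
}

// Covers all channels with tiles of 8, then at most one tile each of 4, 2, 1.
inline void dot_frame_major(const float* hist, const float* coeff,
                            std::size_t taps, std::size_t channels, float* out) {
    assert(hist != nullptr && coeff != nullptr && out != nullptr);
    std::size_t c = 0;
    for (; c + 8 <= channels; c += 8) dot_tile<8>(hist, coeff, taps, channels, c, out);
    if (c + 4 <= channels) { dot_tile<4>(hist, coeff, taps, channels, c, out); c += 4; }
    if (c + 2 <= channels) { dot_tile<2>(hist, coeff, taps, channels, c, out); c += 2; }
    if (c < channels) dot_tile<1>(hist, coeff, taps, channels, c, out);
}

// Planar reference: channel c's history is planar[c * taps + t].
inline void dot_planar(const float* planar, const float* coeff,
                       std::size_t taps, std::size_t channels, float* out) {
    for (std::size_t c = 0; c < channels; ++c) {
        double acc = 0.0;
        for (std::size_t t = 0; t < taps; ++t)
            acc += static_cast<double>(coeff[t]) * static_cast<double>(planar[c * taps + t]);
        out[c] = static_cast<float>(acc);
    }
}
```

In both functions each channel adds its taps in the same order t = 0, 1, ..., so the results should match bit for bit under default floating-point settings; only the memory layout and the loop nesting differ. The inner loop over `j` has a compile-time trip count, which is what gives the compiler a chance to keep the accumulators in vector registers.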
Measured (same-minute A/B)
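Float path, wall-clock change vs planar, by build flags:
-march=native (AVX2+FMA): 8ch -38%, 12ch -38%, 16ch -42%
-O2 (SSE2): -4% to -5%
Gains scale with SIMD width: header-only consumers building with AVX2 get the 1.6–1.7×.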
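As a check on how the quoted wall-clock reductions relate to the 1.6–1.7× figure, a reduction of $r$ in run time corresponds to a speedup of $1/(1-r)$. Using the percentages quoted in this PR (-38%, -42%, and -4% to -5%):

```latex
\[
\text{speedup} = \frac{T_{\text{planar}}}{T_{\text{channel-parallel}}} = \frac{1}{1 - r}
\]
\[
r = 0.38 \;\Rightarrow\; \frac{1}{0.62} \approx 1.61, \qquad
r = 0.42 \;\Rightarrow\; \frac{1}{0.58} \approx 1.72
\]
\[
r = 0.04 \;\Rightarrow\; \frac{1}{0.96} \approx 1.04, \qquad
r = 0.05 \;\Rightarrow\; \frac{1}{0.95} \approx 1.05
\]
```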
Negative result: fixed-point keeps planar
Channel-parallel Q15 measured ~1.5× slower than planar on hosts: planar Q15 already auto-vectorizes over the tap axis (integer reduction is exactly reassociable — C2's finding), and that axis beats the channel axis for int64-exact accumulation on x86. Recorded in PERFORMANCE.md; embedded channel-parallel (HVX 16×int64-lane, Helium) stays a follow-up candidate only if DSP budgets demand it.
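The reassociation argument can be illustrated with a small sketch. The helper names below are hypothetical and this is not the project's Q15 kernel; it only shows why a compiler is free to split an integer tap sum into partial sums (the shape of tap-axis vectorization) without changing the result, assuming the int64 accumulator does not overflow.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration; not the project's API.

// Q15 x Q15 dot, accumulated strictly in tap order.
inline std::int64_t q15_dot_sequential(const std::int16_t* x, const std::int16_t* k,
                                       std::size_t taps) {
    std::int64_t acc = 0;
    for (std::size_t t = 0; t < taps; ++t)
        acc += static_cast<std::int32_t>(x[t]) * static_cast<std::int32_t>(k[t]);
    return acc;
}

// Same sum regrouped into four partial sums, as a tap-axis vectorizer would do.
inline std::int64_t q15_dot_four_lanes(const std::int16_t* x, const std::int16_t* k,
                                       std::size_t taps) {
    std::int64_t lane[4] = {0, 0, 0, 0};
    std::size_t t = 0;
    for (; t + 4 <= taps; t += 4)
        for (std::size_t j = 0; j < 4; ++j)
            lane[j] += static_cast<std::int32_t>(x[t + j]) * static_cast<std::int32_t>(k[t + j]);
    std::int64_t acc = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for (; t < taps; ++t)  // leftover taps
        acc += static_cast<std::int32_t>(x[t]) * static_cast<std::int32_t>(k[t]);
    return acc;
}
```

Both functions return the same value for any input because integer addition is associative. The equivalent regrouping of a floating-point sum can change the rounding, which is why the float path cannot use the tap axis under a strict accumulation-order contract.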
Two traps, recorded for the next person
Memory accumulators, and a runtime gate in hot loops. Both are written up in PERFORMANCE.md.
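One of the traps named in the PR summary is a runtime gate in hot loops. A minimal sketch of the alternative, deciding the layout once from compile-time information plus the channel count, might look like the following (C++17). Only `SRT_CP_MIN_CHANNELS` comes from this PR; its default value here is inferred from "channels ≥ 4", and `EXAMPLE_HOST_TARGET`, `Layout`, and `choose_layout` are hypothetical names invented for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

#ifndef SRT_CP_MIN_CHANNELS
#define SRT_CP_MIN_CHANNELS 4  // threshold named in the PR; default assumed here
#endif

// Hypothetical host-target switch; the project's real macro is not shown in the PR.
#ifndef EXAMPLE_HOST_TARGET
#define EXAMPLE_HOST_TARGET 1
#endif

enum class Layout { Planar, FrameMajor };

// Called once per stream (for example at construction), never per tap or per frame.
// Sample type and target are resolved at compile time, so non-float or embedded
// builds never contain the frame-major branch at all.
template <typename Sample>
constexpr Layout choose_layout(std::size_t channels) {
    if constexpr (EXAMPLE_HOST_TARGET != 0 && std::is_floating_point_v<Sample>) {
        return channels >= SRT_CP_MIN_CHANNELS ? Layout::FrameMajor : Layout::Planar;
    } else {
        (void)channels;  // unused on this path
        return Layout::Planar;
    }
}
```

The other trap, accumulators that live in memory, is the reason a kernel of this kind would normally keep its per-channel accumulators in a small fixed-size local array rather than writing partial sums back to an output buffer on every tap.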
Campaign status: C1–C6 complete. README PERF table regenerated (same-session run).
https://claude.ai/code/session_01HuAFfoeD5a5Xe5aGNA16M9
Generated by Claude Code