Channel-parallel float dot for high channel counts (perf C6) by tap · Pull Request #19 · tap/SampleRateTap

tap · 2026-06-12T17:21:15Z

Roadmap item C6 — channel-parallel dot products for the high-channel-count deployments (12ch 7.1.4, 16ch AVB+mics), profile-driven and ending with one big win, one negative result, and two recorded traps.

Profile first (callgrind, 12ch Q15 host)

Per-channel dot MACs ≈ 85% of the instruction stream; deinterleave/transport ~2% (hypothesis "scatter cost matters" falsified). The dots were the target.

The change

History is stored frame-major (interleaved) instead of planar when all three conditions hold: channels ≥ 4 (`SRT_CP_MIN_CHANNELS`), floating-point samples, and a host target. Each output frame then computes all channels' dots in register-blocked tiles of 8/4/2/1 channels: one accumulator lane per channel, coefficient broadcast, contiguous loads. Bit-exact by construction: each channel's accumulation order over taps is unchanged (lanes are channels, not taps). Hash-verified identical to planar over 30k blocks × {float, Q15} × {12, 16}ch.

This is the first optimization that lets the float path vectorize at all: tap-axis SIMD is forbidden by the strict double-accumulation contract (deferred hypothesis 5), but the channel axis was always free.
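The layout and tiling described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function names, signatures, and the indexing convention (`hist[t * channels + ch]` for frame-major) are assumptions made for the example. It shows one double accumulator per channel, a broadcast coefficient, contiguous loads across the channel axis, and a planar reference that accumulates each channel over taps in the same order.

```cpp
#include <cstddef>

// Illustrative sketch only; names and signatures are hypothetical.
// Frame-major history: the sample for (tap t, channel ch) is hist[t * channels + ch].
template <std::size_t Tile>
inline void dot_tile(const float* hist, const float* coeffs, std::size_t taps,
                     std::size_t channels, std::size_t ch0, float* out) {
    double acc[Tile] = {};  // one accumulator lane per channel, compile-time size
    for (std::size_t t = 0; t < taps; ++t) {
        const double c = static_cast<double>(coeffs[t]);   // coefficient broadcast
        const float* frame = hist + t * channels + ch0;    // contiguous across channels
        for (std::size_t k = 0; k < Tile; ++k)
            acc[k] += c * static_cast<double>(frame[k]);   // per-channel tap order unchanged
    }
    for (std::size_t k = 0; k < Tile; ++k)
        out[ch0 + k] = static_cast<float>(acc[k]);
}

// Cover all channels with tiles of 8, then at most one tile each of 4, 2, and 1.
inline void dot_frame_major(const float* hist, const float* coeffs, std::size_t taps,
                            std::size_t channels, float* out) {
    std::size_t ch = 0;
    for (; ch + 8 <= channels; ch += 8) dot_tile<8>(hist, coeffs, taps, channels, ch, out);
    if (ch + 4 <= channels) { dot_tile<4>(hist, coeffs, taps, channels, ch, out); ch += 4; }
    if (ch + 2 <= channels) { dot_tile<2>(hist, coeffs, taps, channels, ch, out); ch += 2; }
    if (ch < channels) dot_tile<1>(hist, coeffs, taps, channels, ch, out);
}

// Planar reference: the sample for (channel ch, tap t) is hist[ch * taps + t].
inline void dot_planar(const float* hist, const float* coeffs, std::size_t taps,
                       std::size_t channels, float* out) {
    for (std::size_t ch = 0; ch < channels; ++ch) {
        double acc = 0.0;
        for (std::size_t t = 0; t < taps; ++t)
            acc += static_cast<double>(coeffs[t]) * static_cast<double>(hist[ch * taps + t]);
        out[ch] = static_cast<float>(acc);
    }
}
```

Because each lane sees the same products in the same tap order as the planar loop, the two functions should agree bit for bit on the same data. The compiler is free to vectorize the inner `k` loop without reordering any single channel's sum.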

Measured (same-minute A/B)

| Config | Δ wall-clock |
| --- | --- |
| float 8ch, `-march=native` (AVX2+FMA) | −41% |
| float 12ch, native | −38% |
| float 16ch, native | −42% |
| float 8/12/16ch, baseline `-O2` (SSE2) | −4–5% |
| float/Q15 2ch (gate-off controls) | unchanged |
| all 14 embedded ratchet scenarios (M33/M55 measured locally, Hexagon by this PR's CI) | 0.00% |

Gains scale with SIMD width: header-only consumers building with AVX2 get the full 1.6–1.7× speedup, while baseline SSE2 builds see only a few percent.
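For reference, a wall-clock change Δ converts to a speedup factor as shown below. The arithmetic uses the smallest and largest AVX2 reductions reported in this PR (−38% and −42%).

$$
\text{speedup} = \frac{t_\text{old}}{t_\text{new}} = \frac{1}{1 + \Delta},
\qquad
\Delta = -0.38 \Rightarrow \frac{1}{0.62} \approx 1.61,
\qquad
\Delta = -0.42 \Rightarrow \frac{1}{0.58} \approx 1.72
$$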

Negative result: fixed-point keeps planar

Channel-parallel Q15 measured ~1.5× slower than planar on hosts: planar Q15 already auto-vectorizes over the tap axis (integer reduction is exactly reassociable — C2's finding), and that axis beats the channel axis for int64-exact accumulation on x86. Recorded in PERFORMANCE.md; embedded channel-parallel (HVX 16×int64-lane, Helium) stays a follow-up candidate only if DSP budgets demand it.
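The reassociation point can be illustrated with a small sketch. The helper names are hypothetical and the two-lane split only stands in for what a tap-axis SIMD reduction does; it is not the library's code. An int64 sum of Q15 products gives the same result in any order (absent overflow), whereas a double sum of float products can change when the order changes. That is why the tap axis is open to the integer path but closed to the float path.

```cpp
#include <cstddef>

// Illustrative helpers (hypothetical names). Acc is the accumulator type.
// Forward order: the reference accumulation over taps.
template <typename Acc, typename T>
Acc dot_forward(const T* x, const T* h, std::size_t n) {
    Acc acc = 0;
    for (std::size_t i = 0; i < n; ++i)
        acc += static_cast<Acc>(x[i]) * static_cast<Acc>(h[i]);
    return acc;
}

// Two-lane order: even and odd taps are summed separately and combined at
// the end, which is the kind of reordering a tap-axis vectorizer performs.
template <typename Acc, typename T>
Acc dot_two_lanes(const T* x, const T* h, std::size_t n) {
    Acc lane[2] = {0, 0};
    for (std::size_t i = 0; i < n; ++i)
        lane[i & 1] += static_cast<Acc>(x[i]) * static_cast<Acc>(h[i]);
    return lane[0] + lane[1];
}
```

With float inputs `x = {1e10f, 1, -1e10f, 1}` and `h = {1e10f, 1, 1e10f, 1}` and a double accumulator, the forward order yields 1.0 and the two-lane order yields 2.0, because adding 1 to 1e20 is lost to rounding in the forward sum. With 16-bit integer inputs and an int64 accumulator, both orders always agree.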

Two traps, recorded for the next person

- A naive channels-inner loop with accumulators in memory is 2.8× slower than planar. Register-block (constexpr-size tile arrays) or don't bother.
- The mode gate must be compile-time: a runtime bool in the hot loops cost +6–8% on the M55 ratchet (caught by the embedded controls before push; the constexpr gate restored byte-identical codegen).
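A compile-time gate of the kind described in the second trap might look like the sketch below (C++17). Apart from `SRT_CP_MIN_CHANNELS`, every name here is an assumption made for the example, including the `SRT_HOST_TARGET` macro and the `Layout` enum. The idea is that eligibility depends only on the sample type and the target, so builds that are ineligible never contain the channel-parallel branch at all; the channel-count check is decided once, outside the hot loops.

```cpp
#include <cstddef>
#include <type_traits>

#ifndef SRT_CP_MIN_CHANNELS
#define SRT_CP_MIN_CHANNELS 4  // threshold quoted in this PR; this definition is illustrative
#endif
#ifndef SRT_HOST_TARGET
#define SRT_HOST_TARGET 1      // hypothetical macro: 1 on host builds, 0 on embedded
#endif

enum class Layout { Planar, FrameMajor };  // hypothetical type for the example

// Known at compile time: floating-point samples on a host target.
template <typename Sample>
constexpr bool kChannelParallelEligible =
    std::is_floating_point_v<Sample> && (SRT_HOST_TARGET != 0);

// Chosen once per stream, not per sample. For ineligible sample types or
// targets the FrameMajor branch is discarded and only Planar remains.
template <typename Sample>
constexpr Layout choose_layout(std::size_t channels) {
    if constexpr (kChannelParallelEligible<Sample>) {
        return channels >= static_cast<std::size_t>(SRT_CP_MIN_CHANNELS)
                   ? Layout::FrameMajor
                   : Layout::Planar;
    } else {
        return Layout::Planar;
    }
}
```

Under these assumptions, `choose_layout<float>(12)` selects frame-major on a host build, while `choose_layout<float>(2)` and any fixed-point sample type select planar.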

Campaign status: C1–C6 complete. README PERF table regenerated (same-session run).

https://claude.ai/code/session_01HuAFfoeD5a5Xe5aGNA16M9

Generated by Claude Code

Frame-major history + register-blocked 8/4/2/1 channel tiles when channels >= 4, floating-point samples, host targets only (compile-time gate; every embedded ratchet scenario verified 0.00%). Bit-exact vs planar (per-channel tap order unchanged; hash-verified over 30k blocks x 4 configs). Same-minute A/B: float 8/12/16ch -38/-38/-42% wall-clock with AVX2+FMA, -4 to -5% on baseline SSE2. Fixed-point measured ~1.5x slower channel-parallel and keeps planar (tap-axis auto-vectorization already optimal); negative result recorded in PERFORMANCE.md along with the two implementation traps (memory accumulators; runtime gate in hot loops). https://claude.ai/code/session_01HuAFfoeD5a5Xe5aGNA16M9

tap merged commit 5485078 into main Jun 12, 2026
24 checks passed

tap deleted the claude/c6-channel-parallel branch June 27, 2026 19:13

tap merged 1 commit into main from claude/c6-channel-parallel
