
Channel-parallel float dot for high channel counts (perf C6)#19

Merged
tap merged 1 commit into main from claude/c6-channel-parallel
Jun 12, 2026


tap commented Jun 12, 2026


Roadmap item C6: channel-parallel dot products for the high-channel-count deployments (12ch 7.1.4, 16ch AVB+mics). The work was profile-driven and ended with one big win, one negative result, and two recorded traps.

Profile first (callgrind, 12ch Q15 host)

Per-channel dot MACs ≈ 85% of the instruction stream; deinterleave/transport ~2% (hypothesis "scatter cost matters" falsified). The dots were the target.

The change

When channels ≥ 4 (SRT_CP_MIN_CHANNELS), samples are floating-point, and the build targets a host, history is stored frame-major (interleaved) instead of planar. Each output frame then computes all channels' dots in register-blocked tiles of 8/4/2/1 channels: one accumulator lane per channel, coefficient broadcast, contiguous loads. Bit-exact by construction: each channel's accumulation order over taps is unchanged (lanes are channels, not taps). Output was hash-verified identical to planar over 30k blocks × {float, Q15} × {12, 16}ch.

This is the first optimization that lets the float path vectorize at all: tap-axis SIMD is forbidden by the strict double-accumulation contract (deferred hypothesis 5), but the channel axis was always free.
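A minimal sketch of the tiling described above, not the code from this PR. Every name here (`dot_planar`, `dot_tile`, `dot_all_channels`, `matches_planar`) is hypothetical, and the history is assumed to be laid out as `frames[t * channels + ch]` with `t` indexing taps. It shows why the result can match planar bit for bit: each channel still adds its taps in the same order, and only the loop over channels is blocked.

```cpp
#include <cstddef>
#include <vector>

// Planar reference: one channel's dot over taps, accumulated in double.
inline double dot_planar(const float* hist, const float* coeff, std::size_t taps) {
    double acc = 0.0;
    for (std::size_t t = 0; t < taps; ++t)
        acc += static_cast<double>(coeff[t]) * static_cast<double>(hist[t]);
    return acc;
}

// One tile of Tile adjacent channels from frame-major history.
// `frames` points at the tile's first channel in the first frame.
template <std::size_t Tile>
inline void dot_tile(const float* frames, std::size_t channels,
                     const float* coeff, std::size_t taps, double* out) {
    double acc[Tile] = {};  // constexpr-size array, so it can live in registers
    for (std::size_t t = 0; t < taps; ++t) {
        const double c = static_cast<double>(coeff[t]);  // broadcast to all lanes
        const float* f = frames + t * channels;          // contiguous channels
        for (std::size_t k = 0; k < Tile; ++k)
            acc[k] += c * static_cast<double>(f[k]);     // lane k is channel k
    }
    for (std::size_t k = 0; k < Tile; ++k) out[k] = acc[k];
}

// Cover all channels with tiles of 8, then 4, 2, 1.
inline void dot_all_channels(const float* frames, std::size_t channels,
                             const float* coeff, std::size_t taps, double* out) {
    std::size_t ch = 0;
    for (; ch + 8 <= channels; ch += 8)
        dot_tile<8>(frames + ch, channels, coeff, taps, out + ch);
    if (ch + 4 <= channels) { dot_tile<4>(frames + ch, channels, coeff, taps, out + ch); ch += 4; }
    if (ch + 2 <= channels) { dot_tile<2>(frames + ch, channels, coeff, taps, out + ch); ch += 2; }
    if (ch < channels)      { dot_tile<1>(frames + ch, channels, coeff, taps, out + ch); }
}

// Self-check: compare every channel against the planar reference, exactly.
inline bool matches_planar(std::size_t channels, std::size_t taps) {
    std::vector<float> frames(taps * channels), coeff(taps), plane(taps);
    for (std::size_t t = 0; t < taps; ++t) {
        coeff[t] = 1.0f / static_cast<float>(t + 3);
        for (std::size_t ch = 0; ch < channels; ++ch)
            frames[t * channels + ch] =
                static_cast<float>((t * 31 + ch * 17) % 101) / 7.0f - 5.0f;
    }
    std::vector<double> out(channels);
    dot_all_channels(frames.data(), channels, coeff.data(), taps, out.data());
    for (std::size_t ch = 0; ch < channels; ++ch) {
        for (std::size_t t = 0; t < taps; ++t) plane[t] = frames[t * channels + ch];
        if (dot_planar(plane.data(), coeff.data(), taps) != out[ch]) return false;
    }
    return true;
}
```

Calling `matches_planar(12, 64)` or `matches_planar(16, 48)` exercises the 8+4 and 8+8 tile splits. Whether the compiler actually vectorizes the inner `k` loop depends on the target flags, which this sketch does not control.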

Measured (same-minute A/B)

| Config | Δ wall-clock |
|---|---|
| float 8ch, -march=native (AVX2+FMA) | −41% |
| float 12ch, native | −38% |
| float 16ch, native | −42% |
| float 8/12/16ch, baseline -O2 (SSE2) | −4–5% |
| float/Q15 2ch (gate-off controls) | unchanged |
| all 14 embedded ratchet scenarios (M33/M55 measured locally, Hexagon by this PR's CI) | 0.00% |

Gains scale with SIMD width: header-only consumers building with AVX2 get the 1.6–1.7× speedup.
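For reference, the speedup range follows from the wall-clock deltas, assuming Δ is the relative change in wall-clock time against the planar baseline:

$$
\text{speedup} = \frac{t_\text{planar}}{t_\text{channel-parallel}} = \frac{1}{1 + \Delta},
\qquad \Delta = -0.38 \Rightarrow \frac{1}{0.62} \approx 1.61,
\qquad \Delta = -0.42 \Rightarrow \frac{1}{0.58} \approx 1.72
$$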

Negative result: fixed-point keeps planar

Channel-parallel Q15 measured ~1.5× slower than planar on hosts: planar Q15 already auto-vectorizes over the tap axis (integer reduction is exactly reassociable — C2's finding), and that axis beats the channel axis for int64-exact accumulation on x86. Recorded in PERFORMANCE.md; embedded channel-parallel (HVX 16×int64-lane, Helium) stays a follow-up candidate only if DSP budgets demand it.
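The reassociation point can be illustrated with a small sketch (hypothetical helper names, not code from this PR). Splitting a tap-axis sum into lanes, as a vectorizer would, leaves an exact int64 sum unchanged, but can change a double-accumulated float sum. That is why the tap axis is open to the Q15 path and closed to the float path.

```cpp
#include <cstddef>
#include <cstdint>

// Taps summed strictly first to last.
template <typename Acc, typename Sample>
Acc dot_forward(const Sample* x, const Sample* c, std::size_t n) {
    Acc acc = 0;
    for (std::size_t i = 0; i < n; ++i)
        acc += static_cast<Acc>(x[i]) * static_cast<Acc>(c[i]);
    return acc;
}

// Same products, but even and odd taps go to separate partial sums that are
// combined at the end. This models a 2-lane tap-axis vectorization.
template <typename Acc, typename Sample>
Acc dot_two_lanes(const Sample* x, const Sample* c, std::size_t n) {
    Acc even = 0, odd = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const Acc p = static_cast<Acc>(x[i]) * static_cast<Acc>(c[i]);
        if (i % 2 == 0) even += p; else odd += p;
    }
    return even + odd;
}
```

With Q15 inputs and an `int64_t` accumulator, both functions always return the same value. With float inputs `{1e18f, 1.0f, -1e18f, 1.0f}`, unit coefficients, and a `double` accumulator, `dot_forward` returns 1.0 and `dot_two_lanes` returns 2.0 under default IEEE-754 evaluation (no `-ffast-math`).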

Two traps, recorded for the next person

  1. A naive channels-inner loop with accumulators in memory is 2.8× slower than planar — register-block (constexpr-size tile arrays) or don't bother.
  2. The mode gate must be compile-time: a runtime bool in the hot loops cost +6–8% on the M55 ratchet (caught by the embedded controls before push; the constexpr gate restored byte-identical codegen).
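One way a compile-time gate like the one in trap 2 could look, as a sketch under assumptions. Only the threshold of 4 and the three conditions come from this PR. The names `kCpMinChannels`, `kChannelParallelCompiled`, `Layout`, and `choose_layout` are hypothetical, as is passing the host/embedded distinction as a template parameter. The PR does not say whether the channel count is checked at compile time or once at setup; here it is an ordinary argument evaluated outside any hot loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Stands in for SRT_CP_MIN_CHANNELS (value taken from the PR text).
inline constexpr std::size_t kCpMinChannels = 4;

enum class Layout { Planar, FrameMajor };

// Compile-time part of the gate: sample type and target class.
template <typename Sample, bool HostTarget>
inline constexpr bool kChannelParallelCompiled =
    HostTarget && std::is_floating_point_v<Sample>;

// Pick the history layout once. When the gate is false, the frame-major
// branch is discarded at compile time, so fixed-point and embedded builds
// contain no trace of it.
template <typename Sample, bool HostTarget>
constexpr Layout choose_layout(std::size_t channels) {
    if constexpr (kChannelParallelCompiled<Sample, HostTarget>) {
        return channels >= kCpMinChannels ? Layout::FrameMajor : Layout::Planar;
    } else {
        (void)channels;  // unused when the gate is compiled out
        return Layout::Planar;
    }
}
```

With this shape, `choose_layout<float, true>(12)` yields `Layout::FrameMajor`, while `choose_layout<std::int16_t, true>(16)` and `choose_layout<float, false>(16)` both yield `Layout::Planar`.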

Campaign status: C1–C6 complete. README PERF table regenerated (same-session run).

https://claude.ai/code/session_01HuAFfoeD5a5Xe5aGNA16M9


Generated by Claude Code

Frame-major history + register-blocked 8/4/2/1 channel tiles when
channels >= 4, floating-point samples, host targets only (compile-time
gate; every embedded ratchet scenario verified 0.00%). Bit-exact vs
planar (per-channel tap order unchanged; hash-verified over 30k blocks
x 4 configs). Same-minute A/B: float 8/12/16ch -38/-38/-42% wall-clock
with AVX2+FMA, -4-5% on baseline SSE2. Fixed-point measured ~1.5x
slower channel-parallel and keeps planar (taps-axis auto-vectorization
already optimal) - negative result recorded in PERFORMANCE.md along
with the two implementation traps (memory accumulators; runtime gate
in hot loops).

https://claude.ai/code/session_01HuAFfoeD5a5Xe5aGNA16M9
tap merged commit 5485078 into main Jun 12, 2026
24 checks passed
tap deleted the claude/c6-channel-parallel branch June 27, 2026 19:13