tap · tap · Jun 13, 2026 · Jun 12, 2026 · Jun 12, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -168,7 +168,12 @@ jobs:
       - name: Test under emulation
         run: >
           ctest --test-dir build --output-on-failure
-          -E 'AsrcQuality|AsrcLock|TwoThreadStress|TransparentPrototypeMeetsSpec|MultiChannel\.|Feasibility|Reset\.'
+          -E 'AsrcQuality|AsrcLock|TwoThreadStress|TransparentPrototypeMeetsSpec|MultiChannel\.|Feasibility|Reset\.|ConfigValidation'
+        # ConfigValidation: this static-musl toolchain cannot unwind across
+        # frames — the constructor throws correctly but EXPECT_THROW never
+        # catches and libc++abi terminates. Validation is target-independent
+        # and covered on every other leg; limitation tracked in
+        # docs/PERFORMANCE.md "Known debt".
 
   # Cross-compile for Arm Cortex-M55 (bare metal, newlib + semihosting) and
   # run the emulation-sized test subset on QEMU's MPS3 AN547 board model.

diff --git a/README.md b/README.md
@@ -16,7 +16,7 @@ slips that occur roughly once every `1/ppm` samples.
 - Real-time safe audio path: `push()`/`pull()` are `noexcept`, lock-free and
   allocation-free; all allocation and filter design happen in the constructor
 - Measured quality (default *balanced* preset, +200 ppm offset, THD+N-style
-  residual): **133 dB** SNR at 997 Hz, **111 dB** at 12 kHz, **105 dB** at
+  residual): **135 dB** SNR at 997 Hz, **112 dB** at 12 kHz, **105 dB** at
   19.5 kHz
 - ~**1.5 ms** designed latency with the default configuration at 48 kHz
   (24-frame filter group delay + 48-frame FIFO setpoint)
@@ -53,13 +53,27 @@ transparency vs. a naive FIFO, spectrograms, latency, drift tracking,
 dropout recovery — see
 [notebooks/asrc_demo.ipynb](notebooks/asrc_demo.ipynb), which drives the
 library through its C ABI (`-DSRT_BUILD_CAPI=ON`, `tools/capi/`) via ctypes
-(Python needs `numpy` and `matplotlib`; the first cell builds the shared
-library if missing). A second notebook,
+(Python needs `numpy` and `matplotlib`; the comparison notebook below
+additionally needs the `samplerate` and `soxr` packages; the first cell
+builds the shared library if missing). A second notebook,
 [notebooks/asrc_block_size_study.ipynb](notebooks/asrc_block_size_study.ipynb),
 measures how processing block size (32 / 64 / 240 frames) trades latency
 against servo observability — including per-impulse latency-breathing
 measurements and a calibrated FM/wideband quality decomposition.
 
+For real hardware there are three more entry points:
+`examples/alsa_bridge.cpp` (two ALSA devices on their real crystals — the
+[hardware testing](docs/HARDWARE_TESTING.md) Setup 1 harness, with CSV
+telemetry and post-ASRC capture), `examples/pico2_cyccnt/` (flashable
+RP2350 firmware measuring real cycles per block against the QEMU
+instruction baselines), and `examples/pico2_dualcore/` (the
+one-clock-domain-per-core RP2350 deployment, self-validating).
+
+**Consuming the library**: `add_subdirectory` or `FetchContent` only —
+there are no install/package rules yet. Version 0.1.0 (`SRT_VERSION_*` in
+`srt/srt.hpp`, `srt_version()` over the C ABI); pre-1.0, the API may
+still change between versions.
+
 ## How it works
 
 The design follows the classic commercial-ASRC architecture (AD1896-style
@@ -163,7 +177,7 @@ sample-granular transfer, 0.5 FS sine, 1 s analysis window after settling):
 
 | Preset | 997 Hz | 6 kHz | 12 kHz | 19.5 kHz | group delay |
 |---|---|---|---|---|---|
-| `balanced()` (L=256, T=48) | 133 dB | 118 dB | 111 dB | 105 dB | 0.50 ms |
+| `balanced()` (L=256, T=48) | 135 dB | 120 dB | 112 dB | 105 dB | 0.50 ms |
 | `transparent()` (L=512, T=80) | 133 dB | — | — | 108 dB | 0.83 ms |
 
 AES17-style THD+N measured under identical conditions against
@@ -210,15 +224,20 @@ CI builds and tests every push on:
 - **Performance gating on both DSP targets**: fixed workloads run under
   QEMU with an instruction-counting plugin and are compared against
   committed baselines (`bench/baselines.json`) at ±3% — a hot-path
-  regression on Hexagon or Cortex-M55 fails CI. See
+  regression on Hexagon, Cortex-M55 or Cortex-M33 fails CI. See
   [docs/PERFORMANCE.md](docs/PERFORMANCE.md).
 - **Arm Cortex-M33** (Raspberry Pi Pico 2 / RP2350 class), bare metal on
   QEMU's MPS2+ AN505 model, sharing the Armv8-M platform layer below. The
   M33 has no FP64 and no Helium, and the instruction baselines make the
   consequences concrete: the float datapath costs ~19× the M55's
   instructions (soft-double accumulation) — on Pico-class parts use
-  Q15/Q31, where 48 kHz mono fits a 150 MHz core with room to spare and
-  stereo wants the `fast()` preset or the RP2350's second core.
+  Q15/Q31. The instruction baselines suggest 48 kHz Q15 mono fits a
+  150 MHz core and stereo wants the `fast()` preset or the RP2350's
+  second core — instruction counts are not cycle counts, so treat these
+  as budgets pending real-silicon validation: `examples/pico2_cyccnt/`
+  is a flashable DWT.CYCCNT harness built to measure exactly this, and
+  `examples/pico2_dualcore/` validates the one-clock-domain-per-core
+  deployment shape.
 - **Arm Cortex-M55**, bare metal (newlib + semihosting, no OS/threads),
   executed on QEMU's MPS3 AN547 board model via `qemu-system-arm`. The
   platform layer lives in `platform/mps3_an547/` (linker script + minimal
@@ -302,7 +321,7 @@ The datapath is templated on the sample type via `srt::SampleTraits`
 
 | Type | Alias | Format | Measured SNR (997 Hz / 19.5 kHz, half scale, +200 ppm) |
 |---|---|---|---|
-| `float` | `AsyncSampleRateConverter` | float I/O, double accumulation | 133 dB / 105 dB |
+| `float` | `AsyncSampleRateConverter` | float I/O, double accumulation | 135 dB / 105 dB |
 | `std::int32_t` | `AsyncSampleRateConverterQ31` | Q31 I/O, Q1.30 coeffs, int64 accumulation, saturating | 133 dB / 105 dB |
 | `std::int16_t` | `AsyncSampleRateConverterQ15` | Q15 I/O, Q1.14 coeffs, int64 accumulation, saturating | 77 dB (format-limited) |
 

diff --git a/cmake/hexagon-linux-musl.cmake b/cmake/hexagon-linux-musl.cmake
@@ -7,7 +7,10 @@
 # with hexagon-unknown-linux-musl-clang++ and qemu-hexagon on PATH.
 #
 # Note: emulation validates ISA-level *correctness* (32-bit size_t, atomics
-# lowering, musl libc), not performance — Hexagon has no double-precision
+# lowering, musl libc), not performance. Caveat: under this static-musl
+# configuration C++ exceptions terminate (libc++abi) instead of
+# propagating — constructor validation errors are fatal here; see the
+# Known-debt entry in docs/PERFORMANCE.md — Hexagon has no double-precision
 # FPU, so the double-heavy paths run soft-float. Cycle counts need the
 # Hexagon SDK simulator.
 set(CMAKE_SYSTEM_NAME Linux)

diff --git a/docs/COMPARISON.md b/docs/COMPARISON.md
@@ -58,7 +58,7 @@ shared machine; all subjects ran in the same session.
 
 | Engine (~120 dB tier) | mono | stereo | 8-ch | algorithmic latency |
 |---|---:|---:|---:|---:|
-| **SampleRateTap** balanced | 15.6 | 10.5 | 3.0 | **23.5 frames (0.49 ms)** |
+| **SampleRateTap** balanced | 15.6 | 10.5 | 3.0 | **24 frames (0.50 ms)** |
 | libsamplerate `MEDIUM` (0.2.2) | 4.4 | 3.7 | 1.4 | 46 frames (0.96 ms) |
 | soxr `HQ` (0.1.3) | 72.9 | 32.4 | 8.4 | 556–607 frames (11.6–12.6 ms) |
 
@@ -82,6 +82,7 @@ Reading guide:
   throughput at SampleRateTap's latency.
 - **libsamplerate is the closest architectural analog** (streaming
   time-domain polyphase, block-by-block) and SampleRateTap is 2.9–3.6×
+  (mono/stereo; 2.1× at 8 channels, where both engines amortize)
   faster at the matched ~120 dB tier, 6.2× at ~140 dB, while also carrying
   ~2–3.6× less latency. That is the near-unity specialization dividend:
   a 48-tap window with a creeping phase instead of general-ratio
@@ -104,15 +105,15 @@ libsamplerate 0.2.2; arm-none-eabi-gcc 13.2.1, hexagon-clang 19.1.5, -O2.
 
 ¹ The float datapath is soft-double-bound on the FP64-less M33 — the
 README directs Pico-class parts to Q15, where the **full converter**
-(servo and FIFO included) costs ~5,206 instructions/frame: libsamplerate
-has no fixed-point path, so its cheapest option on such parts costs
-**~9.5×** what SampleRateTap's intended configuration does.
+(servo and FIFO included) costs ~5,043 instructions/frame (post-C4):
+libsamplerate has no fixed-point path, so its cheapest option on such parts costs
+**~9.8×** what SampleRateTap's intended configuration does.
 
 ## The landscape
 
 | | Type | Clock recovery | Ratio range | Quality | Latency | Footprint / targets | License & form |
 |---|---|---|---|---|---|---|---|
-| **SampleRateTap** | software ASRC | built-in (PI servo on FIFO occupancy) | near-unity (±~1000 ppm) | −132 dB THD+N / 149 dB DR measured above; Q15/Q31 paths for FPU-less DSPs | **1.5 ms default** (0.5 ms filter); sub-ms with `fast()` | 308× RT/core x86; ~515 insn/sample Q15 on Hexagon, CI-gated | MIT, header-only C++20 |
+| **SampleRateTap** | software ASRC | built-in (PI servo on FIFO occupancy) | near-unity (±~1000 ppm) | −132 dB THD+N / 149 dB DR measured above; Q15/Q31 paths for FPU-less DSPs | **1.5 ms default** (0.5 ms filter); sub-ms with `fast()` | 308× RT/core x86; ~515 insn/sample Q15 kernel-only on Hexagon (full converter ~1,245/frame stereo), CI-gated | MIT, header-only C++20 |
 | [AD1896][ad1896] (ADI) | hardware ASRC | built-in | 1:8 up / 7.75:1 down | THD+N −117 dB min / −133 dB best; 142 dB DNR (datasheet) | sub-ms–ms, mode dependent | dedicated chip, one stereo pair | proprietary |
 | [SRC4392][src4392] (TI) | hardware ASRC | built-in (automatic) | 1:16–16:1 | THD+N −140 dB typ; 144 dB DR (datasheet) | selectable filter delay | dedicated chip + DIR/DIT | proprietary |
 | [libsamplerate][lsr] | resampler library | **no** — caller supplies ratio | 1/256–256 | measured above (near-unity); 97 dB worst-case across ratios (own docs) | filter-dependent, offline-friendly | portable C, float | BSD-2 |

diff --git a/docs/HARDWARE_TESTING.md b/docs/HARDWARE_TESTING.md
@@ -63,17 +63,20 @@ clock when the analog path is not trusted.
 
 Two things this proves that emulation cannot:
 
-- **The cycle budget.** [PERFORMANCE.md](PERFORMANCE.md) notes that QEMU
+- **The cycle budget** (harness shipped: [`examples/pico2_cyccnt/`](../examples/pico2_cyccnt/)
+  builds a flashable UF2 for this measurement).
+  [PERFORMANCE.md](PERFORMANCE.md) notes that QEMU
   gives deterministic *instruction* counts, not cycles, and real cycles
   need hardware counters. The RP2350 has DWT.CYCCNT: wrapping `pull()`
   in CYCCNT reads gives real cycles-per-block at 150 MHz — directly
   testing the README's claim that Q15 mono fits comfortably and stereo
   is tight on one core. Correlating CYCCNT against the QEMU instruction
   baselines also calibrates the ratchet ("1 QEMU instruction ≈ N RP2350
   cycles") for all future M33 numbers.
-- **Dual-core deployment.** The README suggests dedicating the RP2350's
-  second core to stereo; an actual core1-runs-ASRC build verifies that
-  guidance.
+- **Dual-core deployment** (harness shipped:
+  [`examples/pico2_dualcore/`](../examples/pico2_dualcore/), self-validating
+  PASS/FAIL phases). The README suggests dedicating the RP2350's second
+  core to one clock domain; flashing the example verifies that guidance.
 
 ## Setup 3 — two Pis over Ethernet
 
@@ -95,15 +98,13 @@ X ppm, N hours, zero discontinuities"). Then Setup 2, because
 real-silicon CYCCNT numbers close the loop on everything the M33
 emulation work predicted.
 
-The code each setup needs:
-
-- **Setup 1**: an ALSA duplex bridge example (two threads around
-  `push()`/`pull()`, telemetry logging to CSV, optional post-ASRC capture
-  to disk) plus a script to plot the ppm trace and analyze the captured
-  stream.
-- **Setup 2**: a small Pico SDK firmware project wrapping the header-only
-  library — the M33 toolchain support already proves the code compiles
-  for that core (`cmake/arm-cortex-m33-mps2.cmake` shows the required
-  flags: `-mcpu=cortex-m33 -mthumb -mfloat-abi=hard`).
-- **Setup 3**: two small programs (UDP sender, receiver-with-ASRC) reusing
-  the Setup 1 bridge's output half.
+What exists and what remains:
+
+- **Setup 1**: shipped — `examples/alsa_bridge.cpp` (see above). Still
+  missing: a small script to plot the `--csv` ppm trace and run the
+  notebook analysis over a `--dump` capture.
+- **Setup 2**: shipped — `examples/pico2_cyccnt/` (cycle measurement) and
+  `examples/pico2_dualcore/` (dual-core deployment), both building
+  flashable UF2s; the measured numbers await a physical Pico 2.
+- **Setup 3**: not yet written — two small programs (UDP sender,
+  receiver-with-ASRC) reusing the Setup 1 bridge's output half.
diff --git a/docs/PERFORMANCE.md b/docs/PERFORMANCE.md
@@ -11,10 +11,11 @@ the hot path follow it.
 | Throughput | ns per output frame, steady-state `pull()`+`push()`, reported as ×realtime at 48 kHz | host (Google Benchmark) |
 | Tail latency | p99/max per-call time for `pull(128)` over long runs — the RT budget lives in the tail, not the mean | host |
 | Kernel cost | `srt::interpolate()` in isolation (≈ all datapath cycles: taps × channels MACs) | host |
-| Embedded cost | **executed instructions** per output frame via QEMU TCG plugins — deterministic to the instruction, noise-free, well-correlated with real cost for scalar code | Hexagon (qemu-user), Cortex-M55 (qemu-system) |
+| Embedded cost | **executed instructions** per output frame via QEMU TCG plugins — deterministic to the instruction, noise-free, well-correlated with real cost for scalar code | Hexagon (qemu-user), Cortex-M55 and Cortex-M33 (qemu-system) |
 
 Cycle-accurate embedded numbers require vendor simulators (Hexagon SDK
-simulator, Cadence xt-run) or hardware counters (DWT.CYCCNT on M55 silicon);
+simulator, Cadence xt-run) or hardware counters (DWT.CYCCNT on M-class silicon —
+`examples/pico2_cyccnt/` is a flashable RP2350 harness for exactly that);
 the instruction metric is what CI can gate deterministically.
 
 The benchmark matrix: sample type (float / Q15 / Q31) × filter preset
@@ -36,23 +37,26 @@ to the combinations that change the answer.
 
 ### Known hypotheses, in expected ROI order
 
-1. **Per-channel blend redundancy**: `interpolate()` runs per channel with
+1. **Per-channel blend redundancy** (done as C1; see status below):
+   `interpolate()` runs per channel with
    the same μ, so the coefficient blend is recomputed per channel.
    Precompute the blended row once per output frame (≤ 80 entries of
    scratch), dot-product per channel. Roughly halves inner-loop work for
    stereo; scales with channel count; makes the loop SIMD-friendlier.
-2. **Auto-vectorization quality**: contiguity, aliasing, alignment of the
-   history window and coefficient rows. Verify, don't assume.
+2. **Auto-vectorization quality** (done as C2; see status below):
+   contiguity, aliasing, alignment of the history window and coefficient
+   rows. Verify, don't assume.
 3. **Fixed-point phase accumulator** (done as Q0.64; see status below).
    Correction discovered while measuring: Cortex-M55's *scalar* FPU does
    support FP64 (only MVE is fp16/fp32), so the M55 float path was never
    soft-double-bound — Hexagon is the genuinely double-less target.
 4. **Explicit SIMD kernels** — partially moot for M55: objdump confirms
    GCC already auto-vectorizes the Q15/Q31 kernels with Helium at -O2
    (the M55's ~4× Q15 advantage over the scalar M33 in the baselines is
-   MVE at work). Remaining candidates: packed SMLAD Q15 kernel for
-   M33/Pico-class parts (their binaries are nearly DSP-extension-free
-   today), NEON/AVX2 for hosts — only if budgets demand.
+   MVE at work). The packed dual-MAC Q15 kernel for M33/Pico-class parts
+   shipped as C4 (SMLALD; those binaries now carry it); the host float
+   channel axis shipped as C6. Remaining: NEON/AVX2 tap-axis work and
+   embedded channel-parallel (HVX/Helium) — only if budgets demand.
 
 ## "Done" criteria
 
@@ -75,7 +79,7 @@ baseline lands and revised deliberately. Stop when any of:
 
   Mechanics: `bench/icount/` builds one fixed-workload binary per scenario
   (no argv on bare metal); `tools/qemu_insn_plugin/` is the counting
-  plugin; `scripts/icount.py --target {m55,hexagon} --build-dir D --plugin
+  plugin; `scripts/icount.py --target {m55,m33,hexagon} --build-dir D --plugin
   P [--update]` runs and compares; targets are m55, m33 (mps2-an505) and
   hexagon. Counts are exact across runs (verified),
   but they are a function of the **compiler version**: when the CI
@@ -105,6 +109,12 @@ table is already enforced by test thresholds.
   the matching comment).
 - **Tail-latency benchmark not implemented**: the Metrics table promises
   p99/max per-call `pull(128)` timing; no benchmark measures it yet.
+- **Hexagon static-musl cannot catch exceptions**: a constructor throw
+  terminates via libc++abi instead of propagating (discovered when the
+  first EXPECT_THROW test reached that leg; ConfigValidation is excluded
+  there). Deployment note: on this toolchain configuration, treat invalid
+  Config as fatal — validate inputs before constructing. Candidate fix:
+  link an unwinder (-unwindlib=libunwind) in cmake/hexagon-linux-musl.cmake.
 
 ## Sequencing & status
 

diff --git a/examples/alsa_bridge.cpp b/examples/alsa_bridge.cpp
@@ -229,6 +229,11 @@ int main(int argc, char** argv) {
     cfg.sampleRateHz = static_cast<double>(args.rate);
     cfg.channels = args.channels;
     cfg.targetLatencyFrames = args.latency;
+    // Per the ServoConfig guidance: the unlock threshold must sit
+    // comfortably above half the transfer block, or block-quantized
+    // occupancy excursions can demote the servo stage spuriously.
+    cfg.servo.unlockThresholdFrames =
+        std::max(cfg.servo.unlockThresholdFrames, 1.5 * static_cast<double>(args.period));
     srt::AsyncSampleRateConverter asrc(cfg);
     std::printf("designed latency: %.2f ms%s\n", asrc.designedLatencySeconds() * 1e3,
                 args.toneHz > 0.0 ? "  (tone mode: captured samples discarded)" : "");

diff --git a/examples/pico2_cyccnt/CMakeLists.txt b/examples/pico2_cyccnt/CMakeLists.txt
@@ -42,7 +42,8 @@ set(CMAKE_CXX_STANDARD_REQUIRED ON)
 pico_sdk_init()
 
 # Float datapath measurement: soft-double accumulation on the M33, expected
-# ~19x the Q15 instruction count — slow but a real number is still valuable.
+# ~3.8x the Q15 instruction count (1,856.7M vs 484.1M baselines) — slow
+# but a real number is still valuable.
 option(PICO2_MEASURE_FLOAT "Measure the float (soft FP64) datapath too" ON)
 
 add_executable(pico2_cyccnt main.cpp)

diff --git a/notebooks/asrc_demo.ipynb b/notebooks/asrc_demo.ipynb
@@ -307,7 +307,7 @@
    "id": "222bb3d0",
    "metadata": {},
    "source": [
-    "Roughly ten audible clicks per second, and an SNR in the 30s. This is\n",
+    "Roughly ten audible clicks per second, and an SNR around 29 dB. This is\n",
     "what every \"just use a ring buffer\" design does at some rate, whether its\n",
     "author knows it or not.\n",
     "\n",
@@ -727,7 +727,7 @@
     "\n",
     "| What | Measured here |\n",
     "|---|---|\n",
-    "| Naive FIFO at +200 ppm | clicks ~10×/s, SNR in the 30s dB |\n",
+    "| Naive FIFO at +200 ppm | clicks ~10×/s, SNR around 29 dB dB |\n",
     "| SampleRateTap, same conditions | **SNR > 130 dB** — at the 24-bit noise floor |\n",
     "| Lock from cold start | ~1 s |\n",
     "| Latency | ≈ designed 1.5 ms, linear phase |\n",

diff --git a/tests/test_asrc_quality.cpp b/tests/test_asrc_quality.cpp
@@ -56,7 +56,7 @@ double measureSnrDb(const srt::FilterSpec& spec, double freqHz) {
     return snr;
 }
 
-// Thresholds sit ~4 dB under measured performance (133/118/111/105 dB for
+// Thresholds sit 4-7 dB under measured performance (135/120/113/106 dB for
 // balanced at 997/6k/12k/19.5k; 133/108 dB for transparent). The residual at
 // high frequencies is dominated by the linear interpolation between adjacent
 // phase-table rows, which falls ~12 dB per doubling of numPhases and rises

diff --git a/tests/test_fixed_point.cpp b/tests/test_fixed_point.cpp
@@ -100,10 +100,10 @@ TEST(FixedPoint, AsrcQualityQ15_997Hz) {
     EXPECT_GT(measureSnrDb<std::int16_t>(997.0, 0.5), 73.0);
 }
 TEST(FixedPoint, AsrcQualityQ31_997Hz) {
-    EXPECT_GT(measureSnrDb<std::int32_t>(997.0, 0.5), 124.0);
+    EXPECT_GT(measureSnrDb<std::int32_t>(997.0, 0.5), 124.0); // measured ~133 dB
 }
 TEST(FixedPoint, AsrcQualityQ31_19_5kHz) {
-    EXPECT_GT(measureSnrDb<std::int32_t>(19500.0, 0.5), 96.0);
+    EXPECT_GT(measureSnrDb<std::int32_t>(19500.0, 0.5), 96.0); // measured ~105 dB
 }
 
 TEST(FixedPoint, FullScaleSineDoesNotWrapQ15) {