Merged
7 changes: 6 additions & 1 deletion .github/workflows/ci.yml
@@ -168,7 +168,12 @@ jobs:
- name: Test under emulation
run: >
ctest --test-dir build --output-on-failure
-E 'AsrcQuality|AsrcLock|TwoThreadStress|TransparentPrototypeMeetsSpec|MultiChannel\.|Feasibility|Reset\.'
-E 'AsrcQuality|AsrcLock|TwoThreadStress|TransparentPrototypeMeetsSpec|MultiChannel\.|Feasibility|Reset\.|ConfigValidation'
# ConfigValidation: this static-musl toolchain cannot unwind across
# frames — the constructor throws correctly but EXPECT_THROW never
# catches and libc++abi terminates. Validation is target-independent
# and covered on every other leg; limitation tracked in
# docs/PERFORMANCE.md "Known debt".

# Cross-compile for Arm Cortex-M55 (bare metal, newlib + semihosting) and
# run the emulation-sized test subset on QEMU's MPS3 AN547 board model.
35 changes: 27 additions & 8 deletions README.md
@@ -16,7 +16,7 @@ slips that occur roughly once every `1/ppm` samples.
- Real-time safe audio path: `push()`/`pull()` are `noexcept`, lock-free and
allocation-free; all allocation and filter design happen in the constructor
- Measured quality (default *balanced* preset, +200 ppm offset, THD+N-style
residual): **133 dB** SNR at 997 Hz, **111 dB** at 12 kHz, **105 dB** at
residual): **135 dB** SNR at 997 Hz, **112 dB** at 12 kHz, **105 dB** at
19.5 kHz
- ~**1.5 ms** designed latency with the default configuration at 48 kHz
(24-frame filter group delay + 48-frame FIFO setpoint)
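
The latency figure above, and the slip rate an uncorrected FIFO would show at the quoted +200 ppm offset, follow from simple arithmetic. A minimal C++ sketch; the helper names `framesToMs` and `slipsPerSecond` are illustrative and not part of the library:

```cpp
#include <cassert>

// Convert a frame count to milliseconds at the given sample rate.
inline double framesToMs(double frames, double sampleRateHz) {
    assert(sampleRateHz > 0.0);
    return frames / sampleRateHz * 1e3;
}

// Slips per second of an uncorrected FIFO: one slip roughly every
// 1/ppm samples, i.e. sampleRate * offset slips each second.
inline double slipsPerSecond(double sampleRateHz, double offsetPpm) {
    assert(sampleRateHz > 0.0);
    return sampleRateHz * offsetPpm * 1e-6;
}
```

With the README's numbers, `framesToMs(24 + 48, 48000.0)` evaluates to 1.5 (0.5 for the 24-frame filter delay alone), and `slipsPerSecond(48000.0, 200.0)` to about 9.6, consistent with the "roughly ten clicks per second" quoted in the demo notebook.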
@@ -53,13 +53,27 @@ transparency vs. a naive FIFO, spectrograms, latency, drift tracking,
dropout recovery — see
[notebooks/asrc_demo.ipynb](notebooks/asrc_demo.ipynb), which drives the
library through its C ABI (`-DSRT_BUILD_CAPI=ON`, `tools/capi/`) via ctypes
(Python needs `numpy` and `matplotlib`; the first cell builds the shared
library if missing). A second notebook,
(Python needs `numpy` and `matplotlib`; the comparison notebook below
additionally needs the `samplerate` and `soxr` packages; the first cell
builds the shared library if missing). A second notebook,
[notebooks/asrc_block_size_study.ipynb](notebooks/asrc_block_size_study.ipynb),
measures how processing block size (32 / 64 / 240 frames) trades latency
against servo observability — including per-impulse latency-breathing
measurements and a calibrated FM/wideband quality decomposition.

For real hardware there are three more entry points:
`examples/alsa_bridge.cpp` (two ALSA devices on their real crystals — the
[hardware testing](docs/HARDWARE_TESTING.md) Setup 1 harness, with CSV
telemetry and post-ASRC capture), `examples/pico2_cyccnt/` (flashable
RP2350 firmware measuring real cycles per block against the QEMU
instruction baselines), and `examples/pico2_dualcore/` (the
one-clock-domain-per-core RP2350 deployment, self-validating).

**Consuming the library**: `add_subdirectory` or `FetchContent` only —
there are no install/package rules yet. Version 0.1.0 (`SRT_VERSION_*` in
`srt/srt.hpp`, `srt_version()` over the C ABI); pre-1.0, the API may
still change between versions.

## How it works

The design follows the classic commercial-ASRC architecture (AD1896-style
@@ -163,7 +177,7 @@ sample-granular transfer, 0.5 FS sine, 1 s analysis window after settling):

| Preset | 997 Hz | 6 kHz | 12 kHz | 19.5 kHz | group delay |
|---|---|---|---|---|---|
| `balanced()` (L=256, T=48) | 133 dB | 118 dB | 111 dB | 105 dB | 0.50 ms |
| `balanced()` (L=256, T=48) | 135 dB | 120 dB | 112 dB | 105 dB | 0.50 ms |
| `transparent()` (L=512, T=80) | 133 dB | — | — | 108 dB | 0.83 ms |

AES17-style THD+N measured under identical conditions against
@@ -210,15 +224,20 @@ CI builds and tests every push on:
- **Performance gating on both DSP targets**: fixed workloads run under
QEMU with an instruction-counting plugin and are compared against
committed baselines (`bench/baselines.json`) at ±3% — a hot-path
regression on Hexagon or Cortex-M55 fails CI. See
regression on Hexagon, Cortex-M55 or Cortex-M33 fails CI. See
[docs/PERFORMANCE.md](docs/PERFORMANCE.md).
- **Arm Cortex-M33** (Raspberry Pi Pico 2 / RP2350 class), bare metal on
QEMU's MPS2+ AN505 model, sharing the Armv8-M platform layer below. The
M33 has no FP64 and no Helium, and the instruction baselines make the
consequences concrete: the float datapath costs ~19× the M55's
instructions (soft-double accumulation) — on Pico-class parts use
Q15/Q31, where 48 kHz mono fits a 150 MHz core with room to spare and
stereo wants the `fast()` preset or the RP2350's second core.
Q15/Q31. The instruction baselines suggest 48 kHz Q15 mono fits a
150 MHz core and stereo wants the `fast()` preset or the RP2350's
second core — instruction counts are not cycle counts, so treat these
as budgets pending real-silicon validation: `examples/pico2_cyccnt/`
is a flashable DWT.CYCCNT harness built to measure exactly this, and
`examples/pico2_dualcore/` validates the one-clock-domain-per-core
deployment shape.
- **Arm Cortex-M55**, bare metal (newlib + semihosting, no OS/threads),
executed on QEMU's MPS3 AN547 board model via `qemu-system-arm`. The
platform layer lives in `platform/mps3_an547/` (linker script + minimal
@@ -302,7 +321,7 @@ The datapath is templated on the sample type via `srt::SampleTraits`

| Type | Alias | Format | Measured SNR (997 Hz / 19.5 kHz, half scale, +200 ppm) |
|---|---|---|---|
| `float` | `AsyncSampleRateConverter` | float I/O, double accumulation | 133 dB / 105 dB |
| `float` | `AsyncSampleRateConverter` | float I/O, double accumulation | 135 dB / 105 dB |
| `std::int32_t` | `AsyncSampleRateConverterQ31` | Q31 I/O, Q1.30 coeffs, int64 accumulation, saturating | 133 dB / 105 dB |
| `std::int16_t` | `AsyncSampleRateConverterQ15` | Q15 I/O, Q1.14 coeffs, int64 accumulation, saturating | 77 dB (format-limited) |

5 changes: 4 additions & 1 deletion cmake/hexagon-linux-musl.cmake
@@ -7,7 +7,10 @@
# with hexagon-unknown-linux-musl-clang++ and qemu-hexagon on PATH.
#
# Note: emulation validates ISA-level *correctness* (32-bit size_t, atomics
# lowering, musl libc), not performance — Hexagon has no double-precision
# lowering, musl libc), not performance. Caveat: under this static-musl
# configuration C++ exceptions terminate (libc++abi) instead of
# propagating, so constructor validation errors are fatal here; see the
# Known-debt entry in docs/PERFORMANCE.md. Hexagon has no double-precision
# FPU, so the double-heavy paths run soft-float. Cycle counts need the
# Hexagon SDK simulator.
set(CMAKE_SYSTEM_NAME Linux)
11 changes: 6 additions & 5 deletions docs/COMPARISON.md
@@ -58,7 +58,7 @@ shared machine; all subjects ran in the same session.

| Engine (~120 dB tier) | mono | stereo | 8-ch | algorithmic latency |
|---|---:|---:|---:|---:|
| **SampleRateTap** balanced | 15.6 | 10.5 | 3.0 | **23.5 frames (0.49 ms)** |
| **SampleRateTap** balanced | 15.6 | 10.5 | 3.0 | **24 frames (0.50 ms)** |
| libsamplerate `MEDIUM` (0.2.2) | 4.4 | 3.7 | 1.4 | 46 frames (0.96 ms) |
| soxr `HQ` (0.1.3) | 72.9 | 32.4 | 8.4 | 556–607 frames (11.6–12.6 ms) |

@@ -82,6 +82,7 @@ Reading guide:
throughput at SampleRateTap's latency.
- **libsamplerate is the closest architectural analog** (streaming
time-domain polyphase, block-by-block) and SampleRateTap is 2.9–3.6×
(mono/stereo; 2.1× at 8 channels, where both engines amortize)
faster at the matched ~120 dB tier, 6.2× at ~140 dB, while also carrying
~2–3.6× less latency. That is the near-unity specialization dividend:
a 48-tap window with a creeping phase instead of general-ratio
Expand All @@ -104,15 +105,15 @@ libsamplerate 0.2.2; arm-none-eabi-gcc 13.2.1, hexagon-clang 19.1.5, -O2.

¹ The float datapath is soft-double-bound on the FP64-less M33 — the
README directs Pico-class parts to Q15, where the **full converter**
(servo and FIFO included) costs ~5,206 instructions/frame: libsamplerate
has no fixed-point path, so its cheapest option on such parts costs
**~9.5×** what SampleRateTap's intended configuration does.
(servo and FIFO included) costs ~5,043 instructions/frame (post-C4):
libsamplerate has no fixed-point path, so its cheapest option on such parts costs
**~9.8×** what SampleRateTap's intended configuration does.

## The landscape

| | Type | Clock recovery | Ratio range | Quality | Latency | Footprint / targets | License & form |
|---|---|---|---|---|---|---|---|
| **SampleRateTap** | software ASRC | built-in (PI servo on FIFO occupancy) | near-unity (±~1000 ppm) | −132 dB THD+N / 149 dB DR measured above; Q15/Q31 paths for FPU-less DSPs | **1.5 ms default** (0.5 ms filter); sub-ms with `fast()` | 308× RT/core x86; ~515 insn/sample Q15 on Hexagon, CI-gated | MIT, header-only C++20 |
| **SampleRateTap** | software ASRC | built-in (PI servo on FIFO occupancy) | near-unity (±~1000 ppm) | −132 dB THD+N / 149 dB DR measured above; Q15/Q31 paths for FPU-less DSPs | **1.5 ms default** (0.5 ms filter); sub-ms with `fast()` | 308× RT/core x86; ~515 insn/sample Q15 kernel-only on Hexagon (full converter ~1,245/frame stereo), CI-gated | MIT, header-only C++20 |
| [AD1896][ad1896] (ADI) | hardware ASRC | built-in | 1:8 up / 7.75:1 down | THD+N −117 dB min / −133 dB best; 142 dB DNR (datasheet) | sub-ms–ms, mode dependent | dedicated chip, one stereo pair | proprietary |
| [SRC4392][src4392] (TI) | hardware ASRC | built-in (automatic) | 1:16–16:1 | THD+N −140 dB typ; 144 dB DR (datasheet) | selectable filter delay | dedicated chip + DIR/DIT | proprietary |
| [libsamplerate][lsr] | resampler library | **no** — caller supplies ratio | 1/256–256 | measured above (near-unity); 97 dB worst-case across ratios (own docs) | filter-dependent, offline-friendly | portable C, float | BSD-2 |
33 changes: 17 additions & 16 deletions docs/HARDWARE_TESTING.md
@@ -63,17 +63,20 @@ clock when the analog path is not trusted.

Two things this proves that emulation cannot:

- **The cycle budget.** [PERFORMANCE.md](PERFORMANCE.md) notes that QEMU
- **The cycle budget** (harness shipped: [`examples/pico2_cyccnt/`](../examples/pico2_cyccnt/)
builds a flashable UF2 for this measurement).
[PERFORMANCE.md](PERFORMANCE.md) notes that QEMU
gives deterministic *instruction* counts, not cycles, and real cycles
need hardware counters. The RP2350 has DWT.CYCCNT: wrapping `pull()`
in CYCCNT reads gives real cycles-per-block at 150 MHz — directly
testing the README's claim that Q15 mono fits comfortably and stereo
is tight on one core. Correlating CYCCNT against the QEMU instruction
baselines also calibrates the ratchet ("1 QEMU instruction ≈ N RP2350
cycles") for all future M33 numbers.
- **Dual-core deployment.** The README suggests dedicating the RP2350's
second core to stereo; an actual core1-runs-ASRC build verifies that
guidance.
- **Dual-core deployment** (harness shipped:
[`examples/pico2_dualcore/`](../examples/pico2_dualcore/), self-validating
PASS/FAIL phases). The README suggests dedicating the RP2350's second
core to one clock domain; flashing the example verifies that guidance.
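
The cycle-budget measurement described above can be sketched in portable C++. This is an illustration only, not the code in `examples/pico2_cyccnt/`: the names `elapsedCycles`, `measureCycles`, `cyclesPerInstruction` and `budgetFraction` are hypothetical, and the counter read is passed in as a callable so the sketch also runs on a host. On the RP2350 that callable would read the DWT cycle counter, enabled beforehand.

```cpp
#include <cassert>
#include <cstdint>

// Elapsed cycles between two reads of a free-running 32-bit counter.
// Unsigned subtraction stays correct across a single wraparound.
inline std::uint32_t elapsedCycles(std::uint32_t start, std::uint32_t end) {
    return static_cast<std::uint32_t>(end - start);
}

// Time one invocation of `work` (e.g. a pull() of one block) using a
// caller-supplied counter read function.
template <typename ReadCounter, typename Work>
std::uint32_t measureCycles(ReadCounter&& readCounter, Work&& work) {
    const std::uint32_t start = readCounter();
    work();
    const std::uint32_t end = readCounter();
    return elapsedCycles(start, end);
}

// Calibration factor: measured cycles per QEMU-counted instruction.
inline double cyclesPerInstruction(std::uint64_t cycles, std::uint64_t instructions) {
    assert(instructions > 0);
    return static_cast<double>(cycles) / static_cast<double>(instructions);
}

// Fraction of the real-time budget one block consumes: the core has
// cpuHz * framesPerBlock / sampleRateHz cycles available per block.
inline double budgetFraction(std::uint32_t cyclesPerBlock, double cpuHz,
                             double sampleRateHz, std::uint32_t framesPerBlock) {
    assert(sampleRateHz > 0.0 && framesPerBlock > 0);
    const double available = cpuHz * static_cast<double>(framesPerBlock) / sampleRateHz;
    return static_cast<double>(cyclesPerBlock) / available;
}
```

At 150 MHz and 48 kHz the budget is 3125 cycles per frame, so a 64-frame block has 200,000 cycles available; a `budgetFraction` result below 1.0 means the block fits on one core.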

## Setup 3 — two Pis over Ethernet

@@ -95,15 +98,13 @@ X ppm, N hours, zero discontinuities"). Then Setup 2, because
real-silicon CYCCNT numbers close the loop on everything the M33
emulation work predicted.

The code each setup needs:

- **Setup 1**: an ALSA duplex bridge example (two threads around
`push()`/`pull()`, telemetry logging to CSV, optional post-ASRC capture
to disk) plus a script to plot the ppm trace and analyze the captured
stream.
- **Setup 2**: a small Pico SDK firmware project wrapping the header-only
library — the M33 toolchain support already proves the code compiles
for that core (`cmake/arm-cortex-m33-mps2.cmake` shows the required
flags: `-mcpu=cortex-m33 -mthumb -mfloat-abi=hard`).
- **Setup 3**: two small programs (UDP sender, receiver-with-ASRC) reusing
the Setup 1 bridge's output half.
What exists and what remains:

- **Setup 1**: shipped — `examples/alsa_bridge.cpp` (see above). Still
missing: a small script to plot the `--csv` ppm trace and run the
notebook analysis over a `--dump` capture.
- **Setup 2**: shipped — `examples/pico2_cyccnt/` (cycle measurement) and
`examples/pico2_dualcore/` (dual-core deployment), both building
flashable UF2s; the measured numbers await a physical Pico 2.
- **Setup 3**: not yet written — two small programs (UDP sender,
receiver-with-ASRC) reusing the Setup 1 bridge's output half.
28 changes: 19 additions & 9 deletions docs/PERFORMANCE.md
@@ -11,10 +11,11 @@ the hot path follow it.
| Throughput | ns per output frame, steady-state `pull()`+`push()`, reported as ×realtime at 48 kHz | host (Google Benchmark) |
| Tail latency | p99/max per-call time for `pull(128)` over long runs — the RT budget lives in the tail, not the mean | host |
| Kernel cost | `srt::interpolate()` in isolation (≈ all datapath cycles: taps × channels MACs) | host |
| Embedded cost | **executed instructions** per output frame via QEMU TCG plugins — deterministic to the instruction, noise-free, well-correlated with real cost for scalar code | Hexagon (qemu-user), Cortex-M55 (qemu-system) |
| Embedded cost | **executed instructions** per output frame via QEMU TCG plugins — deterministic to the instruction, noise-free, well-correlated with real cost for scalar code | Hexagon (qemu-user), Cortex-M55 and Cortex-M33 (qemu-system) |

Cycle-accurate embedded numbers require vendor simulators (Hexagon SDK
simulator, Cadence xt-run) or hardware counters (DWT.CYCCNT on M55 silicon);
simulator, Cadence xt-run) or hardware counters (DWT.CYCCNT on M-class silicon —
`examples/pico2_cyccnt/` is a flashable RP2350 harness for exactly that);
the instruction metric is what CI can gate deterministically.

The benchmark matrix: sample type (float / Q15 / Q31) × filter preset
@@ -36,23 +37,26 @@ to the combinations that change the answer.

### Known hypotheses, in expected ROI order

1. **Per-channel blend redundancy**: `interpolate()` runs per channel with
1. **Per-channel blend redundancy** (done as C1; see status below):
`interpolate()` runs per channel with
the same μ, so the coefficient blend is recomputed per channel.
Precompute the blended row once per output frame (≤ 80 entries of
scratch), dot-product per channel. Roughly halves inner-loop work for
stereo; scales with channel count; makes the loop SIMD-friendlier.
2. **Auto-vectorization quality**: contiguity, aliasing, alignment of the
history window and coefficient rows. Verify, don't assume.
2. **Auto-vectorization quality** (done as C2; see status below):
contiguity, aliasing, alignment of the history window and coefficient
rows. Verify, don't assume.
3. **Fixed-point phase accumulator** (done as Q0.64; see status below).
Correction discovered while measuring: Cortex-M55's *scalar* FPU does
support FP64 (only MVE is fp16/fp32), so the M55 float path was never
soft-double-bound — Hexagon is the genuinely double-less target.
4. **Explicit SIMD kernels** — partially moot for M55: objdump confirms
GCC already auto-vectorizes the Q15/Q31 kernels with Helium at -O2
(the M55's ~4× Q15 advantage over the scalar M33 in the baselines is
MVE at work). Remaining candidates: packed SMLAD Q15 kernel for
M33/Pico-class parts (their binaries are nearly DSP-extension-free
today), NEON/AVX2 for hosts — only if budgets demand.
MVE at work). The packed dual-MAC Q15 kernel for M33/Pico-class parts
shipped as C4 (SMLALD; those binaries now carry it); the host float
channel axis shipped as C6. Remaining: NEON/AVX2 tap-axis work and
embedded channel-parallel (HVX/Helium) — only if budgets demand.
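
Hypothesis 1 (blend once, dot-product per channel) can be sketched as follows. This is a minimal illustration under stated assumptions: a planar per-channel history layout, float coefficients, and a linear blend between two adjacent phase-table rows. The function names are hypothetical and this is not the signature of `srt::interpolate()`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Blend two adjacent coefficient rows once per output frame:
// out[k] = a[k] + mu * (b[k] - a[k]).
inline void blendRow(const float* rowA, const float* rowB, float mu,
                     float* out, std::size_t taps) {
    for (std::size_t k = 0; k < taps; ++k) {
        out[k] = rowA[k] + mu * (rowB[k] - rowA[k]);
    }
}

// Dot product of one channel's history window with the blended row,
// accumulating in double as the float datapath is described to do.
inline double dotProduct(const float* history, const float* coeffs, std::size_t taps) {
    double acc = 0.0;
    for (std::size_t k = 0; k < taps; ++k) {
        acc += static_cast<double>(history[k]) * static_cast<double>(coeffs[k]);
    }
    return acc;
}

// One output frame: the blend runs once, then each channel only pays
// for a dot product. `scratch` is preallocated by the caller so the
// audio path stays allocation-free.
inline void interpolateFrame(const std::vector<const float*>& history,
                             const float* rowA, const float* rowB, float mu,
                             std::size_t taps, std::vector<float>& scratch,
                             float* out) {
    assert(scratch.size() >= taps);
    blendRow(rowA, rowB, mu, scratch.data(), taps);
    for (std::size_t ch = 0; ch < history.size(); ++ch) {
        out[ch] = static_cast<float>(dotProduct(history[ch], scratch.data(), taps));
    }
}
```

The per-channel loop no longer touches `mu` or the second row, which is where the expected saving for stereo comes from, and the remaining dot product is a plain contiguous loop.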

## "Done" criteria

@@ -75,7 +79,7 @@ baseline lands and revised deliberately. Stop when any of:

Mechanics: `bench/icount/` builds one fixed-workload binary per scenario
(no argv on bare metal); `tools/qemu_insn_plugin/` is the counting
plugin; `scripts/icount.py --target {m55,hexagon} --build-dir D --plugin
plugin; `scripts/icount.py --target {m55,m33,hexagon} --build-dir D --plugin
P [--update]` runs and compares; targets are m55, m33 (mps2-an505) and
hexagon. Counts are exact across runs (verified),
but they are a function of the **compiler version**: when the CI
@@ -105,6 +109,12 @@ table is already enforced by test thresholds.
the matching comment).
- **Tail-latency benchmark not implemented**: the Metrics table promises
p99/max per-call `pull(128)` timing; no benchmark measures it yet.
- **Hexagon static-musl cannot catch exceptions**: a constructor throw
terminates via libc++abi instead of propagating (discovered when the
first EXPECT_THROW test reached that leg; ConfigValidation is excluded
there). Deployment note: on this toolchain configuration, treat invalid
Config as fatal — validate inputs before constructing. Candidate fix:
link an unwinder (-unwindlib=libunwind) in cmake/hexagon-linux-musl.cmake.
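
The deployment note above (validate inputs before constructing) might look like the sketch below on a toolchain where a throw is fatal. `BridgeConfig`, `ConfigError` and `checkConfig` are hypothetical names. The fields mirror those set in `examples/alsa_bridge.cpp`, and the checks are illustrative; the constructor's actual validation rules may differ.

```cpp
#include <cstddef>

// Hypothetical stand-in for the library's Config, limited to fields
// that examples/alsa_bridge.cpp is seen to set.
struct BridgeConfig {
    double sampleRateHz = 48000.0;
    std::size_t channels = 2;
    std::size_t targetLatencyFrames = 48;
};

enum class ConfigError { Ok, BadSampleRate, BadChannelCount, BadLatency };

// Exception-free pre-check, run before the converter is constructed so
// an invalid Config is reported instead of reaching a throw that the
// static-musl Hexagon build cannot catch. Rules here are assumptions.
inline ConfigError checkConfig(const BridgeConfig& cfg) noexcept {
    if (!(cfg.sampleRateHz > 0.0)) {  // written this way to reject NaN too
        return ConfigError::BadSampleRate;
    }
    if (cfg.channels == 0) {
        return ConfigError::BadChannelCount;
    }
    if (cfg.targetLatencyFrames == 0) {
        return ConfigError::BadLatency;
    }
    return ConfigError::Ok;
}
```

A caller would construct the converter only when `checkConfig` returns `ConfigError::Ok` and otherwise report the error code through whatever channel the target provides.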

## Sequencing & status

5 changes: 5 additions & 0 deletions examples/alsa_bridge.cpp
@@ -229,6 +229,11 @@ int main(int argc, char** argv) {
cfg.sampleRateHz = static_cast<double>(args.rate);
cfg.channels = args.channels;
cfg.targetLatencyFrames = args.latency;
// Per the ServoConfig guidance: the unlock threshold must sit
// comfortably above half the transfer block, or block-quantized
// occupancy excursions can demote the servo stage spuriously.
cfg.servo.unlockThresholdFrames =
std::max(cfg.servo.unlockThresholdFrames, 1.5 * static_cast<double>(args.period));
srt::AsyncSampleRateConverter asrc(cfg);
std::printf("designed latency: %.2f ms%s\n", asrc.designedLatencySeconds() * 1e3,
args.toneHz > 0.0 ? " (tone mode: captured samples discarded)" : "");
3 changes: 2 additions & 1 deletion examples/pico2_cyccnt/CMakeLists.txt
@@ -42,7 +42,8 @@ set(CMAKE_CXX_STANDARD_REQUIRED ON)
pico_sdk_init()

# Float datapath measurement: soft-double accumulation on the M33, expected
# ~19x the Q15 instruction count — slow but a real number is still valuable.
# ~3.8x the Q15 instruction count (1,856.7M vs 484.1M baselines) — slow
# but a real number is still valuable.
option(PICO2_MEASURE_FLOAT "Measure the float (soft FP64) datapath too" ON)

add_executable(pico2_cyccnt main.cpp)
4 changes: 2 additions & 2 deletions notebooks/asrc_demo.ipynb
@@ -307,7 +307,7 @@
"id": "222bb3d0",
"metadata": {},
"source": [
"Roughly ten audible clicks per second, and an SNR in the 30s. This is\n",
"Roughly ten audible clicks per second, and an SNR around 29 dB. This is\n",
"what every \"just use a ring buffer\" design does at some rate, whether its\n",
"author knows it or not.\n",
"\n",
@@ -727,7 +727,7 @@
"\n",
"| What | Measured here |\n",
"|---|---|\n",
"| Naive FIFO at +200 ppm | clicks ~10×/s, SNR in the 30s dB |\n",
"| Naive FIFO at +200 ppm | clicks ~10×/s, SNR around 29 dB |\n",
"| SampleRateTap, same conditions | **SNR > 130 dB** — at the 24-bit noise floor |\n",
"| Lock from cold start | ~1 s |\n",
"| Latency | ≈ designed 1.5 ms, linear phase |\n",
2 changes: 1 addition & 1 deletion tests/test_asrc_quality.cpp
@@ -56,7 +56,7 @@ double measureSnrDb(const srt::FilterSpec& spec, double freqHz) {
return snr;
}

// Thresholds sit ~4 dB under measured performance (133/118/111/105 dB for
// Thresholds sit 4-7 dB under measured performance (135/120/113/106 dB for
// balanced at 997/6k/12k/19.5k; 133/108 dB for transparent). The residual at
// high frequencies is dominated by the linear interpolation between adjacent
// phase-table rows, which falls ~12 dB per doubling of numPhases and rises
4 changes: 2 additions & 2 deletions tests/test_fixed_point.cpp
@@ -100,10 +100,10 @@ TEST(FixedPoint, AsrcQualityQ15_997Hz) {
EXPECT_GT(measureSnrDb<std::int16_t>(997.0, 0.5), 73.0);
}
TEST(FixedPoint, AsrcQualityQ31_997Hz) {
EXPECT_GT(measureSnrDb<std::int32_t>(997.0, 0.5), 124.0);
EXPECT_GT(measureSnrDb<std::int32_t>(997.0, 0.5), 124.0); // measured ~133 dB
}
TEST(FixedPoint, AsrcQualityQ31_19_5kHz) {
EXPECT_GT(measureSnrDb<std::int32_t>(19500.0, 0.5), 96.0);
EXPECT_GT(measureSnrDb<std::int32_t>(19500.0, 0.5), 96.0); // measured ~105 dB
}

TEST(FixedPoint, FullScaleSineDoesNotWrapQ15) {