Latency Improvement Experimentation #238

Open. BuffMcBigHuge wants to merge 1 commit into main from marco/feat/latency

@BuffMcBigHuge

DEMON end-to-end latency: review, measurements, changes

This work was done in collaboration with Claude Fable Extra High

This document is the result of a full latency review of the DEMON
streaming stack — from a knob turn in the web demo (or any client) to
the audio that comes out — including the TRT inference layer. It
covers:

  1. The knob-to-ear latency path and every contributor we control.
  2. The e2e test strategy (tests/e2e_latency/, runnable outside the
    demo app).
  3. Measured results before/after on this build's hardware.
  4. The changes made, each with its latency math and caveats.
  5. Validation: bit-exactness probes and regression coverage.
  6. Further proposals not implemented, with the risks that kept them
    out of this pass.

Hardware/context for all numbers: RTX 4090 (shared with other
processes, ~7 GB free VRAM), TRT decoder decoder_mixed_refit_b8_60s
(legacy pre-spectral build), depth 4, steps 8, 30 s source, ODE mode
unless stated.


1. The knob-to-ear latency path

A knob change traverses, in order:

| Stage | Typical cost | Owner / control |
| --- | --- | --- |
| Client knob smoothing | 0–smoothMs | web UI (useParamSync); param send ~125 Hz |
| WS transport | ~0–2 ms (LAN) | TCP_NODELAY already set |
| set_knobs → next runner iteration | 0–1 tick | PipelineRunner loop, no inter-tick sleeps |
| Shared-curve params (SDE strength, velocity, x0 strength…) | 1 tick | set_shared_curve overrides all in-flight slots next tick |
| Per-slot params (ODE denoise, seed) | queue wait + steps ticks | submit queue + ring buffer drain |
| Finished-latent delivery | 0 ticks (was 1) | StreamPipeline.tick() |
| VAE windowed decode + crossfade + emit | ~8 ms decode + render | vae_window=0.4 s, TRT VAE |
| Playback lead (buffer between write head and playhead) | 0.12–1.35 s adaptive | PipelineRunner lead controller |
| Client AudioWorklet buffer | ~10–30 ms | web audio path |

Two structural facts dominate everything else:

  • Parameter class decides engine latency. Shared-curve parameters
    reach the output in one tick (~36 ms on TRT). Per-slot parameters
    (denoise in ODE mode, seed) only land on newly submitted
    requests, which must wait in the submit queue and then run their
    full steps schedule. The demo's "strength" knob is per-slot in ODE
    mode and shared-curve in SDE mode — SDE mode is an order of
    magnitude more responsive by construction.
  • The playback lead is the dominant audible term. The engine can
    converge in ~300 ms while the listener still waits for the playhead
    to reach the rewritten region. The adaptive lead controller
    (PipelineRunner) floors near 0.2 s on a healthy GPU; everything
    the engine gains shortens the fresh-generation horizon that the
    lead is measured against.

2. Test strategy

New suite: tests/e2e_latency/ — engine-level and full-stack
measurements outside the demo app, black-box at public API
boundaries so the same measurements stay valid across internal
refactors.

# engine-level: knob -> finished latent (ticks + ms), determinism guard
.venv/Scripts/python.exe -m pytest tests/e2e_latency/test_knob_to_latent.py -v -s

# full stack: headless StreamingSession, simulated client heartbeat
.venv/Scripts/python.exe -m pytest tests/e2e_latency/test_streaming_session.py -v -s

Environment: DEMON_E2E_ACCEL=tensorrt|eager|compile (default:
tensorrt when engines exist), DEMON_E2E_GPU (default: most free
VRAM), DEMON_E2E_DEPTH / DEMON_E2E_STEPS. Reports land in
runs/latency-reports/e2e-*.json for build-to-build diffing —
treat the diff, not the pass/fail, as the output; assertions are
coarse architectural ceilings only.

Measurement design:

  • Functional change detection. With a fixed seed and constant
    knobs the pipeline reaches a steady state where every finished
    latent is bit-identical. After a knob flip, the first differing
    finished latent marks first change; the first repeating new latent
    marks converged. No pipeline internals are inspected.
  • Full-stack timing subscribes to AudioReady events from a
    headless StreamingSession and feeds set_knobs with an advancing
    playback_pos heartbeat, mirroring the web client. It reports
    first-write, write gaps, measured lead, tick/decode percentiles,
    generation rate, and knob→fresh-generation latency (first write
    whose window was generated entirely after the knob).
  • Determinism guard. Two fresh streams with identical knobs must
    produce bit-identical steady-state latents
    (test_seed_determinism_streaming).
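
The functional change detection described above can be sketched in a few lines. This is a minimal illustration, not the suite's actual code: `digest` and `detect_change` are hypothetical names, latents are represented as raw bytes, and "converged" is taken here to mean the first tick on which a changed latent repeats its predecessor.

```python
import hashlib
from typing import Optional, Sequence, Tuple


def digest(latent_bytes: bytes) -> str:
    """Hash a finished latent's raw bytes so bit-identical latents compare equal."""
    return hashlib.sha256(latent_bytes).hexdigest()


def detect_change(
    digests: Sequence[str], flip_index: int
) -> Tuple[Optional[int], Optional[int]]:
    """Return (first_change, converged) as tick offsets from the knob flip.

    digests[i] is the digest of the latent finished on tick i. The stream is
    assumed to be in steady state just before flip_index (flip_index >= 1).
    """
    baseline = digests[flip_index - 1]
    first_change = None
    converged = None
    for i in range(flip_index, len(digests)):
        changed = digests[i] != baseline
        if first_change is None and changed:
            first_change = i - flip_index
        # Converged: a changed latent that is identical to the previous one.
        if first_change is not None and changed and digests[i] == digests[i - 1]:
            converged = i - flip_index
            break
    return first_change, converged
```

Because only finished latents are compared, the same check keeps working if the pipeline's internals are refactored.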

The conftest patches the canonical TRT profile table to the legacy
(pre-spectral) decoder engine names when the spectral builds are
missing on the machine — test-process only.

The pre-existing golden harness (tests/golden/) remains the wire-level
regression net; it was not run here (refs are captured on an RTX 5090
and require calibration on other cards — see its README).


3. Measured results

Engine level (TRT, depth 4, steps 8)

| Metric | Before | After | Δ |
| --- | --- | --- | --- |
| tick p50 | 36.7 ms | 35.7 ms | −3 % |
| denoise knob first change | 13 ticks / 486 ms | 8 ticks / 282 ms | −42 % |
| denoise knob converged | 14 ticks / 522 ms | 13 ticks / 459 ms | −12 % |
| shared curve first change | 2 ticks / 74 ms | 1 tick / 37 ms | −50 % |
| shared curve converged | 10 ticks / 366 ms | 9 ticks / 321 ms | −12 % |

Eager backend (same machine): tick p50 130 → 115 ms (−12 %, the CPU-side
wins matter more without TRT); denoise first change 14 → 8 ticks.

Full stack (headless StreamingSession, TRT)

| Metric | Before | After | Δ |
| --- | --- | --- | --- |
| knob → fresh generation | 406 ms | 172 ms | −58 % |
| first write after start | 0.50 s | 0.20–0.33 s | −34…60 % |
| write gap p50 | 63 ms | 63 ms | |
| measured lead p50 | 0.197 s | 0.198 s | |
| generations/s | 7.14 | 7.29 | +2 % |

(knob_to_next_write is a single-sample phase measurement and jitters
between 16–47 ms; not meaningful at this resolution.)


4. Changes made

4.1 Submit-queue cap = 1 in streaming mode

StreamPipeline queued up to depth requests; streaming callers
submit a fresh request every tick, so a knob change sat behind up to
depth−1 stale requests, each costing ~steps/depth ticks before a
retiring slot picked it up. StreamPipeline now takes
queue_cap (default: historical depth), and StreamDenoise
constructs its pipeline with queue_cap=1 — a retiring slot is always
refilled with the freshest parameters.

  • Latency math: removes ~(depth−1)·steps/depth ticks of queue wait
    from every per-slot parameter (measured: −5 ticks at d4/s8).
  • Throughput: unchanged — submissions arrive every tick, so the queue
    can never starve.
  • Caveats: callers that batch-submit distinct requests faster than
    ticks would lose intermediates — no such caller exists
    (StreamDenoise is the only production submit site; drain mode
    submits exactly one). Walk-window chunk selection is playhead-driven
    per submission, so latest-wins is the correct semantic there too.
    Direct StreamPipeline users (calibration script, parity tests)
    keep the old default.
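
The queue-wait arithmetic can be checked with a toy model. Everything in it is an assumption made for illustration: `pickup_age` is a hypothetical helper, slots are staggered so one retires every steps/depth ticks, the queue is FIFO, and a full queue rejects the newest submission. The real StreamPipeline overflow policy is not stated here and may differ.

```python
from collections import deque


def pickup_age(queue_cap: int, depth: int, steps: int, ticks: int = 200) -> int:
    """Steady-state age (in ticks) of the request a retiring slot picks up.

    Toy model: the caller submits one request per tick, tagged with its
    submit tick; one slot retires every steps // depth ticks and pops the
    oldest queued request.
    """
    retire_every = steps // depth
    queue = deque()
    age = 0
    for tick in range(ticks):
        # Assumed overflow policy: a full queue rejects the new submission.
        if len(queue) < queue_cap:
            queue.append(tick)
        if tick % retire_every == 0 and queue:
            age = tick - queue.popleft()
    return age


# At depth 4 / steps 8 the cap change removes (depth - 1) * steps / depth ticks.
saved = pickup_age(4, 4, 8) - pickup_age(1, 4, 8)
print(saved)
```

In this model the saving matches the ~(depth−1)·steps/depth estimate (6 ticks at d4/s8), slightly above the 5 ticks measured; a latest-wins overwrite at cap 1 would shorten the remaining wait further.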

4.2 Same-tick delivery of finished latents

tick() previously parked a slot that reached its final step until
the next tick's scan returned it. Newly finished slots are now
delivered at the end of the tick that finished them, which removes
one full tick (~36 ms TRT / ~115 ms eager) from every generation.
The one-result-per-tick contract is preserved: a pre-tick scan still
drains leftovers, one per tick, when several slots finish together,
which only happens when depth ≥ steps.

Caveat: inter-tick observers see slightly different state
(active_slots drops one tick earlier; is_warmed_up is stats-only).
No control flow in the repo depends on the old phase.
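
The one-tick difference can be seen in a single-slot toy loop. This is only a sketch of the phase change, with a hypothetical `delivery_tick` helper; it does not model multiple slots or the real tick() internals.

```python
def delivery_tick(steps: int, same_tick: bool) -> int:
    """1-based tick on which a request submitted before tick 1 is returned."""
    step = 0
    parked = False
    tick = 0
    while True:
        tick += 1
        # Pre-tick scan: return a slot that finished on an earlier tick.
        if parked:
            return tick
        step += 1  # advance the slot by one denoise step
        if step == steps:
            if same_tick:
                return tick  # new behavior: deliver at the end of this tick
            parked = True  # old behavior: wait for the next tick's scan
```

With steps=8 the old phase returns on tick 9 and the new one on tick 8.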

4.3 Removed host syncs after TRT enqueues (decoder + VAE decode)

_trt_forward ended with self._trt_stream.synchronize();
_trt_vae_decode did the same. The polygraphy stream is created with
cudaStreamCreate — a blocking stream with respect to the legacy
default stream PyTorch uses — so every subsequent torch op is already
implicitly ordered after the TRT execution on the GPU. The explicit
syncs only blocked the host. With them gone, the CPU enqueues the
integration math / emission prep while the engine is still executing.

  • Caveats: _last_tick_ms now measures mostly enqueue time; per-loop
    wall timings in the runner stay accurate because emission's .cpu()
    is a natural sync point, but sub-component attribution shifts
    toward whatever op forces the sync. If PyTorch is ever moved off
    the legacy default stream (per-thread default streams, explicit
    torch.cuda.Stream contexts around the tick), the implicit
    ordering assumption must be revisited — the right fix then is CUDA
    events, not host syncs. One-shot paths (diffusion.py,
    trt/runtime.py TRTDecoder, VAE encode) keep their syncs — they
    are cold paths and not worth the risk surface.

4.4 Shared-curve hot-path caching

  • set_shared_curve now skips re-normalization when the same scalar
    value or tensor object is pushed again (the streaming backend pushes
    the full knob state every tick).
  • The device/dtype cast of shared curves is cached
    (_shared_curves_dev), eliminating a per-slot-per-tick H2D copy for
    CPU-resident curves (SDE denoise curve at T=750 was 4 uploads/tick).
  • ace_backend rebuilds the knob-driven SDE curve only when
    (amplitude, periodicity, src_T) actually move, and reuses the same
    tensor object so the setter's skip fires.
  • The x0_target_strength "is it active" gate is now a host-side flag
    computed once at the setter / slot init instead of a
    per-slot-per-step tensor.any().item() fence.

Caveat: in-place mutation of a tensor passed to set_shared_curve
followed by re-setting the same object is now a no-op; all callers
build fresh tensors when values change.
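
The caveat is easiest to see in a small stand-in. The class below is hypothetical: it uses Python lists in place of tensors, a placeholder in place of the real normalization, and it omits the scalar-value case. It shows only the identity-based skip and why in-place mutation goes unnoticed.

```python
class SharedCurveStore:
    """Toy setter that skips work when the same object is pushed again."""

    def __init__(self):
        self._last_obj = None
        self.curve = None
        self.normalize_calls = 0

    def set_shared_curve(self, curve):
        # Identity check, not equality: cheap, but blind to in-place mutation.
        if curve is self._last_obj:
            return
        self._last_obj = curve
        self.curve = [float(v) for v in curve]  # placeholder "normalization"
        self.normalize_calls += 1


store = SharedCurveStore()
c = [0.1, 0.2]
store.set_shared_curve(c)
store.set_shared_curve(c)        # skipped: same object
c[0] = 0.9
store.set_shared_curve(c)        # still skipped: mutation is not detected
store.set_shared_curve(list(c))  # fresh object: the new values land
```

Callers that build a fresh object whenever values change, as the text says all current callers do, never hit the skipped-mutation case.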

4.5 Micro-opts in the TRT forward

  • Timestep rows staged on CPU and shipped in one H2D copy instead of
    B per-element writes.
  • The steering buffer zero-fill is skipped when no steering configs
    are active and the buffer is already clean (dirty flag per cache
    entry); buffers are allocated zeroed.

4.6 Schedule cache LRU bound

_schedule_cache (denoise → CPU schedule tensor) grew unboundedly —
every distinct float from a swept knob is a key. Now an LRU capped at
256 entries. Values are not quantized: quantizing would change
schedules and break bit-exactness for nearby denoise values.
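
A bounded cache of this shape can be sketched with collections.OrderedDict. The class and its `build` callback are hypothetical; the actual _schedule_cache may be structured differently. Keys are the exact float values, with no rounding.

```python
from collections import OrderedDict
from typing import Callable, List


class ScheduleCache:
    """LRU cache keyed by the exact denoise float (no quantization)."""

    def __init__(self, build: Callable[[float], List[float]], max_entries: int = 256):
        self._build = build
        self._max = max_entries
        self._entries: "OrderedDict[float, List[float]]" = OrderedDict()

    def get(self, denoise: float) -> List[float]:
        if denoise in self._entries:
            self._entries.move_to_end(denoise)  # mark as most recently used
            return self._entries[denoise]
        value = self._build(denoise)
        self._entries[denoise] = value
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict the least recently used
        return value

    def __contains__(self, denoise: float) -> bool:
        return denoise in self._entries

    def __len__(self) -> int:
        return len(self._entries)
```

A swept knob that produces a thousand distinct floats leaves at most 256 schedules resident, and a value that falls out of the cache is simply rebuilt on its next use.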


5. Validation (bit-exactness and regressions)

  • Cross-build bit-exactness probe (test_output/bitexact_probe.py):
    steady-state finished-latent SHA256 under (a) plain ODE, (b) shared
    velocity curve, (c) shared x0_target_strength + target latent, with
    a seeded VAE encode. Identical hashes on the baseline build (changes
    stashed) and the modified build:
    d5148851… / 095f3d0c… / baccf5ce…. The math is byte-identical;
    the changes alter only when parameters land and when latents are
    delivered.
  • Unit suite: 149 passed (4 skipped: checkpoints not on disk;
    test_deck_mix.py / test_stem_source_mode_gating.py are stale
    uncommitted files from the deck branch and fail to import on main
    with or without these changes).
  • Adapter parity rail (test_ace_adapter_parity.py): passes.
  • Streaming determinism (test_seed_determinism_streaming):
    passes on TRT and eager.
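
The idea behind a hash-based bit-exactness check, sketched with plain float32 values standing in for latent tensors. `latent_sha256` is a hypothetical helper and is not taken from bitexact_probe.py.

```python
import hashlib
import struct
from typing import Sequence


def latent_sha256(values: Sequence[float]) -> str:
    """SHA256 over the little-endian float32 bytes of a latent."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return hashlib.sha256(raw).hexdigest()


baseline = [0.5, 0.25, -1.0]
same = list(baseline)
# 2**-20 is exactly representable next to 0.5 in float32, so this is a real change.
nudged = [0.5 + 2.0 ** -20, 0.25, -1.0]

print(latent_sha256(baseline) == latent_sha256(same))
print(latent_sha256(baseline) == latent_sha256(nudged))
```

Any difference in the stored bytes changes the digest, so matching digests across two builds indicate byte-identical latents rather than merely close ones.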

What is intentionally not identical to the old build: wire-level
streaming audio. Slots now run fresher requests (queue cap) and
deliver a tick earlier, so the windows written at a given wall time
differ — that is the improvement, not a regression. The golden
harness's tier-2 tolerance comparison is the right tool to confirm
perceptual equivalence; its refs need one-time calibration on this
card class (see tests/golden/README.md).


6. Further proposals (with caveats)

Ordered by expected payoff:

  1. Prefer SDE mode (or shared-curve denoise) for the strength knob.
    The single biggest knob-latency lever costs no engine work:
    per-slot ODE denoise is structurally steps ticks slower than the
    shared-curve path. An ODE-mode shared "denoise-like" control (e.g.
    mapping strength onto a velocity/noise curve) would make ODE feel
    like SDE. Caveat: changing a slot's t-schedule mid-flight is not
    mathematically equivalent to truncating it; this needs design, not
    just plumbing.
  2. Lead floor tuning. Measured lead p50 sits at ~0.2 s; the floor
    and release time-constant are env-tunable. Lowering the floor cuts
    audible latency 1:1, at the cost of underrun margin on slow ticks
    (LoRA refit stalls, profile swaps). Worth a guarded experiment with
    the gap-fill telemetry watched.
  3. CUDA Graphs over the TRT forward + integration. The tick is
    now enqueue-bound; capturing the steady-state tick as a graph would
    collapse launch overhead (~tens of µs × dozens of kernels). Caveats:
    shape changes (T transitions, CFG row counts), TRT execute_async_v3
    inside a torch graph capture is fragile, and refit invalidates
    captures — a large, risky change for maybe 10–20 % of tick time.
  4. Event-based timing. Replace wall-clock tick_ms/dec_ms with
    CUDA event pairs so the runner's pacing logic sees true GPU cost
    without forcing syncs. Low risk; do it when timing fidelity starts
    driving decisions (e.g. for the lead controller).
  5. Pinned staging for per-tick H2D (timesteps, CPU noise, CPU
    curves). The single-copy timestep fill landed; promoting the rest
    to pinned+async needs double-buffering against in-flight reuse.
  6. Noise caching per integer seed. _make_noise reseeds the
    global RNG and generates on CPU every submission (~1 ms at T=750).
    Caching (seed, T, D) → noise would also stop the global RNG
    side-effects. Caveat: SDE re-noise draws from the same global RNG
    stream, so removing the per-tick manual_seed changes SDE output
    — must be gated to ODE or paired with a dedicated generator, and
    golden-verified.
  7. Spectral engine rebuild. This machine runs legacy engines
    without the steering binding. Rebuilding (python -m acestep.engine.trt.build --all) restores steering and picks up
    current TRT kernel improvements. FP8 (fp8_mixed) decoder engines
    cut forward time further on Ada+; needs the calibration flow from
    docs/TRT.md and a listening pass.
  8. Walk-window chunk pre-encoding and conditioning re-encode
    avoidance on prompt edits: both are stall (p99) reducers rather
    than steady-state wins; the prompt path already caches by text.

Signed-off-by: BuffMcBigHuge <marco@bymar.co>
BuffMcBigHuge marked this pull request as ready for review June 10, 2026 02:56
leszko requested review from j0sh and ryanontheinside June 10, 2026 06:36

leszko left a comment

LGTM
