Latency Improvement Experimentation by BuffMcBigHuge · Pull Request #238 · daydreamlive/DEMON

BuffMcBigHuge · 2026-06-10T00:25:45Z

DEMON end-to-end latency: review, measurements, changes

This work was done in collaboration with Claude Fable Extra High

This document is the result of a full latency review of the DEMON
streaming stack — from a knob turn in the web demo (or any client) to
the audio that comes out — including the TRT inference layer. It
covers:

1. The knob-to-ear latency path and every contributor we control.
2. The e2e test strategy (tests/e2e_latency/, runnable outside the
   demo app).
3. Measured results before/after on this build's hardware.
4. The changes made, each with its latency math and caveats.
5. Validation: bit-exactness probes and regression coverage.
6. Further proposals not implemented, with the risks that kept them
   out of this pass.

Hardware/context for all numbers: RTX 4090 (shared with other
processes, ~7 GB free VRAM), TRT decoder decoder_mixed_refit_b8_60s
(legacy pre-spectral build), depth 4, steps 8, 30 s source, ODE mode
unless stated.

1. The knob-to-ear latency path

A knob change traverses, in order:

| Stage | Typical cost | Owner / control |
|---|---|---|
| Client knob smoothing | 0–`smoothMs` | web UI (`useParamSync`); param send ~125 Hz |
| WS transport | ~0–2 ms (LAN) | TCP_NODELAY already set |
| `set_knobs` → next runner iteration | 0–1 tick | `PipelineRunner` loop, no inter-tick sleeps |
| Shared-curve params (SDE strength, velocity, x0 strength…) | 1 tick | `set_shared_curve` overrides all in-flight slots next tick |
| Per-slot params (ODE denoise, seed) | queue wait + `steps` ticks | submit queue + ring buffer drain |
| Finished-latent delivery | 0 ticks (was 1) | `StreamPipeline.tick()` |
| VAE windowed decode + crossfade + emit | ~8 ms decode + render | `vae_window=0.4 s`, TRT VAE |
| Playback lead (buffer between write head and playhead) | 0.12–1.35 s adaptive | `PipelineRunner` lead controller |
| Client AudioWorklet buffer | ~10–30 ms | web audio path |

Two structural facts dominate everything else:

1. Parameter class decides engine latency. Shared-curve parameters
   reach the output in one tick (~36 ms on TRT). Per-slot parameters
   (denoise in ODE mode, seed) only land on newly submitted
   requests, which must wait in the submit queue and then run their
   full steps schedule. The demo's "strength" knob is per-slot in ODE
   mode and shared-curve in SDE mode, so SDE mode is an order of
   magnitude more responsive by construction.
2. The playback lead is the dominant audible term. The engine can
   converge in ~300 ms while the listener still waits for the playhead
   to reach the rewritten region. The adaptive lead controller
   (PipelineRunner) floors near 0.2 s on a healthy GPU; everything
   the engine gains shortens the fresh-generation horizon that the
   lead is measured against.
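The first point can be put into numbers with a small sketch. It is a toy model built only from the table above (1 tick for shared-curve parameters, queue wait plus `steps` ticks for per-slot ones); `engine_latency_ticks` is a hypothetical helper, not a function in the repo, and the 36 ms tick is the rounded TRT figure quoted above.

```python
def engine_latency_ticks(param_class: str, steps: int, queue_wait_ticks: int = 0) -> int:
    """Ticks from a knob change to the first finished latent that reflects it.

    Toy model of the stage table: shared-curve parameters override in-flight
    slots on the next tick, per-slot parameters wait in the submit queue and
    then run their full schedule.
    """
    if param_class == "shared":
        return 1
    if param_class == "per_slot":
        return queue_wait_ticks + steps
    raise ValueError(f"unknown parameter class: {param_class!r}")


TICK_MS = 36.0  # assumed: rounded TRT tick time from this document

shared_ms = engine_latency_ticks("shared", steps=8) * TICK_MS
per_slot_ms = engine_latency_ticks("per_slot", steps=8) * TICK_MS
per_slot_queued_ticks = engine_latency_ticks("per_slot", steps=8, queue_wait_ticks=5)

print(f"shared curve: {shared_ms:.0f} ms, per-slot (empty queue): {per_slot_ms:.0f} ms")
print(f"per-slot with 5 ticks of queue wait: {per_slot_queued_ticks} ticks")
```

With steps 8 the model gives 8 ticks for a per-slot knob and 13 ticks when 5 ticks of queue wait are added, which lines up with the after/before `denoise` first-change figures reported in section 3.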

2. Test strategy

New suite: tests/e2e_latency/ — engine-level and full-stack
measurements outside the demo app, black-box at public API
boundaries so the same measurements stay valid across internal
refactors.

# engine-level: knob -> finished latent (ticks + ms), determinism guard
.venv/Scripts/python.exe -m pytest tests/e2e_latency/test_knob_to_latent.py -v -s

# full stack: headless StreamingSession, simulated client heartbeat
.venv/Scripts/python.exe -m pytest tests/e2e_latency/test_streaming_session.py -v -s

Environment: DEMON_E2E_ACCEL=tensorrt|eager|compile (default:
tensorrt when engines exist), DEMON_E2E_GPU (default: most free
VRAM), DEMON_E2E_DEPTH / DEMON_E2E_STEPS. Reports land in
runs/latency-reports/e2e-*.json for build-to-build diffing —
treat the diff, not the pass/fail, as the output; assertions are
coarse architectural ceilings only.

Measurement design:

- Functional change detection. With a fixed seed and constant
  knobs the pipeline reaches a steady state where every finished
  latent is bit-identical. After a knob flip, the first differing
  finished latent marks first change; the first repeating new latent
  marks converged. No pipeline internals are inspected.
- Full-stack timing subscribes to AudioReady events from a
  headless StreamingSession and feeds set_knobs with an advancing
  playback_pos heartbeat, mirroring the web client. It reports
  first-write, write gaps, measured lead, tick/decode percentiles,
  generation rate, and knob→fresh-generation latency (first write
  whose window was generated entirely after the knob).
- Determinism guard. Two fresh streams with identical knobs must
  produce bit-identical steady-state latents
  (test_seed_determinism_streaming).
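The functional change detection described above can be sketched on a plain sequence of per-tick latent fingerprints. This is an illustration under assumptions, not the suite's code: `detect_change` is a hypothetical name, the fingerprints stand in for whatever the tests compare, and offsets are counted from the tick of the knob flip (the suite may count from one rather than zero).

```python
from typing import Optional, Sequence, Tuple


def detect_change(hashes: Sequence[str], flip_index: int) -> Tuple[Optional[int], Optional[int]]:
    """Return (first_change, converged) as tick offsets from the knob flip.

    hashes[i] is a fingerprint of the finished latent delivered on tick i.
    flip_index must be >= 1 so that a steady-state baseline exists.
    """
    baseline = hashes[flip_index - 1]  # steady-state latent before the flip
    first_change: Optional[int] = None
    converged: Optional[int] = None
    for i in range(flip_index, len(hashes)):
        if first_change is None and hashes[i] != baseline:
            first_change = i - flip_index  # first latent that differs
        if first_change is not None and hashes[i] != baseline and hashes[i] == hashes[i - 1]:
            converged = i - flip_index  # first new latent that repeats
            break
    return first_change, converged


# Steady state "a", knob flipped at tick 2, transients "b" and "c", new steady state "d".
trace = ["a", "a", "a", "a", "b", "c", "d", "d", "d"]
print(detect_change(trace, flip_index=2))
```

Because only delivered latents are compared, the same detector keeps working if the pipeline internals are refactored, which is the point of measuring at the public boundary.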

The conftest patches the canonical TRT profile table to the legacy
(pre-spectral) decoder engine names when the spectral builds are
missing on the machine — test-process only.

The pre-existing golden harness (tests/golden/) remains the wire-level
regression net; it was not run here (refs are captured on an RTX 5090
and require calibration on other cards — see its README).

3. Measured results

Engine level (TRT, depth 4, steps 8)

| Metric | Before | After | Δ |
|---|---|---|---|
| tick p50 | 36.7 ms | 35.7 ms | −3 % |
| `denoise` knob first change | 13 ticks / 486 ms | 8 ticks / 282 ms | −42 % |
| `denoise` knob converged | 14 ticks / 522 ms | 13 ticks / 459 ms | −12 % |
| shared curve first change | 2 ticks / 74 ms | 1 tick / 37 ms | −50 % |
| shared curve converged | 10 ticks / 366 ms | 9 ticks / 321 ms | −12 % |

Eager backend (same machine): tick p50 130 → 115 ms (−12 %, the CPU-side
wins matter more without TRT); denoise first change 14 → 8 ticks.

Full stack (headless StreamingSession, TRT)

| Metric | Before | After | Δ |
|---|---|---|---|
| knob → fresh generation | 406 ms | 172 ms | −58 % |
| first write after start | 0.50 s | 0.20–0.33 s | −34 to −60 % |
| write gap p50 | 63 ms | 63 ms | unchanged |
| measured lead p50 | 0.197 s | 0.198 s | ≈ 0 % |
| generations/s | 7.14 | 7.29 | +2 % |

(knob_to_next_write is a single-sample phase measurement that jitters
between 16 and 47 ms, so it is not meaningful at this resolution.)
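For build-to-build diffing, the Δ column can be recomputed from the before/after values. The snippet below is a small arithmetic check using the millisecond and rate figures from the two tables above; `pct_delta` is a hypothetical helper, and rounding to whole percent is an assumption about how the column was produced.

```python
def pct_delta(before: float, after: float) -> int:
    """Relative change from before to after, rounded to whole percent."""
    return round((after - before) / before * 100)


# (before, after) pairs copied from the tables in this section.
measurements = {
    "tick p50 (ms)": (36.7, 35.7),
    "denoise first change (ms)": (486, 282),
    "denoise converged (ms)": (522, 459),
    "shared curve first change (ms)": (74, 37),
    "shared curve converged (ms)": (366, 321),
    "knob -> fresh generation (ms)": (406, 172),
    "generations/s": (7.14, 7.29),
}

deltas = {name: pct_delta(b, a) for name, (b, a) in measurements.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+d} %")
```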

4. Changes made

4.1 Submit-queue cap = 1 in streaming mode

StreamPipeline queued up to depth requests; streaming callers
submit a fresh request every tick, so a knob change sat behind up to
depth−1 stale requests, each costing ~steps/depth ticks before a
retiring slot picked it up. StreamPipeline now takes
queue_cap (default: historical depth), and StreamDenoise
constructs its pipeline with queue_cap=1 — a retiring slot is always
refilled with the freshest parameters.

- Latency math: removes ~(depth−1)·steps/depth ticks of queue wait
  from every per-slot parameter (measured: −5 ticks at d4/s8).
- Throughput: unchanged. Submissions arrive every tick, so the queue
  can never starve.
- Caveats: callers that batch-submit distinct requests faster than
  ticks would lose intermediates, but no such caller exists
  (StreamDenoise is the only production submit site; drain mode
  submits exactly one). Walk-window chunk selection is playhead-driven
  per submission, so latest-wins is the correct semantic there too.
  Direct StreamPipeline users (calibration script, parity tests)
  keep the old default.
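A toy model makes the queue-wait term concrete. It is a sketch under stated assumptions, not the `StreamPipeline` implementation: the queue is modelled as a `collections.deque` that drops its oldest entry on overflow (the real overflow policy is not described here), one request is submitted per tick, and a slot retires every `steps/depth` ticks. `queue_wait_ticks` and `popped_age` are hypothetical helpers.

```python
from collections import deque


def queue_wait_ticks(depth: int, steps: int) -> float:
    """Estimated queue wait removed by queue_cap=1: (depth - 1) * steps / depth."""
    return (depth - 1) * steps / depth


def popped_age(cap: int, ticks: int = 20, retire_every: int = 2) -> int:
    """Age in ticks of the request a retiring slot picks up, in steady state.

    Each tick submits one request stamped with its tick number; every
    `retire_every` ticks a slot retires and takes the oldest queued request.
    """
    queue: deque = deque(maxlen=cap)  # assumption: overflow drops the oldest entry
    age = 0
    for tick in range(ticks):
        queue.append(tick)
        if tick % retire_every == 0:
            age = tick - queue.popleft()
    return age


print("estimate at d4/s8:", queue_wait_ticks(depth=4, steps=8), "ticks")
print("staleness with cap=4:", popped_age(cap=4), "ticks")
print("staleness with cap=1:", popped_age(cap=1), "ticks")
```

With the cap at 1 the retiring slot always receives the request submitted on the current tick, which is the latest-wins behaviour described above. The formula estimates about 6 ticks at d4/s8, in the same range as the measured 5.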

4.2 Same-tick delivery of finished latents

tick() previously parked a slot that reached its final step until
the next tick's scan returned it. Newly finished slots are now
delivered at the end of the tick that finished them. One full tick
(~36 ms TRT / ~115 ms eager) shaved off every generation,
permanently. The one-result-per-tick contract is preserved (a pre-tick
scan still drains leftovers when several slots finish together, one
per tick, which only happens when depth ≥ steps).

Caveat: inter-tick observers see slightly different state
(active_slots drops one tick earlier; is_warmed_up is stats-only).
No control flow in the repo depends on the old phase.
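The phase shift can be illustrated with a single-slot toy loop. This is a minimal sketch of the delivery timing only, with a hypothetical `first_delivery_tick` helper; the real `tick()` also handles multiple slots and the pre-tick scan for leftovers.

```python
def first_delivery_tick(steps: int, same_tick_delivery: bool, max_ticks: int = 64) -> int:
    """Tick (1-based) on which a freshly submitted slot's latent is returned."""
    steps_done = 0
    parked = False  # finished, waiting for the next tick's scan (old behaviour)
    for tick in range(1, max_ticks + 1):
        if parked:
            return tick  # old path: the next tick's scan returns the parked slot
        steps_done += 1
        if steps_done == steps:
            if same_tick_delivery:
                return tick  # new path: deliver at the end of the finishing tick
            parked = True
    raise RuntimeError("slot did not finish within max_ticks")


old_tick = first_delivery_tick(steps=8, same_tick_delivery=False)
new_tick = first_delivery_tick(steps=8, same_tick_delivery=True)
print(f"old: tick {old_tick}, new: tick {new_tick}, saved: {old_tick - new_tick} tick")
```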

4.3 Removed host syncs after TRT enqueues (decoder + VAE decode)

_trt_forward ended with self._trt_stream.synchronize();
_trt_vae_decode did the same. The polygraphy stream is created with
cudaStreamCreate — a blocking stream with respect to the legacy
default stream PyTorch uses — so every subsequent torch op is already
implicitly ordered after the TRT execution on the GPU. The explicit
syncs only blocked the host. With them gone, the CPU enqueues the
integration math / emission prep while the engine is still executing.

Caveats: _last_tick_ms now measures mostly enqueue time; per-loop
wall timings in the runner stay accurate because emission's .cpu()
is a natural sync point, but sub-component attribution shifts
toward whatever op forces the sync. If PyTorch is ever moved off
the legacy default stream (per-thread default streams, explicit
torch.cuda.Stream contexts around the tick), the implicit
ordering assumption must be revisited — the right fix then is CUDA
events, not host syncs. One-shot paths (diffusion.py,
trt/runtime.py TRTDecoder, VAE encode) keep their syncs — they
are cold paths and not worth the risk surface.

4.4 Shared-curve hot-path caching

- set_shared_curve now skips re-normalization when the same scalar
  value or tensor object is pushed again (the streaming backend pushes
  the full knob state every tick).
- The device/dtype cast of shared curves is cached
  (_shared_curves_dev), eliminating a per-slot-per-tick H2D copy for
  CPU-resident curves (SDE denoise curve at T=750 was 4 uploads/tick).
- ace_backend rebuilds the knob-driven SDE curve only when
  (amplitude, periodicity, src_T) actually move, and reuses the same
  tensor object so the setter's skip fires.
- The x0_target_strength "is it active" gate is now a host-side flag
  computed once at the setter / slot init instead of a
  per-slot-per-step tensor.any().item() fence.

Caveat: in-place mutation of a tensor passed to set_shared_curve
followed by re-setting the same object is now a no-op; all callers
build fresh tensors when values change.
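The skip rule and its caveat can be shown with a plain-Python stand-in. `SharedCurveStore` is a hypothetical class written for this illustration: lists play the role of tensors, `_normalize` is a placeholder for the real normalization, and only the equality-for-scalars, identity-for-objects check mirrors what is described above.

```python
_MISSING = object()


class SharedCurveStore:
    """Toy setter that skips work when the same value or object is pushed again."""

    def __init__(self) -> None:
        self._raw = {}
        self._normalized = {}
        self.normalize_calls = 0

    def set_shared_curve(self, name, value) -> None:
        prev = self._raw.get(name, _MISSING)
        is_scalar = isinstance(value, (int, float))
        if is_scalar and isinstance(prev, (int, float)) and prev == value:
            return  # same scalar value: nothing to do
        if not is_scalar and prev is value:
            return  # same object: assumed unchanged, contents are not inspected
        self._raw[name] = value
        self._normalized[name] = self._normalize(value)

    def _normalize(self, value):
        self.normalize_calls += 1  # placeholder for the real normalization cost
        if isinstance(value, (int, float)):
            return (float(value),)
        return tuple(float(v) for v in value)

    def get(self, name):
        return self._normalized[name]


store = SharedCurveStore()
curve = [0.1, 0.2]
store.set_shared_curve("velocity", curve)
store.set_shared_curve("velocity", curve)      # skipped: same object
curve[0] = 0.9
store.set_shared_curve("velocity", curve)      # also skipped: in-place edit is invisible
stale = store.get("velocity")
store.set_shared_curve("velocity", [0.9, 0.2])  # fresh object: normalized again
print(stale, store.get("velocity"), store.normalize_calls)
```

The third call is the caveat in action: the stored curve stays at its old values until a fresh object is passed.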

4.5 Micro-opts in the TRT forward

- Timestep rows staged on CPU and shipped in one H2D copy instead of
  B per-element writes.
- The steering buffer zero-fill is skipped when no steering configs
  are active and the buffer is already clean (dirty flag per cache
  entry); buffers are allocated zeroed.

4.6 Schedule cache LRU bound

_schedule_cache (denoise → CPU schedule tensor) grew unboundedly —
every distinct float from a swept knob is a key. Now an LRU capped at
256 entries. Values are not quantized: quantizing would change
schedules and break bit-exactness for nearby denoise values.
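A bounded cache of this kind can be sketched with `collections.OrderedDict`. This is a generic LRU under assumptions, not the repo's `_schedule_cache`: `ScheduleLRU` and the `build` callback are hypothetical, and the placeholder schedule is made up purely so the example runs. Keys are the exact floats, unquantized, as described above.

```python
from collections import OrderedDict
from typing import Callable, Tuple


class ScheduleLRU:
    """LRU cache keyed by the exact denoise float (no quantization)."""

    def __init__(self, capacity: int = 256) -> None:
        self.capacity = capacity
        self._entries: "OrderedDict[float, Tuple[float, ...]]" = OrderedDict()

    def get_or_build(self, denoise: float, build: Callable[[float], Tuple[float, ...]]):
        if denoise in self._entries:
            self._entries.move_to_end(denoise)  # mark as most recently used
            return self._entries[denoise]
        schedule = build(denoise)
        self._entries[denoise] = schedule
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return schedule

    def __len__(self) -> int:
        return len(self._entries)

    def __contains__(self, denoise: float) -> bool:
        return denoise in self._entries


def toy_schedule(denoise: float) -> Tuple[float, ...]:
    """Placeholder schedule: 8 evenly scaled steps (not the real schedule math)."""
    return tuple(denoise * k / 8 for k in range(1, 9))


cache = ScheduleLRU(capacity=256)
for i in range(1000):  # a swept knob produces many distinct floats
    cache.get_or_build(i / 1000, toy_schedule)
print(len(cache), 0.999 in cache, 0.0 in cache)
```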

5. Validation (bit-exactness and regressions)

- Cross-build bit-exactness probe (test_output/bitexact_probe.py):
  steady-state finished-latent SHA256 under (a) plain ODE, (b) shared
  velocity curve, (c) shared x0_target_strength + target latent, with
  a seeded VAE encode. Identical hashes on the baseline build (changes
  stashed) and the modified build:
  d5148851… / 095f3d0c… / baccf5ce…. The math is byte-identical;
  the changes alter only when parameters land and when latents are
  delivered.
- Unit suite: 149 passed (4 skipped: checkpoints not on disk;
  test_deck_mix.py / test_stem_source_mode_gating.py are stale
  uncommitted files from the deck branch and fail to import on main
  with or without these changes).
- Adapter parity rail (test_ace_adapter_parity.py): passes.
- Streaming determinism (test_seed_determinism_streaming):
  passes on TRT and eager.
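The idea behind a hash-based bit-exactness probe can be sketched without the pipeline. The snippet is an illustration only: `latent_digest` is a hypothetical helper, the latent is a flat list of floats packed as little-endian float32, and the actual probe script may serialize its tensors differently.

```python
import hashlib
import struct
from typing import Sequence


def latent_digest(values: Sequence[float]) -> str:
    """SHA256 over the raw float32 bytes of a latent (little-endian)."""
    payload = struct.pack(f"<{len(values)}f", *values)
    return hashlib.sha256(payload).hexdigest()


baseline = [0.0, 1.0, -0.5, 0.25]
same = list(baseline)
one_ulp_off = [0.0, 1.0 + 2 ** -23, -0.5, 0.25]  # smallest float32 step above 1.0

print(latent_digest(baseline) == latent_digest(same))         # identical bytes
print(latent_digest(baseline) == latent_digest(one_ulp_off))  # a single bit differs
```

A digest comparison is stricter than a tolerance check: any single-bit difference in the latent changes the hash, which is why matching hashes across builds support the claim that the math is byte-identical.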

What is intentionally not identical to the old build: wire-level
streaming audio. Slots now run fresher requests (queue cap) and
deliver a tick earlier, so the windows written at a given wall time
differ — that is the improvement, not a regression. The golden
harness's tier-2 tolerance comparison is the right tool to confirm
perceptual equivalence; its refs need one-time calibration on this
card class (see tests/golden/README.md).

6. Further proposals (with caveats)

Ordered by expected payoff:

- Prefer SDE mode (or shared-curve denoise) for the strength
  knob. The single biggest knob-latency lever costs no engine work:
  per-slot ODE denoise is structurally steps ticks slower than the
  shared-curve path. An ODE-mode shared "denoise-like" control (e.g.
  mapping strength onto a velocity/noise curve) would make ODE feel
  like SDE. Caveat: changing a slot's t-schedule mid-flight is not
  mathematically equivalent to truncating it; this needs design, not
  just plumbing.
- Lead floor tuning. Measured lead p50 sits at ~0.2 s; the floor
  and release time-constant are env-tunable. Lowering the floor cuts
  audible latency 1:1, at the cost of underrun margin on slow ticks
  (LoRA refit stalls, profile swaps). Worth a guarded experiment with
  the gap-fill telemetry watched.
- CUDA Graphs over the TRT forward + integration. The tick is
  now enqueue-bound; capturing the steady-state tick as a graph would
  collapse launch overhead (~tens of µs × dozens of kernels). Caveats:
  shape changes (T transitions, CFG row counts), TRT execute_async_v3
  inside a torch graph capture is fragile, and refit invalidates
  captures. This is a large, risky change for maybe 10–20 % of tick
  time.
- Event-based timing. Replace wall-clock tick_ms/dec_ms with
  CUDA event pairs so the runner's pacing logic sees true GPU cost
  without forcing syncs. Low risk; do it when timing fidelity starts
  driving decisions (e.g. for the lead controller).
- Pinned staging for per-tick H2D (timesteps, CPU noise, CPU
  curves). The single-copy timestep fill landed; promoting the rest
  to pinned+async needs double-buffering against in-flight reuse.
- Noise caching per integer seed. _make_noise reseeds the
  global RNG and generates on CPU every submission (~1 ms at T=750).
  Caching (seed, T, D) → noise would also stop the global RNG
  side-effects. Caveat: SDE re-noise draws from the same global RNG
  stream, so removing the per-tick manual_seed changes SDE output.
  It must be gated to ODE or paired with a dedicated generator, and
  golden-verified.
- Spectral engine rebuild. This machine runs legacy engines
  without the steering binding. Rebuilding
  (python -m acestep.engine.trt.build --all) restores steering and
  picks up current TRT kernel improvements. FP8 (fp8_mixed) decoder
  engines cut forward time further on Ada+; needs the calibration
  flow from docs/TRT.md and a listening pass.
- Walk-window chunk pre-encoding and conditioning re-encode
  avoidance on prompt edits: both are stall (p99) reducers rather
  than steady-state wins; the prompt path already caches by text.
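The noise-caching proposal can be sketched with the standard library to show the two properties it is after: repeat requests hit a cache, and the global RNG is left untouched. This is an assumption-laden stand-in, not `_make_noise`: it uses Python's `random.Random` as the dedicated generator and tuples instead of tensors; in a PyTorch codebase the analogous tool would be a dedicated `torch.Generator`.

```python
import random
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=64)
def make_noise(seed: int, T: int, D: int) -> Tuple[Tuple[float, ...], ...]:
    """Gaussian noise of shape (T, D) from a generator private to this call."""
    rng = random.Random(seed)  # dedicated generator: the global RNG is never reseeded
    return tuple(tuple(rng.gauss(0.0, 1.0) for _ in range(D)) for _ in range(T))


# The global stream is unaffected by noise generation.
random.seed(123)
expected_next = random.random()
random.seed(123)
first = make_noise(7, 4, 3)
actual_next = random.random()

second = make_noise(7, 4, 3)  # served from the cache, same object
print(expected_next == actual_next, first is second, make_noise.cache_info().hits)
```

The caveat from the list still applies: if another code path (such as SDE re-noise) currently relies on the per-submission reseed of the global RNG, switching to a private generator changes that path's output and needs to be gated and golden-verified.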

Signed-off-by: BuffMcBigHuge <marco@bymar.co>

leszko

LGTM

Latency experimentation.

9955098

Signed-off-by: BuffMcBigHuge <marco@bymar.co>

BuffMcBigHuge marked this pull request as ready for review June 10, 2026 02:56

leszko requested review from j0sh and ryanontheinside June 10, 2026 06:36

leszko approved these changes Jun 10, 2026


This was referenced Jun 11, 2026

Ryanontheinside/feat/latency/06 near playhead repatch #240

Draft

VRAM Pressure - Melband #244

Open

BuffMcBigHuge wants to merge 1 commit into main from marco/feat/latency
