Latency Improvement Experimentation#238
Open
BuffMcBigHuge wants to merge 1 commit into
Signed-off-by: BuffMcBigHuge <marco@bymar.co>
This was referenced Jun 11, 2026
DEMON end-to-end latency: review, measurements, changes
This document is the result of a full latency review of the DEMON
streaming stack — from a knob turn in the web demo (or any client) to
the audio that comes out — including the TRT inference layer. It
covers:
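- The knob-to-ear latency path: every contributor we control.
- The test strategy (new suite `tests/e2e_latency/`, runnable outside the demo app).
- Measured results on this build's hardware.
- The changes made, with their caveats.
- Validation: bit-exactness probes and regression coverage.
- Further proposals, with the risks that kept them out of this pass.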
Hardware/context for all numbers: RTX 4090 (shared with other
processes, ~7 GB free VRAM), TRT decoder
`decoder_mixed_refit_b8_60s` (legacy pre-spectral build), depth 4,
steps 8, 30 s source, ODE mode unless stated.
1. The knob-to-ear latency path
A knob change traverses, in order:
- `smoothMs` / `useParamSync`; param send ~125 Hz
- `set_knobs` → next runner iteration
- `PipelineRunner` loop, no inter-tick sleeps
- `set_shared_curve` overrides all in-flight slots next tick
- `steps` ticks
- `StreamPipeline.tick()`
- `vae_window` = 0.4 s, TRT VAE
- `PipelineRunner` lead controller

Two structural facts dominate everything else:
- Shared-curve parameters reach the output in one tick (~36 ms on TRT). Per-slot parameters (`denoise` in ODE mode, `seed`) only land on newly submitted requests, which must wait in the submit queue and then run their full `steps` schedule. The demo's "strength" knob is per-slot in ODE mode and shared-curve in SDE mode, so SDE mode is an order of magnitude more responsive by construction.
- The engine can converge in ~300 ms while the listener still waits for the playhead to reach the rewritten region. The adaptive lead controller (`PipelineRunner`) floors near 0.2 s on a healthy GPU; everything the engine gains shortens the fresh-generation horizon that the lead is measured against.
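The gap between the two parameter paths can be estimated with simple arithmetic. The sketch below is illustrative only: the function names and the additive model are assumptions, and only the ~36 ms tick and the steps 8 configuration come from this document.

```python
TICK_MS = 36.0  # approximate TRT tick time quoted above


def shared_curve_latency_ms(tick_ms: float = TICK_MS) -> float:
    """Shared-curve overrides reach all in-flight slots on the next tick."""
    return 1 * tick_ms


def per_slot_latency_ms(steps: int, queue_wait_ticks: int = 0,
                        tick_ms: float = TICK_MS) -> float:
    """Per-slot parameters wait in the submit queue, then run a full schedule."""
    return (queue_wait_ticks + steps) * tick_ms


shared = shared_curve_latency_ms()
per_slot = per_slot_latency_ms(steps=8)
print(f"shared-curve: {shared:.0f} ms, per-slot (no queue wait): {per_slot:.0f} ms")
```

Under this model the per-slot path costs `steps` times the shared-curve path even with an empty queue, which is consistent with the "order of magnitude" remark above.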
2. Test strategy
New suite `tests/e2e_latency/`: engine-level and full-stack
measurements outside the demo app, black-box at public API
boundaries so the same measurements stay valid across internal
refactors.
Environment: `DEMON_E2E_ACCEL=tensorrt|eager|compile` (default:
tensorrt when engines exist), `DEMON_E2E_GPU` (default: most free
VRAM), `DEMON_E2E_DEPTH` / `DEMON_E2E_STEPS`. Reports land in
`runs/latency-reports/e2e-*.json` for build-to-build diffing. Treat
the diff, not the pass/fail, as the output; assertions are coarse
architectural ceilings only.
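Build-to-build diffing of two reports could look roughly like the sketch below. The flat report schema and the metric names are hypothetical (the real JSON layout is not shown here); the sample values are the eager-backend numbers reported in this document.

```python
import json


def diff_reports(old: dict, new: dict) -> dict:
    """Return {metric: (old, new, delta)} for numeric metrics in both reports."""
    diff = {}
    for key in sorted(old.keys() & new.keys()):
        a, b = old[key], new[key]
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            diff[key] = (a, b, b - a)
    return diff


# Hypothetical flat schema; in practice these would be loaded from two files
# under runs/latency-reports/.
baseline = json.loads('{"tick_p50_ms": 130, "denoise_first_change_ticks": 14}')
candidate = json.loads('{"tick_p50_ms": 115, "denoise_first_change_ticks": 8}')

for metric, (a, b, delta) in diff_reports(baseline, candidate).items():
    print(f"{metric}: {a} -> {b} ({delta:+})")
```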
Measurement design:
- With constant knobs, the pipeline reaches a steady state where every finished latent is bit-identical. After a knob flip, the first differing finished latent marks first change; the first repeating new latent marks converged. No pipeline internals are inspected.
- The full-stack measurement consumes `AudioReady` events from a headless `StreamingSession` and feeds `set_knobs` with an advancing `playback_pos` heartbeat, mirroring the web client. It reports first-write, write gaps, measured lead, tick/decode percentiles, generation rate, and knob→fresh-generation latency (first write whose window was generated entirely after the knob).
- Runs with the same seed produce bit-identical steady-state latents (`test_seed_determinism_streaming`).

The conftest patches the canonical TRT profile table to the legacy
(pre-spectral) decoder engine names when the spectral builds are
missing on the machine; this affects the test process only.
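The first-change / converged detection described above can be sketched over a list of per-latent digests. This is a minimal illustration, not the suite's code: `knob_response` and its arguments are hypothetical, and "converged" is taken here to be the first new latent whose successor is identical to it.

```python
from typing import Optional, Sequence, Tuple


def knob_response(hashes: Sequence[str],
                  flip_index: int) -> Tuple[Optional[int], Optional[int]]:
    """Locate first change and convergence after a knob flip.

    hashes: one digest per finished latent, in delivery order.
    flip_index: index (>= 1) of the first latent delivered after the flip.
    Returns (first_change, converged) as offsets from flip_index, or None
    where the event was not observed.
    """
    baseline = hashes[flip_index - 1]  # steady-state latent before the flip
    first_change = None
    converged = None
    for i in range(flip_index, len(hashes)):
        if first_change is None and hashes[i] != baseline:
            first_change = i - flip_index
        # A new (non-baseline) latent that equals its predecessor has repeated.
        if hashes[i] != baseline and hashes[i] == hashes[i - 1]:
            converged = (i - 1) - flip_index
            break
    return first_change, converged


trace = ["a", "a", "a", "a", "b", "c", "d", "d", "d"]
result = knob_response(trace, flip_index=3)
print(result)
```

Working only on digests keeps the measurement black-box: nothing about slots, queues, or schedules is needed to compute either number.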
The pre-existing golden harness (`tests/golden/`) remains the wire-level
regression net; it was not run here (refs are captured on an RTX 5090
and require calibration on other cards; see its README).
3. Measured results
Engine level (TRT, depth 4, steps 8)
(Table rows: `denoise` knob first change; `denoise` knob converged.)

Eager backend (same machine): tick p50 130 → 115 ms (−12 %, the CPU-side
wins matter more without TRT); denoise first change 14 → 8 ticks.
Full stack (headless StreamingSession, TRT)
(`knob_to_next_write` is a single-sample phase measurement and jitters
between 16–47 ms; not meaningful at this resolution.)
4. Changes made
4.1 Submit-queue cap = 1 in streaming mode
`StreamPipeline` queued up to `depth` requests; streaming callers
submit a fresh request every tick, so a knob change sat behind up to
`depth−1` stale requests, each costing ~`steps/depth` ticks before a
retiring slot picked it up.
- `StreamPipeline` now takes `queue_cap` (default: historical `depth`), and `StreamDenoise` constructs its pipeline with `queue_cap=1`, so a retiring slot is always refilled with the freshest parameters.
- Removes up to `(depth−1)·steps/depth` ticks of queue wait from every per-slot parameter (measured: −5 ticks at d4/s8).
can never starve.
- A caller that submits more than one request between ticks would lose intermediates; no such caller exists (`StreamDenoise` is the only production submit site; drain mode submits exactly one). Walk-window chunk selection is playhead-driven per submission, so latest-wins is the correct semantic there too. Direct `StreamPipeline` users (calibration script, parity tests) keep the old default.
4.2 Same-tick delivery of finished latents
`tick()` previously parked a slot that reached its final step until
the next tick's scan returned it. Newly finished slots are now
delivered at the end of the tick that finished them. This shaves one
full tick (~36 ms TRT / ~115 ms eager) off every generation,
permanently. The one-result-per-tick contract is preserved (a pre-tick
scan still drains leftovers when several slots finish together, one
per tick, which only happens when `depth ≥ steps`).

Caveat: inter-tick observers see slightly different state
(`active_slots` drops one tick earlier; `is_warmed_up` is stats-only).
No control flow in the repo depends on the old phase.
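The one-tick saving can be seen in a toy single-slot model. The function below is a hypothetical simplification for illustration and does not mirror the real `tick()` implementation.

```python
def ticks_until_delivery(steps: int, same_tick: bool) -> int:
    """Count tick() calls from submission to delivery for a single slot."""
    done_steps = 0
    parked = False  # finished on an earlier tick but not yet returned
    ticks = 0
    while True:
        ticks += 1
        if parked:
            return ticks  # pre-tick scan returns the parked slot
        done_steps += 1  # advance the slot by one denoising step
        if done_steps == steps:
            if same_tick:
                return ticks  # deliver at the end of the finishing tick
            parked = True  # old behaviour: wait for the next tick's scan


old = ticks_until_delivery(steps=8, same_tick=False)
new = ticks_until_delivery(steps=8, same_tick=True)
print(old, new)
```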
4.3 Removed host syncs after TRT enqueues (decoder + VAE decode)
`_trt_forward` ended with `self._trt_stream.synchronize()`;
`_trt_vae_decode` did the same. The polygraphy stream is created with
`cudaStreamCreate`, which makes it a blocking stream with respect to
the legacy default stream PyTorch uses, so every subsequent torch op is
already implicitly ordered after the TRT execution on the GPU. The
explicit syncs only blocked the host. With them gone, the CPU enqueues
the integration math / emission prep while the engine is still executing.

`_last_tick_ms` now measures mostly enqueue time; per-loop wall timings
in the runner stay accurate because emission's `.cpu()` is a natural
sync point, but sub-component attribution shifts toward whatever op
forces the sync. If PyTorch is ever moved off the legacy default stream
(per-thread default streams, explicit `torch.cuda.Stream` contexts
around the tick), the implicit ordering assumption must be revisited;
the right fix then is CUDA events, not host syncs. One-shot paths
(`diffusion.py`, `trt/runtime.py` `TRTDecoder`, VAE encode) keep their
syncs: they are cold paths and not worth the risk surface.
4.4 Shared-curve hot-path caching
- `set_shared_curve` now skips re-normalization when the same scalar value or tensor object is pushed again (the streaming backend pushes the full knob state every tick).
- Shared curves are cached on the device (`_shared_curves_dev`), eliminating a per-slot-per-tick H2D copy for CPU-resident curves (SDE denoise curve at T=750 was 4 uploads/tick).
- `ace_backend` rebuilds the knob-driven SDE curve only when (amplitude, periodicity, src_T) actually move, and reuses the same tensor object so the setter's skip fires.
- The `x0_target_strength` "is it active" gate is now a host-side flag computed once at the setter / slot init instead of a per-slot-per-step `tensor.any().item()` fence.

Caveat: in-place mutation of a tensor passed to `set_shared_curve`
followed by re-setting the same object is now a no-op; all callers
build fresh tensors when values change.
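The skip and its caveat can be illustrated with a small stand-in. `SharedCurveStore` is hypothetical, plain lists stand in for tensors, and `_prepare` is a placeholder for the real normalization and upload work.

```python
_MISSING = object()


class SharedCurveStore:
    """Toy setter that skips work when the same scalar or object is pushed again."""

    def __init__(self):
        self._last_pushed = {}  # name -> last scalar value or curve object
        self._prepared = {}     # name -> prepared copy used by the hot path
        self.prepare_calls = 0

    def set_shared_curve(self, name, value):
        last = self._last_pushed.get(name, _MISSING)
        if isinstance(value, (int, float)):
            unchanged = isinstance(last, (int, float)) and last == value
        else:
            unchanged = value is last  # identity check, not element comparison
        if unchanged:
            return
        self._last_pushed[name] = value
        self._prepared[name] = self._prepare(value)

    def _prepare(self, value):
        self.prepare_calls += 1  # placeholder for normalization + upload
        if isinstance(value, (int, float)):
            return [float(value)]
        return [float(v) for v in value]

    def get(self, name):
        return self._prepared[name]


store = SharedCurveStore()
curve = [0.1, 0.2]
for _ in range(100):  # the backend pushes the full knob state every tick
    store.set_shared_curve("velocity", curve)
calls_after_repeats = store.prepare_calls

curve[0] = 0.9  # in-place mutation of the already-pushed object
store.set_shared_curve("velocity", curve)
stale_view = store.get("velocity")  # still the old values

store.set_shared_curve("velocity", list(curve))  # fresh object: picked up
fresh_view = store.get("velocity")
print(calls_after_repeats, stale_view, fresh_view)
```

The identity check is what makes the skip cheap, and it is also why an in-place edit goes unnoticed until a fresh object is pushed.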
4.5 Micro-opts in the TRT forward
- The timestep buffer is filled with a single copy instead of B per-element writes.
are active and the buffer is already clean (dirty flag per cache
entry); buffers are allocated zeroed.
4.6 Schedule cache LRU bound
`_schedule_cache` (denoise → CPU schedule tensor) grew without bound:
every distinct float from a swept knob is a key. It is now an LRU
capped at 256 entries. Values are not quantized: quantizing would
change schedules and break bit-exactness for nearby denoise values.
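An LRU bound of this kind can be sketched with `collections.OrderedDict`. The class and the `build` callback below are hypothetical stand-ins; only the 256-entry cap and the exact (unquantized) float keys come from the text.

```python
from collections import OrderedDict
from typing import Callable, List


class ScheduleCache:
    """Toy LRU cache keyed by the exact denoise float (no quantization)."""

    def __init__(self, build: Callable[[float], List[float]], max_entries: int = 256):
        self._build = build
        self._max = max_entries
        self._entries: "OrderedDict[float, List[float]]" = OrderedDict()

    def get(self, denoise: float) -> List[float]:
        if denoise in self._entries:
            self._entries.move_to_end(denoise)  # mark as most recently used
            return self._entries[denoise]
        schedule = self._build(denoise)
        self._entries[denoise] = schedule
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict the least recently used
        return schedule

    def __len__(self) -> int:
        return len(self._entries)


def placeholder_schedule(denoise: float) -> List[float]:
    # Stand-in for the real schedule computation.
    return [denoise * i / 8 for i in range(8, 0, -1)]


cache = ScheduleCache(placeholder_schedule)
for k in range(1000):  # a swept knob produces many distinct floats
    cache.get(k / 1000)
print(len(cache))
```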
5. Validation (bit-exactness and regressions)
- `test_output/bitexact_probe.py`: steady-state finished-latent SHA256 under (a) plain ODE, (b) shared velocity curve, (c) shared x0_target_strength + target latent, with a seeded VAE encode. Identical hashes on the baseline build (changes stashed) and the modified build: `d5148851…` / `095f3d0c…` / `baccf5ce…`. The math is byte-identical; the changes alter only when parameters land and when latents are delivered.
- `test_deck_mix.py` / `test_stem_source_mode_gating.py` are stale uncommitted files from the deck branch and fail to import on main with or without these changes.
- `test_ace_adapter_parity.py`: passes.
- `test_seed_determinism_streaming`: passes on TRT and eager.
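The idea behind a hash-based bit-exactness check can be sketched as follows. This is not the probe script itself: `latent_digest` is a hypothetical helper, and a short list of floats packed as float32 stands in for a real latent tensor.

```python
import hashlib
import struct
from typing import Sequence


def latent_digest(values: Sequence[float]) -> str:
    """SHA256 over the raw little-endian float32 bytes of a latent."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return hashlib.sha256(raw).hexdigest()


baseline_latent = [0.25, -1.5, 3.0]
modified_latent = [0.25, -1.5, 3.0]         # same math, same bytes
perturbed_latent = [0.25 + 2**-20, -1.5, 3.0]  # tiny numerical drift

same = latent_digest(baseline_latent) == latent_digest(modified_latent)
drifted = latent_digest(baseline_latent) != latent_digest(perturbed_latent)
print(same, drifted)
```

Hashing raw bytes means any numerical drift, however small, shows up as a different digest, which is a stricter check than a tolerance comparison.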
What is intentionally not identical to the old build: wire-level
streaming audio. Slots now run fresher requests (queue cap) and
deliver a tick earlier, so the windows written at a given wall time
differ — that is the improvement, not a regression. The golden
harness's tier-2 tolerance comparison is the right tool to confirm
perceptual equivalence; its refs need one-time calibration on this
card class (see `tests/golden/README.md`).
6. Further proposals (with caveats)
Ordered by expected payoff:
- Shared-curve control for the ODE-mode strength knob. The single biggest knob-latency lever costs no engine work: per-slot ODE denoise is structurally `steps` ticks slower than the shared-curve path. An ODE-mode shared "denoise-like" control (e.g. mapping strength onto a velocity/noise curve) would make ODE feel like SDE. Caveat: changing a slot's t-schedule mid-flight is not mathematically equivalent to truncating it; this needs design, not just plumbing.
- The lead controller's floor and release time-constant are env-tunable. Lowering the floor cuts audible latency 1:1, at the cost of underrun margin on slow ticks (LoRA refit stalls, profile swaps). Worth a guarded experiment with the gap-fill telemetry watched.
- The tick is now enqueue-bound; capturing the steady-state tick as a graph would collapse launch overhead (~tens of µs × dozens of kernels). Caveats: shape changes (T transitions, CFG row counts), TRT `execute_async_v3` inside a torch graph capture is fragile, and refit invalidates captures. This is a large, risky change for maybe 10–20 % of tick time.
- Measure `tick_ms` / `dec_ms` with CUDA event pairs so the runner's pacing logic sees true GPU cost without forcing syncs. Low risk; do it when timing fidelity starts driving decisions (e.g. for the lead controller).
- The single-copy timestep fill landed; promoting the rest (e.g. curves) to pinned+async needs double-buffering against in-flight reuse.
- `_make_noise` reseeds the global RNG and generates on CPU every submission (~1 ms at T=750). Caching `(seed, T, D)` → noise would also stop the global RNG side-effects. Caveat: SDE re-noise draws from the same global RNG stream, so removing the per-tick `manual_seed` changes SDE output; it must be gated to ODE or paired with a dedicated generator, and golden-verified.
- The legacy decoder engine used here was built without the steering binding. Rebuilding (`python -m acestep.engine.trt.build --all`) restores steering and picks up current TRT kernel improvements. FP8 (`fp8_mixed`) decoder engines cut forward time further on Ada+; this needs the calibration flow from `docs/TRT.md` and a listening pass.
- avoidance on prompt edits: both are stall (p99) reducers rather than steady-state wins; the prompt path already caches by text.
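The noise-cache proposal above can be sketched with a dedicated generator per seed. This is an illustration under assumptions: `cached_noise` is hypothetical, and the standard-library `random.Random` stands in for a dedicated tensor-library generator.

```python
import random
from functools import lru_cache


@lru_cache(maxsize=32)
def cached_noise(seed: int, T: int, D: int) -> tuple:
    """Return T x D Gaussian noise for (seed, T, D), generated once per key."""
    rng = random.Random(seed)  # dedicated generator; global RNG is untouched
    return tuple(tuple(rng.gauss(0.0, 1.0) for _ in range(D)) for _ in range(T))


# The global stream is unaffected by noise generation:
random.seed(123)
expected_next = random.random()
random.seed(123)
noise = cached_noise(7, 4, 2)
observed_next = random.random()

# Repeated submissions with the same key reuse the cached object.
reused = cached_noise(7, 4, 2) is noise
print(expected_next == observed_next, reused)
```

Because the generator is private to the cache, other consumers of the global stream see the same draws whether or not noise was generated, which is the side-effect the proposal aims to remove.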