Ryanontheinside/feat/latency/06 near playhead repatch #240
ryanontheinside wants to merge 7 commits into
The produce bracket ran two full-device torch.cuda.synchronize() per tick and render_window/render_full a third, all purely to measure last_tick_ms/last_dec_ms. Record CUDA events around the engine step and the decode instead: the produce bracket resolves lazily at the start of the next produce (one tick stale; both readers are diagnostics), and the decode bracket resolves right after the waveform's D2H copy, which already drained the stream. Removes every measurement-only host-device sync from the tick loop and lets CPU prep overlap GPU work.
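The lazy-resolution pattern described above can be sketched without a GPU. This is a minimal illustration, not the code in this PR: `LazyBracket` and `WallClockEvent` are hypothetical names, and the stand-in event only mimics the `record()` / `elapsed_time(end)` methods of `torch.cuda.Event`. On a real device the factory would presumably be something like `lambda: torch.cuda.Event(enable_timing=True)`, and the read is only safe because an earlier copy has already drained the stream.

```python
import time


class WallClockEvent:
    """Hypothetical CPU stand-in with the two torch.cuda.Event methods used below."""

    def __init__(self):
        self._t = None

    def record(self):
        self._t = time.perf_counter()

    def elapsed_time(self, end):
        # Milliseconds between this event and `end`, like torch.cuda.Event.
        return (end._t - self._t) * 1000.0


class LazyBracket:
    """Times a region without blocking; the result is read one tick later."""

    def __init__(self, event_factory=WallClockEvent):
        self._make = event_factory
        self._start = None
        self._pending = None   # (start, end) pair recorded on the previous tick
        self.last_ms = None    # most recently resolved duration

    def begin(self):
        # Resolve the previous bracket first. The assumption is that its work
        # has already completed by now, so the read does not need a sync.
        if self._pending is not None:
            start, end = self._pending
            self.last_ms = start.elapsed_time(end)
            self._pending = None
        self._start = self._make()
        self._start.record()

    def end(self):
        stop = self._make()
        stop.record()
        self._pending = (self._start, stop)
```

After the first `begin()` / `end()` pair `last_ms` is still `None`; it is filled in by the next `begin()`, which is the "one tick stale" behavior the commit message accepts for diagnostics.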
set_shared_curve stored CPU fp32 tensors and _eff_shared re-ran normalize_curve on every read, so every shared override (the exact path the live denoise/guidance knobs ride) cost a fresh CPU alloc plus a host-to-device copy per slot per step. Canonicalize and device-cast once at the setter (same approach set_channel_gain_tensor already uses), return dict hits directly, and memoize normalized SlotRequest scalar fields once per slot in _Slot.curve_cache. Dtype casting stays at the consumer boundary, so the math is byte-identical.
The strength gate ran eff_strength.abs().any().item() per slot per step — a host-device fence in the middle of the integration loop (the old comment even called it out). Track an "any nonzero" flag alongside each shared curve at set_shared_curve (free for scalar sets, one readback per knob write for tensor sets) and memoize the slot-field flag once per slot, computed without tensor ops for Python-scalar fields. The gate decision is unchanged for every input; the strength tensor itself is only fetched on the path that actually blends.
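The two commits above share one idea: do the expensive work once at the setter and leave the per-step read path as a plain lookup. A rough sketch of that shape follows, using tuples of floats in place of device tensors. `SharedCurves` and `blend_step` are hypothetical names; only `set_shared_curve` and the `x0_target_strength` curve name come from the PR text.

```python
class SharedCurves:
    """Hypothetical store: canonicalize once at the setter, read cheaply per step."""

    def __init__(self):
        self._curves = {}    # name -> canonical tuple of floats
        self._nonzero = {}   # name -> bool, computed once per write

    def set_shared_curve(self, name, value):
        # Accept a scalar or a sequence; canonicalize exactly once, here.
        if isinstance(value, (int, float)):
            curve = (float(value),)
        else:
            curve = tuple(float(v) for v in value)
        self._curves[name] = curve
        # The "any nonzero" flag is computed at write time, not per step.
        self._nonzero[name] = any(v != 0.0 for v in curve)

    def get(self, name):
        # Dict hit returned directly; no re-normalization on the read path.
        return self._curves.get(name)

    def is_active(self, name):
        # The gate reads a cached bool instead of inspecting the values.
        return self._nonzero.get(name, False)


def blend_step(x, x0_target, curves):
    """Blend x toward x0_target only when the strength curve is nonzero."""
    if not curves.is_active("x0_target_strength"):
        return x  # gate closed: the strength values are never fetched
    s = curves.get("x0_target_strength")[0]
    return [(1.0 - s) * a + s * b for a, b in zip(x, x0_target)]
```

With real tensors the flag would cost one readback per knob write instead of one per slot per step, which is the trade the commit message describes.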
TRT path: the timestep fill wrote one Python scalar per row straight into the device buffer (B tiny H2D writes per forward, doubled under CFG). Stage the rows in a pinned host buffer that lives in the shape-keyed bufs cache and issue a single async copy; ordering with the TRT exec matches the other input copies (legacy default stream vs the blocking polygraphy stream). Eager path: every forward allocated a fresh timestep tensor (pageable H2D) and a fresh all-ones attention mask kernel. Reuse both from tiny per-shape caches on the adapter; values and dtypes are unchanged.
set_prompt, the timbre applies, and clear_timbre held state._lock for the full duration of their GPU encodes (text encoder x2 + VAE for timbre). The runner acquires the same lock at the top of every tick (read_knobs / has_pending_refit), so every prompt or timbre change stalled the entire tick loop — no ticks, no gap-fill — for the encoder's duration. Restructure those flows as snapshot-encode-commit: inputs are read under state._lock, the encodes run unlocked (serialized against each other by the new state._encode_lock, preserving the old encodes-do-not-overlap property), and the commit re-checks the new state.cond_epoch, re-snapshotting and re-encoding if another conditioning commit landed meanwhile (so concurrent prompt/timbre/swap effects all land, matching the old serialized end state). The swap commit keeps its atomic locked shape and bumps the epoch. Timbre apply no longer needs rollback state: nothing mutates before the commit.
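The snapshot-encode-commit flow can be sketched as below. This is an illustration under stated assumptions rather than the session code: `CondState` and its field names are hypothetical stand-ins, `encode` is any slow callable, and the limit of five attempts follows the "5-strikes" behavior mentioned in the review further down.

```python
import threading


class CondState:
    """Hypothetical stand-in for the session state described above."""

    def __init__(self):
        self.lock = threading.Lock()         # also taken by the tick loop
        self.encode_lock = threading.Lock()  # serializes encodes only
        self.cond_epoch = 0
        self.prompt = ""
        self.embedding = None


def set_prompt(state, prompt, encode, max_attempts=5):
    """Return True if the prompt was committed, False if it was dropped."""
    with state.encode_lock:
        for _ in range(max_attempts):
            with state.lock:      # snapshot: short critical section
                epoch = state.cond_epoch
            emb = encode(prompt)  # slow work, state.lock is NOT held
            with state.lock:      # commit: re-check the epoch
                if state.cond_epoch == epoch:
                    state.prompt = prompt
                    state.embedding = emb
                    state.cond_epoch += 1
                    return True
            # Another conditioning commit landed meanwhile: go around again.
        return False
```

The tick loop only ever contends with the two short `state.lock` sections, so its worst-case wait is a few field reads or writes rather than a full encoder pass.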
The frontier write lands at the adaptive lead, which is floored (0.25s default) for stall safety, and the region between the playhead and that lead is never otherwise rewritten — so the lead floor was the hard audibility floor for every control change regardless of how fast the engine reacted (the measured knob-to-ear was dominated by it). After a real generation (mode=="generate", fresh result) and only within 2s of inbound activity, render a second window from the new latent at playhead + (interval_ema * gain + safety_margin) — the lead formula without the floor and stall bump — and patch it in with the same crossfade/clamp behavior as the frontier write. The frontier write keeps covering the buffer through stalls exactly as before; the close write is opportunistic, so a late landing just leaves the valid older audio in place. Loop-band aware (wraps the target inside an armed band, clamps at B, skips sub-window bands); walk mode passes no band. Does not feed _note_decode_gap (second write in the same tick, like the band-wrap render). Costs one extra windowed decode (~2.4ms) per active-generation tick; idle and untouched sessions pay nothing.
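To make the two write targets concrete, here is a small sketch of the lead arithmetic. The function names are hypothetical, and the way the floor and stall bump combine in `frontier_lead` is an assumption; the message above only states that the close write uses the lead formula without them and only fires within 2s of inbound activity. Loop-band handling is left out.

```python
def frontier_lead(interval_ema, gain, safety_margin, floor=0.25, stall_bump=0.0):
    """Floored lead for the frontier write, in seconds (assumed composition)."""
    return max(floor, interval_ema * gain + safety_margin) + stall_bump


def close_lead(interval_ema, gain, safety_margin):
    """The same formula without the floor or the stall bump."""
    return interval_ema * gain + safety_margin


def close_target(playhead, interval_ema, gain, safety_margin,
                 last_inbound, now, activity_window=2.0):
    """Position to re-patch, or None when there was no recent inbound activity."""
    if now - last_inbound > activity_window:
        return None
    return playhead + close_lead(interval_ema, gain, safety_margin)
```

With illustrative values `interval_ema=0.05`, `gain=2.0`, `safety_margin=0.02`, the frontier write lands 0.25s ahead of the playhead while the close write lands about 0.12s ahead.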
_action_audible took the FIRST qualifying slice in arrival order, which equaled the earliest-audible one only while slice starts advanced monotonically. The near-playhead re-patch emits a second, closer-to-playhead slice a few ms after each frontier write, so the first-arriving qualifying slice is now the FARTHEST one — the metric reported the old lead-floor number (~218ms) while re-patched content sat ~120ms from the playhead. Take min over qualifying slices of max(arrival, playhead-reaches-start); identical for monotonic streams, and a late-arriving slice cannot game it (arrival is inside the max).
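The corrected metric is small enough to write out. In this sketch, `action_audible` and `playhead_reach` are hypothetical names, and "qualifying" is assumed to mean "arrived at or after the action"; the real filter may differ.

```python
def action_audible(slices, action_time, playhead_reach):
    """Earliest time at which post-action audio can be heard, or None.

    slices: iterable of (arrival_time, start_position) pairs.
    playhead_reach: callable mapping a buffer position to the time at which
    the playhead reaches it.
    """
    candidates = [
        max(arrival, playhead_reach(start))
        for arrival, start in slices
        if arrival >= action_time  # assumed qualification rule
    ]
    return min(candidates) if candidates else None


def first_qualifying(slices, action_time, playhead_reach):
    """The old behavior, for comparison: first qualifying slice in arrival order."""
    for arrival, start in slices:
        if arrival >= action_time:
            return max(arrival, playhead_reach(start))
    return None
```

For a frontier slice arriving at 0.010s with start 0.218 and a re-patch slice arriving at 0.015s with start 0.120 (playhead moving at one position unit per second from zero), the old rule reports 0.218 and the new one 0.120. A slice that arrives at 0.5s with start 0.05 scores 0.5, so arriving late cannot improve the number.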
BuffMcBigHuge
left a comment
Full review — read against #238 (marco/feat/latency)
Both PRs came out of independent same-week latency reviews of the same stack, so I diffed both branches side by side against the common base dc06b19. Summary: roughly half of each PR is the same three levers implemented twice; the other halves are disjoint and compose. Net of a merge, almost everything in both survives — details and a suggested reconciliation below.
Where we built the same thing twice
1. x0_target_strength gate readback (8b08147 vs #238 §4.4)
Same diagnosis (per-slot-per-step .abs().any().item() fence in the integration loop), same fix (set-time flags). Yours is the better cut: _curve_nonzero is generalized over curve names, the slot-field flag memoizes lazily and costs no tensor op for Python-scalar fields; ours special-cases x0 with an eager slot-init flag. Take yours; ours drops.
2. Shared-curve hot path (e13f03f vs #238 §4.4)
Same target, different halves of the fix:
- Yours: device-cast at the setter, plus slot.curve_cache for the slot-field fallback path (our version still re-normalizes that path on every read).
- Ours: setter dedup (re-setting the same scalar / tensor object is a no-op, which matters because ace_backend pushes the full knob state every tick, so without dedup the setter still re-normalizes once per tick), plus a read-side device+dtype cast cache, and an ace_backend SDE-curve rebuild key (amplitude, periodicity, src_T) that makes the dedup actually fire for the tensor curve.
- Merged shape: your setter device-cast + curve_cache, our setter dedup + SDE-curve key. On dtype: you leave the per-step .to(dtype) at the consumer boundary (still an alloc per step when dtypes differ); we cached that cast too. Worth keeping exactly one mechanism for it in the merge.
3. Timestep H2D in _trt_forward (0d225e6 vs #238 §4.5)
Yours is strictly better: pinned staging buffer cached in bufs + non_blocking copy, vs our fresh pageable torch.tensor per forward. (Our own writeup listed pinned+async as further-proposal #5 — you built it.) You also covered the eager path in model_adapter, which we didn't touch. Take yours. One invariant worth recording: reusing the pinned host buffer across forwards is safe because the per-tick emission D2H drains the stream before the next tick writes it again — if emission ever stops being per-tick, this needs double-buffering.
4. Host syncs / timing — two non-overlapping halves of the same problem.
70ebbff replaces the measurement syncs (the two per-tick full-device torch.cuda.synchronize() in diffusion_backend + the decode bracket in ace_backend) with CUDA events, but keeps the TRT enqueue syncs. #238 §4.3 removes the enqueue syncs (_trt_stream.synchronize() in _trt_forward and _trt_vae_decode, same blocking-stream ordering argument your commit message makes) but keeps the measurement syncs. Each PR removed exactly the syncs the other kept. Merged, the tick has no host sync left except the natural emission .cpu() — and your events fix the caveat we documented (with #238 alone, last_tick_ms degrades to enqueue time; with your events it's true GPU time again, which matters because the lead controller and golden trace consume it). Your lazy event resolution assumes the render's D2H already drained the stream — that still holds under #238's sync removals, so the composition is sound.
What only #240 has — the half we didn't build
- Near-playhead repatch (3ba48f6) is the headline. Our review called the playback lead "the dominant audible term" but only proposed lead-floor tuning (with the underrun risk that implies); the repatch is the better idea — keep the floored frontier write for stall safety, opportunistically rewrite close-in, let a late landing lose harmlessly. And the two PRs compose multiplicatively here: #238 shortens knob→fresh-latent (queue_cap=1: −5 ticks measured on per-slot params at d4/s8; same-tick delivery: −1 tick on every generation), this PR shortens fresh-latent→ear (~160 ms of buffer distance per your 218→58 ms numbers). On a merged tip, an ODE denoise change gets both.
- Encodes off state._lock (0cddfae): our further-proposal #8 territory, done properly. The snapshot/encode/commit + cond_epoch retry design reads correctly to me. Two small notes: (a) set_prompt writes state.time_signature during the snapshot phase, so a 5×-lost prompt still leaves the ts override applied; probably fine, just asymmetric with the rest of the commit; (b) the 5-strikes drop logs and skips PromptApplied, so clients waiting on the ack must tolerate a missing event (rare in practice, worth knowing).
- Golden metric fix (0539d71): necessary the moment slice starts stop being arrival-monotonic, and min-over-max(arrival, reach) can't be gamed by late arrivals. Good catch: the e2e harness from #238 has the same first-qualifying-slice assumption in its knob→fresh-generation metric and will need the same fix.
- Eager-path buffer reuse in model_adapter (pinned timestep pair + cached attn-ones).
What only #238 has
queue_cap=1 (the single biggest engine-side win), same-tick latent delivery, the TRT enqueue sync removals, the steering zero-fill dirty flag, the schedule-cache LRU bound, the SDE-curve rebuild cache, and the tests/e2e_latency/ harness (engine-level knob→latent change detection + headless full-stack timing) — which complements your golden-runner work as the build-to-build diffing tool.
Merge mechanics
git merge-tree of the two heads: only acestep/engine/stream.py conflicts — everything else (including ace_backend.py, which both touch) auto-merges. The conflicting hunks are exactly the three duplicated levers above, and the resolution writes itself given the picks: your _curve_nonzero + pinned timestep staging + curve_cache, our setter dedup + queue_cap + same-tick delivery + TRT sync removals + LRU + steering flag.
Suggested order: #238 already has an approval — land it, rebase this stack on it. 01 (events), 05 (session), 06 (repatch), 07 (golden) rebase clean since #238 doesn't touch those files; 02/03/04 shrink to the deltas above. Happy to do the stream.py reconciliation from either side.
One caveat on numbers: both PRs measured against base independently — your 58 ms audible_first doesn't include #238's queue-cap/same-tick wins, our 172 ms knob→fresh-generation doesn't include your repatch. The merged tip deserves a fresh golden + e2e run; the bit-exactness probes from both PRs should still pass, since neither side reorders RNG draws.
Great work — genuinely complementary stacks.
@ryanontheinside: full comparison of this stack against #238.
TL;DR
Same week, same stack, two independent reviews. Roughly half overlaps (same diagnosis, different implementations); half is disjoint and composes cleanly. A merged tip keeps almost everything from both PRs.
Duplicated (pick one implementation per lever)
Only in #240 (we didn't build these)
Only in #238 (you didn't build these)
Suggested landing
Knob-to-ear latency stack — plan + progress
Untracked working file. Source review: 2026-06-09 (memory: knob-to-ear-code-levers).
Base: main @ dc06b19. Stack is linear; each branch is based on the previous.
Gates per branch:
.venv/Scripts/python.exe -m pytest tests/unit + typecheck of touched files.
Golden harness: ONLY at the end, on the stack tip, with Ryan's explicit OK.
No pushes / PRs without Ryan's explicit per-action approval.
Out of scope: shadow migration (lives in DEMON_alt2 / exp branches, own landing plan).
Stack (in order)
ryanontheinside/perf/latency/01-cuda-event-timing [DONE: pending]
diffusion_backend.py:102-119: two full-device cuda.synchronize calls per tick exist only
to time last_tick_ms; ace_backend.py render_window has a third before .cpu().
Replace with CUDA events (lazy elapsed_time read; one-tick-stale timing is fine, but
verify nothing reads last_tick_ms same-tick for logic).
ryanontheinside/perf/latency/02-shared-curve-cache [DONE: pending]
stream.py: set_shared_curve stores CPU fp32; _eff_shared re-normalizes every read;
_resolve_slot_curves calls .to(device) per slot per step. Pre-normalize + device-cast at
the setter (precedent: set_channel_gain_tensor), lazy-cast if device unknown at set
time. Slot-field fallback path keeps read-time normalize.
ryanontheinside/perf/latency/03-x0-gate-no-readback [DONE: pending]
stream.py:1226-1231: eff_strength.abs().any().item() = host fence per slot per step.
Compute nonzero-flag once at set_shared_curve / slot init; read the cached bool.
ryanontheinside/perf/latency/04-batched-small-h2d [DONE: pending]
stream.py:721-723: per-element timestep writes -> one pinned staging copy_.
model_adapter.py:166-173: per-forward torch.tensor(timesteps) + torch.ones attn
mask -> cached/reused buffers.
ryanontheinside/perf/latency/05-prompt-encode-unlocked [DONE: pending]
streaming/session.py set_prompt (1396), timbre/structure apply (1212, 1660),
prompt_blend swap (966-990): GPU encode runs under state._lock while the runner
takes the same lock every tick (ace_backend read_knobs:541, has_pending_refit:566).
Encode outside the lock, swap under it; generation counter so a stale encode never
overwrites a newer one.
ryanontheinside/feat/latency/06-near-playhead-repatch [DONE: pending]
pipeline_runner.py: after a fresh produce, additionally render+patch a window at
playhead + transit margin (not just playhead + lead, floor 0.25s @ :239). Latent
state covers all positions (gap-fill proves it). Kills the lead floor as the
audibility floor for knob changes. Mind the monotonic-decode comment (:245-248):
repatch is a deliberate, separate write path. Default behavior decided here;
measure with golden latency report.
Parity traps
latency report; keep field semantics.
Status log
04:0d225e6 05:0cddfae 06:55f06c5), each green on tests/unit (149 passed).
Commit messages must go via git commit -F <file>: the settings denylist
pattern-matches message text (e.g. "*rd *", "*nc *") and multiline
commands choke the harness.
golden harness last — Ryan approved the golden run 2026-06-09).
full golden 12/12 PASSED at tip in 4:51. knob_step initially showed
first=218/full=578 (vs 234/594 baseline) — repatch WAS firing (109
lat_nearpatch lines, close~0.12-0.18s) but _action_audible took the
first-ARRIVING qualifying slice; repatch slices arrive later with
earlier starts. Fixed metric to min-over-slices of max(arrival,
playhead-reaches-start) (0539d71); trace line amended into 3ba48f6
(06 tip moved 55f06c5 -> 0539d71). audible_full is structurally
playhead-to-frontier distance; the lever for it is shadow migration,
not this stack. Re-validation run + final full golden pending.
baseline / 218 old-metric), full=578 (structural). prompt_change:
next_slice=31ms during the encode (ticks never stall, branch 5 works);
prompt ack 156->313ms (encode now shares GPU with ticks — right trade).
Final gates at tip 0539d71: tests/unit 149 passed; golden 12/12 in 4:48.
STACK COMPLETE, local only — pushes/PRs need Ryan's per-action approval.