Ryanontheinside/feat/latency/06 near playhead repatch by ryanontheinside · Pull Request #240 · daydreamlive/DEMON

ryanontheinside · 2026-06-10T02:50:25Z

Knob-to-ear latency stack — plan + progress

Untracked working file. Source review: 2026-06-09 (memory: knob-to-ear-code-levers).
Base: main @ dc06b19. Stack is linear; each branch is based on the previous.
Gates per branch: .venv/Scripts/python.exe -m pytest tests/unit + typecheck of touched files.
Golden harness: ONLY at the end, on the stack tip, with Ryan's explicit OK.
No pushes / PRs without Ryan's explicit per-action approval.
Out of scope: shadow migration (lives in DEMON_alt2 / exp branches, own landing plan).

Stack (in order)

ryanontheinside/perf/latency/01-cuda-event-timing — [DONE: pending]
diffusion_backend.py:102-119 two full-device cuda.synchronize per tick exist only
to time last_tick_ms; ace_backend.py render_window has a third before .cpu().
Replace with CUDA events (lazy elapsed_time read; one-tick-stale timing is fine —
verify nothing reads last_tick_ms same-tick for logic).
ryanontheinside/perf/latency/02-shared-curve-cache — [DONE: pending]
stream.py set_shared_curve stores CPU fp32; _eff_shared re-normalizes every read;
_resolve_slot_curves .to(device) per slot per step. Pre-normalize + device-cast at
the setter (precedent: set_channel_gain_tensor), lazy-cast if device unknown at set
time. Slot-field fallback path keeps read-time normalize.
ryanontheinside/perf/latency/03-x0-gate-no-readback — [DONE: pending]
stream.py:1226-1231 eff_strength.abs().any().item() = host fence per slot per step.
Compute nonzero-flag once at set_shared_curve / slot init; read the cached bool.
ryanontheinside/perf/latency/04-batched-small-h2d — [DONE: pending]
stream.py:721-723 per-element timestep writes -> one pinned staging copy_.
model_adapter.py:166-173 per-forward torch.tensor(timesteps) + torch.ones attn
mask -> cached/reused buffers.
ryanontheinside/perf/latency/05-prompt-encode-unlocked — [DONE: pending]
streaming/session.py set_prompt (1396), timbre/structure apply (1212, 1660),
prompt_blend swap (966-990): GPU encode runs under state._lock while the runner
takes the same lock every tick (ace_backend read_knobs:541, has_pending_refit:566).
Encode outside the lock, swap under it; generation counter so a stale encode never
overwrites a newer one.
ryanontheinside/feat/latency/06-near-playhead-repatch — [DONE: pending]
pipeline_runner.py: after a fresh produce, additionally render+patch a window at
playhead + transit margin (not just playhead + lead, floor 0.25s @ :239). Latent
state covers all positions (gap-fill proves it). Kills the lead floor as the
audibility floor for knob changes. Mind the monotonic-decode comment (:245-248) —
repatch is a deliberate, separate write path. Default behavior decided here;
measure with golden latency report.

Parity traps

NEVER remove/reorder randn_like calls (RNG draw order parity; see memory).
Curve pre-cast must produce byte-identical math to read-time .to(dtype) casts.
ace_backend timing fields (last_tick_ms/last_dec_ms) feed the runner trace + golden
latency report; keep field semantics.

Status log

2026-06-09: plan written, stack not started.
2026-06-09: all six branches committed (01:70ebbff 02:e13f03f 03:8b08147
04:0d225e6 05:0cddfae 06:55f06c5), each green on tests/unit (149 passed).
Commit messages must go via git commit -F <file> — the settings deny
list pattern-matches message text (e.g. "*rd *", "*nc *") and multiline
commands choke the harness.
Next: GPU end-to-end on stack tip (drain parity script, live smoke,
golden harness last — Ryan approved the golden run 2026-06-09).
2026-06-09 e2e: drain parity PARITY OK (8 latents bit-identical, ODE+SDE);
full golden 12/12 PASSED at tip in 4:51. knob_step initially showed
first=218/full=578 (vs 234/594 baseline) — repatch WAS firing (109
lat_nearpatch lines, close~0.12-0.18s) but _action_audible took the
first-ARRIVING qualifying slice; repatch slices arrive later with
earlier starts. Fixed metric to min-over-slices of max(arrival,
playhead-reaches-start) (0539d71); trace line amended into 3ba48f6
(06 tip moved 55f06c5 -> 0539d71). audible_full is structurally
playhead-to-frontier distance; the lever for it is shadow migration,
not this stack. Re-validation run + final full golden pending.
2026-06-09 FINAL: corrected metric run: knob_step first=58ms (was 234
baseline / 218 old-metric), full=578 (structural). prompt_change:
next_slice=31ms during the encode (ticks never stall, branch 5 works);
prompt ack 156->313ms (encode now shares GPU with ticks — right trade).
Final gates at tip 0539d71: tests/unit 149 passed; golden 12/12 in 4:48.
STACK COMPLETE, local only — pushes/PRs need Ryan's per-action approval.

The produce bracket ran two full-device torch.cuda.synchronize() per tick and render_window/render_full a third, all purely to measure last_tick_ms/last_dec_ms. Record CUDA events around the engine step and the decode instead: the produce bracket resolves lazily at the start of the next produce (one tick stale; both readers are diagnostics), and the decode bracket resolves right after the waveform's D2H copy, which already drained the stream. Removes every measurement-only host-device sync from the tick loop and lets CPU prep overlap GPU work.
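A minimal sketch of the lazy, one-tick-stale bracket described above, not the code in this branch. `LazyBracket` and `HostEvent` are hypothetical names; `HostEvent` is a CPU stand-in so the sketch runs without a GPU.

```python
import time


class HostEvent:
    """CPU stand-in exposing the record()/elapsed_time(end) shape of a timing event."""

    def __init__(self):
        self._t = None

    def record(self):
        self._t = time.perf_counter()

    def elapsed_time(self, end):
        # Milliseconds between this event and `end`, like the CUDA counterpart.
        return (end._t - self._t) * 1000.0


class LazyBracket:
    """Times a begin/end bracket but reads the result one tick late."""

    def __init__(self, event_factory=HostEvent):
        self._factory = event_factory
        self._begin = None
        self._pending = None   # (begin, end) pair that has not been read back yet
        self.last_ms = 0.0     # most recently resolved duration

    def start(self):
        # Resolve the previous tick's bracket first. By now that work has finished,
        # so reading the elapsed time does not wait on anything in flight.
        if self._pending is not None:
            begin, end = self._pending
            self.last_ms = begin.elapsed_time(end)
            self._pending = None
        self._begin = self._factory()
        self._begin.record()

    def stop(self):
        end = self._factory()
        end.record()
        self._pending = (self._begin, end)
```

On a GPU the factory would presumably be `lambda: torch.cuda.Event(enable_timing=True)`, under the assumption stated above that the previous tick's events have completed by the time the next `start()` reads them.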

set_shared_curve stored CPU fp32 tensors and _eff_shared re-ran normalize_curve on every read, so every shared override (the exact path the live denoise/guidance knobs ride) cost a fresh CPU alloc plus a host-to-device copy per slot per step. Canonicalize and device-cast once at the setter (same approach set_channel_gain_tensor already uses), return dict hits directly, and memoize normalized SlotRequest scalar fields once per slot in _Slot.curve_cache. Dtype casting stays at the consumer boundary, so the math is byte-identical.

The strength gate ran eff_strength.abs().any().item() per slot per step — a host-device fence in the middle of the integration loop (the old comment even called it out). Track an "any nonzero" flag alongside each shared curve at set_shared_curve (free for scalar sets, one readback per knob write for tensor sets) and memoize the slot-field flag once per slot, computed without tensor ops for Python-scalar fields. The gate decision is unchanged for every input; the strength tensor itself is only fetched on the path that actually blends.
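A rough sketch of the set-time pattern from the two paragraphs above: canonicalize once at the setter and cache an "any nonzero" flag next to the curve, so the per-step gate reads a Python bool. `SharedCurves` and `is_active` are hypothetical names, and plain tuples stand in for device tensors.

```python
from typing import Dict, Sequence, Tuple, Union

Curve = Union[float, Sequence[float]]


class SharedCurves:
    def __init__(self) -> None:
        self._curves: Dict[str, Tuple[float, ...]] = {}
        self._nonzero: Dict[str, bool] = {}

    def set_shared_curve(self, name: str, value: Curve) -> None:
        # Canonicalize once here (the real setter would also device-cast).
        if isinstance(value, (int, float)):
            canon = (float(value),)
        else:
            canon = tuple(float(v) for v in value)
        self._curves[name] = canon
        # One scan per knob write instead of one readback per slot per step.
        self._nonzero[name] = any(v != 0.0 for v in canon)

    def is_active(self, name: str) -> bool:
        # The hot path reads a cached bool and never touches the curve data.
        return self._nonzero.get(name, False)

    def get(self, name: str) -> Tuple[float, ...]:
        # Dict hit, returned as stored; fetched only on the path that blends.
        return self._curves[name]
```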

TRT path: the timestep fill wrote one Python scalar per row straight into the device buffer (B tiny H2D writes per forward, doubled under CFG). Stage the rows in a pinned host buffer that lives in the shape-keyed bufs cache and issue a single async copy; ordering with the TRT exec matches the other input copies (legacy default stream vs the blocking polygraphy stream). Eager path: every forward allocated a fresh timestep tensor (pageable H2D) and a fresh all-ones attention mask kernel. Reuse both from tiny per-shape caches on the adapter; values and dtypes are unchanged.

set_prompt, the timbre applies, and clear_timbre held state._lock for the full duration of their GPU encodes (text encoder x2 + VAE for timbre). The runner acquires the same lock at the top of every tick (read_knobs / has_pending_refit), so every prompt or timbre change stalled the entire tick loop — no ticks, no gap-fill — for the encoder's duration. Restructure those flows as snapshot-encode-commit: inputs are read under state._lock, the encodes run unlocked (serialized against each other by the new state._encode_lock, preserving the old encodes-do-not-overlap property), and the commit re-checks the new state.cond_epoch, re-snapshotting and re-encoding if another conditioning commit landed meanwhile (so concurrent prompt/timbre/swap effects all land, matching the old serialized end state). The swap commit keeps its atomic locked shape and bumps the epoch. Timbre apply no longer needs rollback state: nothing mutates before the commit.
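The snapshot-encode-commit shape can be sketched roughly as follows. This is an illustration under assumptions, not the session code: `CondState`, the `encode` callable and the `embedding` field are hypothetical, and the five-attempt cap mirrors the drop behavior mentioned in review.

```python
import threading


class CondState:
    def __init__(self):
        self._lock = threading.Lock()         # also taken by the runner every tick
        self._encode_lock = threading.Lock()  # keeps encodes from overlapping
        self.cond_epoch = 0
        self.prompt = ""
        self.embedding = None


def set_prompt(state, prompt, encode, max_attempts=5):
    """Return True if the encode was committed, False if it was dropped."""
    with state._encode_lock:
        for _ in range(max_attempts):
            with state._lock:
                epoch = state.cond_epoch      # snapshot under the lock
            embedding = encode(prompt)        # slow work, state._lock not held
            with state._lock:
                if state.cond_epoch == epoch:
                    # Nothing else committed meanwhile: publish and bump the epoch.
                    state.prompt = prompt
                    state.embedding = embedding
                    state.cond_epoch += 1
                    return True
            # Another conditioning commit landed during the encode: retry.
        return False
```

Nothing is mutated before the commit, which is why a lost attempt needs no rollback.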

The frontier write lands at the adaptive lead, which is floored (0.25s default) for stall safety, and the region between the playhead and that lead is never otherwise rewritten — so the lead floor was the hard audibility floor for every control change regardless of how fast the engine reacted (the measured knob-to-ear was dominated by it). After a real generation (mode=="generate", fresh result) and only within 2s of inbound activity, render a second window from the new latent at playhead + (interval_ema * gain + safety_margin) — the lead formula without the floor and stall bump — and patch it in with the same crossfade/clamp behavior as the frontier write. The frontier write keeps covering the buffer through stalls exactly as before; the close write is opportunistic, so a late landing just leaves the valid older audio in place. Loop-band aware (wraps the target inside an armed band, clamps at B, skips sub-window bands); walk mode passes no band. Does not feed _note_decode_gap (second write in the same tick, like the band-wrap render). Costs one extra windowed decode (~2.4ms) per active-generation tick; idle and untouched sessions pay nothing.
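For illustration only, the two write targets and the repatch condition can be written as below. Function and parameter names are hypothetical, the stall bump on the frontier lead is left out, and the loop-band handling is not modeled.

```python
def frontier_lead_s(interval_ema_s, gain, safety_margin_s, floor_s=0.25):
    """Floored lead used by the frontier write (stall bump omitted in this sketch)."""
    return max(floor_s, interval_ema_s * gain + safety_margin_s)


def near_patch_offset_s(interval_ema_s, gain, safety_margin_s):
    """Same formula without the floor: how far past the playhead the close write lands."""
    return interval_ema_s * gain + safety_margin_s


def should_repatch(mode, fresh, now_s, last_inbound_s, activity_window_s=2.0):
    """Repatch only after a real generation and only near recent inbound activity."""
    return bool(
        mode == "generate"
        and fresh
        and (now_s - last_inbound_s) <= activity_window_s
    )
```

With an assumed tick interval of 40 ms, a gain of 2 and a 50 ms margin, the close write targets about 0.13 s past the playhead while the frontier write stays at the 0.25 s floor.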

_action_audible took the FIRST qualifying slice in arrival order, which equaled the earliest-audible one only while slice starts advanced monotonically. The near-playhead re-patch emits a second, closer-to-playhead slice a few ms after each frontier write, so the first-arriving qualifying slice is now the FARTHEST one — the metric reported the old lead-floor number (~218ms) while re-patched content sat ~120ms from the playhead. Take min over qualifying slices of max(arrival, playhead-reaches-start); identical for monotonic streams, and a late-arriving slice cannot game it (arrival is inside the max).
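A small sketch contrasting the two metrics, assuming each qualifying slice carries an arrival time and the time at which the playhead reaches its start. The `Slice` type and both function names are hypothetical, and the arrival times used in the example are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Slice:
    arrival_ms: float        # when the slice was written into the buffer
    start_reached_ms: float  # when the playhead reaches the slice's start


def audible_first_arrival_order(slices: Iterable[Slice]) -> Optional[float]:
    """Old behavior: judge only the first qualifying slice in arrival order."""
    ordered = sorted(slices, key=lambda s: s.arrival_ms)
    if not ordered:
        return None
    first = ordered[0]
    return max(first.arrival_ms, first.start_reached_ms)


def audible_min_over_slices(slices: Iterable[Slice]) -> Optional[float]:
    """New behavior: earliest moment any qualifying slice is both present and reached."""
    times = [max(s.arrival_ms, s.start_reached_ms) for s in slices]
    return min(times) if times else None
```

With a frontier slice `Slice(40.0, 218.0)` and a repatch slice `Slice(45.0, 58.0)`, the arrival-order metric reports 218 and the min-over-slices metric reports 58. A slice that arrives at 300 ms with an early start still scores 300, because arrival sits inside the `max`.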

BuffMcBigHuge

Full review — read against #238 (`marco/feat/latency`)

Both PRs came out of independent same-week latency reviews of the same stack, so I diffed both branches side by side against the common base dc06b19. Summary: roughly half of each PR is the same three levers implemented twice; the other halves are disjoint and compose. Net of a merge, almost everything in both survives — details and a suggested reconciliation below.

Where we built the same thing twice

1. x0_target_strength gate readback (8b08147 vs #238 §4.4)
Same diagnosis (per-slot-per-step .abs().any().item() fence in the integration loop), same fix (set-time flags). Yours is the better cut: _curve_nonzero is generalized over curve names, the slot-field flag memoizes lazily and costs no tensor op for Python-scalar fields; ours special-cases x0 with an eager slot-init flag. Take yours; ours drops.

2. Shared-curve hot path (e13f03f vs #238 §4.4)
Same target, different halves of the fix:

Yours: device-cast at the setter, plus slot.curve_cache for the slot-field fallback path (our version still re-normalizes that path on every read).
Ours: setter dedup — re-setting the same scalar / tensor object is a no-op, which matters because ace_backend pushes the full knob state every tick, so without dedup the setter still re-normalizes once per tick — plus a read-side device+dtype cast cache, and an ace_backend SDE-curve rebuild key (amplitude, periodicity, src_T) that makes the dedup actually fire for the tensor curve.
Merged shape: your setter device-cast + curve_cache, our setter dedup + SDE-curve key. On dtype: you leave the per-step .to(dtype) at the consumer boundary (still an alloc per step when dtypes differ); we cached that cast too — worth keeping exactly one mechanism for it in the merge.
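To make the dedup half concrete, here is a hedged sketch of how a setter no-op and a keyed curve rebuild fit together. Class names are hypothetical, the curve body is a placeholder (the real SDE curve formula is not shown here), and lists stand in for tensors.

```python
_MISSING = object()


def _same_input(prev, value):
    if prev is _MISSING:
        return False
    if isinstance(prev, (int, float)) and isinstance(value, (int, float)):
        return prev == value          # scalars compare by value
    return prev is value              # tensor-like inputs compare by identity


class DedupCurveSetter:
    def __init__(self):
        self._last_input = {}
        self.normalize_calls = 0

    def set(self, name, value):
        if _same_input(self._last_input.get(name, _MISSING), value):
            return False              # steady knob: skip the re-normalize
        self._last_input[name] = value
        self.normalize_calls += 1     # stands in for normalize + device cast
        return True


class KeyedCurveBuilder:
    """Rebuilds the curve only when its (amplitude, periodicity, src_T) key changes."""

    def __init__(self):
        self._key = None
        self._curve = None
        self.builds = 0

    def get(self, amplitude, periodicity, src_T):
        key = (amplitude, periodicity, src_T)
        if key != self._key:
            # Placeholder body; only the caching behavior matters in this sketch.
            self._curve = [amplitude] * src_T
            self._key = key
            self.builds += 1
        return self._curve            # same object while the key is unchanged
```

Returning the same curve object for an unchanged key is what lets the identity check in the setter fire when the full knob state is pushed every tick.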

3. Timestep H2D in _trt_forward (0d225e6 vs #238 §4.5)
Yours is strictly better: pinned staging buffer cached in bufs + non_blocking copy, vs our fresh pageable torch.tensor per forward. (Our own writeup listed pinned+async as further-proposal #5 — you built it.) You also covered the eager path in model_adapter, which we didn't touch. Take yours. One invariant worth recording: reusing the pinned host buffer across forwards is safe because the per-tick emission D2H drains the stream before the next tick writes it again — if emission ever stops being per-tick, this needs double-buffering.

4. Host syncs / timing — two non-overlapping halves of the same problem.
70ebbff replaces the measurement syncs (the two per-tick full-device torch.cuda.synchronize() in diffusion_backend + the decode bracket in ace_backend) with CUDA events, but keeps the TRT enqueue syncs. #238 §4.3 removes the enqueue syncs (_trt_stream.synchronize() in _trt_forward and _trt_vae_decode, same blocking-stream ordering argument your commit message makes) but keeps the measurement syncs. Each PR removed exactly the syncs the other kept. Merged, the tick has no host sync left except the natural emission .cpu() — and your events fix the caveat we documented (with #238 alone, last_tick_ms degrades to enqueue time; with your events it's true GPU time again, which matters because the lead controller and golden trace consume it). Your lazy event resolution assumes the render's D2H already drained the stream — that still holds under #238's sync removals, so the composition is sound.

What only #240 has — the half we didn't build

Near-playhead repatch (3ba48f6) is the headline. Our review called the playback lead "the dominant audible term" but only proposed lead-floor tuning (with the underrun risk that implies); the repatch is the better idea — keep the floored frontier write for stall safety, opportunistically rewrite close-in, let a late landing lose harmlessly. And the two PRs compose multiplicatively here: #238 shortens knob→fresh-latent (queue_cap=1: −5 ticks measured on per-slot params at d4/s8; same-tick delivery: −1 tick on every generation), this PR shortens fresh-latent→ear (~160 ms of buffer distance per your 218→58 ms numbers). On a merged tip, an ODE denoise change gets both.
Encodes off state._lock (0cddfae): our further-proposal #8 territory, done properly. The snapshot/encode/commit + cond_epoch retry design reads correctly to me. Two small notes: (a) set_prompt writes state.time_signature during the snapshot phase, so a 5×-lost prompt still leaves the ts override applied — probably fine, just asymmetric with the rest of the commit; (b) the 5-strikes drop logs and skips PromptApplied, so clients waiting on the ack must tolerate a missing event (rare in practice, worth knowing).
Golden metric fix (0539d71): necessary the moment slice starts stop being arrival-monotonic, and min-over-max(arrival, reach) can't be gamed by late arrivals. Good catch — the e2e harness from #238 has the same first-qualifying-slice assumption in its knob→fresh-generation metric and will need the same fix.
Eager-path buffer reuse in model_adapter (pinned timestep pair + cached attn-ones).

What only #238 has

queue_cap=1 (the single biggest engine-side win), same-tick latent delivery, the TRT enqueue sync removals, the steering zero-fill dirty flag, the schedule-cache LRU bound, the SDE-curve rebuild cache, and the tests/e2e_latency/ harness (engine-level knob→latent change detection + headless full-stack timing) — which complements your golden-runner work as the build-to-build diffing tool.

Merge mechanics

git merge-tree of the two heads: only acestep/engine/stream.py conflicts — everything else (including ace_backend.py, which both touch) auto-merges. The conflicting hunks are exactly the three duplicated levers above, and the resolution writes itself given the picks: your _curve_nonzero + pinned timestep staging + curve_cache, our setter dedup + queue_cap + same-tick delivery + TRT sync removals + LRU + steering flag.

Suggested order: #238 already has an approval — land it, rebase this stack on it. 01 (events), 05 (session), 06 (repatch), 07 (golden) rebase clean since #238 doesn't touch those files; 02/03/04 shrink to the deltas above. Happy to do the stream.py reconciliation from either side.

One caveat on numbers: both PRs measured against base independently — your 58 ms audible_first doesn't include #238's queue-cap/same-tick wins, our 172 ms knob→fresh-generation doesn't include your repatch. The merged tip deserves a fresh golden + e2e run; the bit-exactness probes from both PRs should still pass, since neither side reorders RNG draws.

Great work — genuinely complementary stacks.

BuffMcBigHuge · 2026-06-11T02:08:20Z

@ryanontheinside — full comparison of this stack against #238 (marco/feat/latency). I diffed both branches against dc06b19 and ran a test merge.

TL;DR

Same week, same stack, two independent reviews. Roughly half overlaps (same diagnosis, different implementations); half is disjoint and composes cleanly. A merged tip keeps almost everything from both PRs. Only acestep/engine/stream.py conflicts — everything else auto-merges.

Duplicated (pick one implementation per lever)

| Lever | #240 | #238 | Recommendation |
|---|---|---|---|
| x0 gate readback | `_curve_nonzero`: generalized, lazy, no tensor op for scalar fields | x0-only eager slot flag | Take #240 |
| Shared-curve hot path | device-cast at setter + `slot.curve_cache` | setter dedup (same scalar/tensor object = no-op) + SDE rebuild key in `ace_backend` + read-side dtype cache | Merge both halves: #240 fixes slot-field re-normalize; #238 fixes per-tick re-push when knobs are steady |
| Timestep H2D | pinned staging in `bufs` + `non_blocking`; eager path in `model_adapter` | pageable single `torch.tensor` copy | Take #240 (our writeup listed pinned+async as a future proposal; you built it) |
| Host syncs | removes measurement syncs → CUDA events | removes enqueue syncs in TRT forward/VAE | Both: each PR removed exactly what the other kept; merged = zero host sync in the tick loop, and your events restore true GPU timing for the lead controller |

Only in #240 (we didn't build these)

Near-playhead repatch — the better answer to the playback-lead audibility floor we identified but only proposed floor-tuning for. Frontier write keeps stall safety; close write is opportunistic. Composes with our engine wins.
Prompt/timbre encodes off state._lock: snapshot/encode/commit + cond_epoch retry. Our further-proposal #8 territory, done properly.
Golden audible_first fix — necessary once slice starts stop being arrival-monotonic; min-over-max(arrival, reach) is the right metric.
Eager-path buffer reuse in model_adapter.

Your numbers (218→58 ms audible_first) were measured without our queue-cap / same-tick wins — merged tip deserves a fresh golden run.

Only in #238 (you didn't build these)

queue_cap=1 — biggest per-slot engine win (−5 ticks measured at d4/s8)
Same-tick latent delivery — −1 tick on every generation
TRT enqueue sync removals (_trt_forward, _trt_vae_decode)
Steering zero-fill dirty flag, schedule-cache LRU bound
tests/e2e_latency/ harness — engine-level + headless full-stack timing for build-to-build diffing (complements your golden work; note: our first-qualifying-slice metric will need your golden fix once repatch lands)

Suggested landing

Land #238 first (already approved).
Rebase this stack on it — commits 01/05/06/07 should rebase clean; 02/03/04 shrink to the deltas above.
Hand-resolve stream.py using the picks in the table.
Fresh golden + e2e on the merged tip.

Happy to do the stream.py reconciliation from either side. Bit-exactness probes from both PRs should still pass — neither side reorders RNG draws.

ryanontheinside added 7 commits June 9, 2026 17:49

ryanontheinside marked this pull request as draft June 10, 2026 02:50

BuffMcBigHuge reviewed Jun 11, 2026

gioelecerati mentioned this pull request Jun 11, 2026

fix(rtmg/web): bulk sends freeze all reads — chunked sends + write_audio recv hardening #245

Merged

ryanontheinside wants to merge 7 commits into main from ryanontheinside/feat/latency/06-near-playhead-repatch
