Ryanontheinside/feat/latency/06 near playhead repatch #240

Draft
ryanontheinside wants to merge 7 commits into main from ryanontheinside/feat/latency/06-near-playhead-repatch

@ryanontheinside

Collaborator

Knob-to-ear latency stack — plan + progress

Untracked working file. Source review: 2026-06-09 (memory: knob-to-ear-code-levers).
Base: main @ dc06b19. Stack is linear; each branch is based on the previous.
Gates per branch: .venv/Scripts/python.exe -m pytest tests/unit + typecheck of touched files.
Golden harness: ONLY at the end, on the stack tip, with Ryan's explicit OK.
No pushes / PRs without Ryan's explicit per-action approval.
Out of scope: shadow migration (lives in DEMON_alt2 / exp branches, own landing plan).

Stack (in order)

  1. ryanontheinside/perf/latency/01-cuda-event-timing — [DONE: pending]
    diffusion_backend.py:102-119 two full-device cuda.synchronize per tick exist only
    to time last_tick_ms; ace_backend.py render_window has a third before .cpu().
    Replace with CUDA events (lazy elapsed_time read; one-tick-stale timing is fine —
    verify nothing reads last_tick_ms same-tick for logic).

  2. ryanontheinside/perf/latency/02-shared-curve-cache — [DONE: pending]
    stream.py set_shared_curve stores CPU fp32; _eff_shared re-normalizes every read;
    _resolve_slot_curves .to(device) per slot per step. Pre-normalize + device-cast at
    the setter (precedent: set_channel_gain_tensor), lazy-cast if device unknown at set
    time. Slot-field fallback path keeps read-time normalize.

  3. ryanontheinside/perf/latency/03-x0-gate-no-readback — [DONE: pending]
    stream.py:1226-1231 eff_strength.abs().any().item() = host fence per slot per step.
    Compute nonzero-flag once at set_shared_curve / slot init; read the cached bool.

  4. ryanontheinside/perf/latency/04-batched-small-h2d — [DONE: pending]
    stream.py:721-723 per-element timestep writes -> one pinned staging copy_.
    model_adapter.py:166-173 per-forward torch.tensor(timesteps) + torch.ones attn
    mask -> cached/reused buffers.

  5. ryanontheinside/perf/latency/05-prompt-encode-unlocked — [DONE: pending]
    streaming/session.py set_prompt (1396), timbre/structure apply (1212, 1660),
    prompt_blend swap (966-990): GPU encode runs under state._lock while the runner
    takes the same lock every tick (ace_backend read_knobs:541, has_pending_refit:566).
    Encode outside the lock, swap under it; generation counter so a stale encode never
    overwrites a newer one.

  6. ryanontheinside/feat/latency/06-near-playhead-repatch — [DONE: pending]
    pipeline_runner.py: after a fresh produce, additionally render+patch a window at
    playhead + transit margin (not just playhead + lead, floor 0.25s @ :239). Latent
    state covers all positions (gap-fill proves it). Kills the lead floor as the
    audibility floor for knob changes. Mind the monotonic-decode comment (:245-248) —
    repatch is a deliberate, separate write path. Default behavior decided here;
    measure with golden latency report.

Parity traps

  • NEVER remove/reorder randn_like calls (RNG draw order parity; see memory).
  • Curve pre-cast must produce byte-identical math to read-time .to(dtype) casts.
  • ace_backend timing fields (last_tick_ms/last_dec_ms) feed the runner trace + golden
    latency report; keep field semantics.

Status log

  • 2026-06-09: plan written, stack not started.
  • 2026-06-09: all six branches committed (01:70ebbff 02:e13f03f 03:8b08147
    04:0d225e6 05:0cddfae 06:55f06c5), each green on tests/unit (149 passed).
    Commit messages must go via git commit -F <file> — the settings deny
    list pattern-matches message text (e.g. "*rd *", "*nc *") and multiline
    commands choke the harness.
  • Next: GPU end-to-end on stack tip (drain parity script, live smoke,
    golden harness last — Ryan approved the golden run 2026-06-09).
  • 2026-06-09 e2e: drain parity PARITY OK (8 latents bit-identical, ODE+SDE);
    full golden 12/12 PASSED at tip in 4:51. knob_step initially showed
    first=218/full=578 (vs 234/594 baseline) — repatch WAS firing (109
    lat_nearpatch lines, close~0.12-0.18s) but _action_audible took the
    first-ARRIVING qualifying slice; repatch slices arrive later with
    earlier starts. Fixed metric to min-over-slices of max(arrival,
    playhead-reaches-start) (0539d71); trace line amended into 3ba48f6
    (06 tip moved 55f06c5 -> 0539d71). audible_full is structurally
    playhead-to-frontier distance; the lever for it is shadow migration,
    not this stack. Re-validation run + final full golden pending.
  • 2026-06-09 FINAL: corrected metric run: knob_step first=58ms (was 234
    baseline / 218 old-metric), full=578 (structural). prompt_change:
    next_slice=31ms during the encode (ticks never stall, branch 5 works);
    prompt ack 156->313ms (encode now shares GPU with ticks — right trade).
    Final gates at tip 0539d71: tests/unit 149 passed; golden 12/12 in 4:48.
    STACK COMPLETE, local only — pushes/PRs need Ryan's per-action approval.

The produce bracket ran two full-device torch.cuda.synchronize() per
tick and render_window/render_full a third, all purely to measure
last_tick_ms/last_dec_ms. Record CUDA events around the engine step and
the decode instead: the produce bracket resolves lazily at the start of
the next produce (one tick stale; both readers are diagnostics), and
the decode bracket resolves right after the waveform's D2H copy, which
already drained the stream. Removes every measurement-only host-device
sync from the tick loop and lets CPU prep overlap GPU work.
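
A minimal sketch of the deferred event read described above, assuming PyTorch's torch.cuda.Event API. LazyTickTimer, tick, and demo are hypothetical names for illustration and are not the code in diffusion_backend.py. The sketch returns None when PyTorch or a CUDA device is not available.

```python
try:
    import torch
except ImportError:  # keep the sketch importable without PyTorch
    torch = None


class LazyTickTimer:
    """Times a GPU region with CUDA events and reads the result one tick late."""

    def __init__(self):
        self.last_tick_ms = None  # most recently resolved duration, or None
        self._pending = None      # (start, end) events recorded by the previous tick

    def _resolve_pending(self):
        if self._pending is None:
            return
        start, end = self._pending
        # Waits on this single event only, not on the whole device. By the
        # next tick the recorded work has normally finished already.
        end.synchronize()
        self.last_tick_ms = start.elapsed_time(end)
        self._pending = None

    def tick(self, gpu_work):
        self._resolve_pending()  # publish the previous tick's timing (one tick stale)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        result = gpu_work()
        end.record()             # no host-side wait here
        self._pending = (start, end)
        return result


def demo():
    """Returns a duration in ms, or None when no CUDA device is present."""
    if torch is None or not torch.cuda.is_available():
        return None
    timer = LazyTickTimer()
    x = torch.randn(512, 512, device="cuda")
    timer.tick(lambda: x @ x)
    timer.tick(lambda: x @ x)  # resolves the first tick's events
    return timer.last_tick_ms
```

The one-tick staleness is only acceptable because, as noted above, the readers of the timing fields are diagnostics; a consumer that needed same-tick values would have to wait on the event immediately.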

set_shared_curve stored CPU fp32 tensors and _eff_shared re-ran
normalize_curve on every read, so every shared override (the exact
path the live denoise/guidance knobs ride) cost a fresh CPU alloc plus
a host-to-device copy per slot per step. Canonicalize and device-cast
once at the setter (same approach set_channel_gain_tensor already
uses), return dict hits directly, and memoize normalized SlotRequest
scalar fields once per slot in _Slot.curve_cache. Dtype casting stays
at the consumer boundary, so the math is byte-identical.

The strength gate ran eff_strength.abs().any().item() per slot per
step — a host-device fence in the middle of the integration loop (the
old comment even called it out). Track an "any nonzero" flag alongside
each shared curve at set_shared_curve (free for scalar sets, one
readback per knob write for tensor sets) and memoize the slot-field
flag once per slot, computed without tensor ops for Python-scalar
fields. The gate decision is unchanged for every input; the strength
tensor itself is only fetched on the path that actually blends.
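
A framework-free sketch of the set-time flag, using Python sequences in place of tensors. SharedCurves and curve_nonzero are hypothetical names, and the real setter also normalizes and device-casts, which is omitted here.

```python
class SharedCurves:
    """Caches an 'any nonzero' flag when a curve is written, not when it is read."""

    def __init__(self):
        self._curves = {}
        self._nonzero = {}

    def set_shared_curve(self, name, value):
        if isinstance(value, (int, float)):
            # Scalar set: the flag is free, no data has to be inspected.
            flag = value != 0
        else:
            # Sequence set: one scan per knob write. With device tensors this
            # is where the single readback per write would happen.
            flag = any(v != 0 for v in value)
        self._curves[name] = value
        self._nonzero[name] = flag

    def curve_nonzero(self, name):
        """Cheap read for the per-slot, per-step gate; no device access."""
        return self._nonzero.get(name, False)


curves = SharedCurves()
curves.set_shared_curve("x0_target_strength", 0.0)
gate_closed = not curves.curve_nonzero("x0_target_strength")
curves.set_shared_curve("x0_target_strength", [0.0, 0.3, 0.0])
gate_open = curves.curve_nonzero("x0_target_strength")
```

The cost moves from once per slot per step to once per knob write, which is the trade the paragraph above describes.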

TRT path: the timestep fill wrote one Python scalar per row straight
into the device buffer (B tiny H2D writes per forward, doubled under
CFG). Stage the rows in a pinned host buffer that lives in the
shape-keyed bufs cache and issue a single async copy; ordering with
the TRT exec matches the other input copies (legacy default stream vs
the blocking polygraphy stream).

Eager path: every forward allocated a fresh timestep tensor (pageable
H2D) and a fresh all-ones attention mask kernel. Reuse both from tiny
per-shape caches on the adapter; values and dtypes are unchanged.

set_prompt, the timbre applies, and clear_timbre held state._lock for
the full duration of their GPU encodes (text encoder x2 + VAE for
timbre). The runner acquires the same lock at the top of every tick
(read_knobs / has_pending_refit), so every prompt or timbre change
stalled the entire tick loop — no ticks, no gap-fill — for the
encoder's duration.

Restructure those flows as snapshot-encode-commit: inputs are read
under state._lock, the encodes run unlocked (serialized against each
other by the new state._encode_lock, preserving the old
encodes-do-not-overlap property), and the commit re-checks the new
state.cond_epoch, re-snapshotting and re-encoding if another
conditioning commit landed meanwhile (so concurrent prompt/timbre/swap
effects all land, matching the old serialized end state). The swap
commit keeps its atomic locked shape and bumps the epoch. Timbre apply
no longer needs rollback state: nothing mutates before the commit.
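
The snapshot-encode-commit flow can be sketched roughly as follows. This is an illustration under assumptions, not the code in streaming/session.py: CondState, commit_swap, and apply_prompt are hypothetical names, encode stands in for the GPU encoders, and the attempt limit of 5 is taken from the review discussion rather than from the source.

```python
import threading


class CondState:
    """Hypothetical stand-in for the session state that the tick loop reads."""

    def __init__(self, timbre):
        self._lock = threading.Lock()         # taken by the runner every tick
        self._encode_lock = threading.Lock()  # keeps encodes from overlapping
        self.cond_epoch = 0                   # bumped by every conditioning commit
        self.timbre = timbre
        self.embedding = None


def commit_swap(state, timbre):
    """Short, fully locked commit that bumps the epoch (no encode involved)."""
    with state._lock:
        state.timbre = timbre
        state.cond_epoch += 1


def apply_prompt(state, text, encode, max_attempts=5):
    """Returns True if an embedding was committed, False if every attempt went stale."""
    with state._encode_lock:
        for _ in range(max_attempts):
            with state._lock:                 # snapshot: brief lock hold
                epoch = state.cond_epoch
                timbre = state.timbre
            embedding = encode(text, timbre)  # slow part; state._lock is free
            with state._lock:                 # commit: brief lock hold
                if state.cond_epoch == epoch:
                    state.embedding = embedding
                    state.cond_epoch += 1
                    return True
            # Another conditioning commit landed meanwhile: re-snapshot, re-encode.
        return False
```

Because the runner only ever contends for state._lock during the short snapshot and commit sections, ticks keep running for the full duration of encode.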

The frontier write lands at the adaptive lead, which is floored
(0.25s default) for stall safety, and the region between the playhead
and that lead is never otherwise rewritten — so the lead floor was the
hard audibility floor for every control change regardless of how fast
the engine reacted (the measured knob-to-ear was dominated by it).

After a real generation (mode=="generate", fresh result) and only
within 2s of inbound activity, render a second window from the new
latent at playhead + (interval_ema * gain + safety_margin) — the lead
formula without the floor and stall bump — and patch it in with the
same crossfade/clamp behavior as the frontier write. The frontier
write keeps covering the buffer through stalls exactly as before; the
close write is opportunistic, so a late landing just leaves the valid
older audio in place. Loop-band aware (wraps the target inside an
armed band, clamps at B, skips sub-window bands); walk mode passes no
band. Does not feed _note_decode_gap (second write in the same tick,
like the band-wrap render). Costs one extra windowed decode (~2.4ms)
per active-generation tick; idle and untouched sessions pay nothing.
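
The difference between the two write targets can be sketched as below. The function names and the numeric inputs are illustrative assumptions, and the way the floor and stall bump combine in the real lead controller is assumed (a max against the floor, an additive bump); only the unfloored expression interval_ema * gain + safety_margin comes from the description above.

```python
def frontier_lead_s(interval_ema, gain, safety_margin, floor=0.25, stall_bump=0.0):
    """Lead for the frontier write: floored for stall safety (assumed shape)."""
    return max(floor, interval_ema * gain + safety_margin + stall_bump)


def near_playhead_lead_s(interval_ema, gain, safety_margin):
    """Lead for the opportunistic close write: same terms, no floor, no stall bump."""
    return interval_ema * gain + safety_margin


def should_repatch(mode, fresh, seconds_since_input, activity_window=2.0):
    """Only repatch after a real generation and shortly after inbound activity."""
    return mode == "generate" and fresh and seconds_since_input <= activity_window


# Hypothetical inputs: 40 ms tick interval, gain 2.0, 50 ms safety margin.
far = frontier_lead_s(0.04, 2.0, 0.05)        # floor wins: 0.25 s ahead of the playhead
near = near_playhead_lead_s(0.04, 2.0, 0.05)  # roughly 0.13 s ahead of the playhead
```

With these assumed inputs the close write lands well inside the floored lead, which is why the floor stops being the audibility floor while the frontier write still protects against stalls.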

_action_audible took the FIRST qualifying slice in arrival order,
which equaled the earliest-audible one only while slice starts
advanced monotonically. The near-playhead re-patch emits a second,
closer-to-playhead slice a few ms after each frontier write, so the
first-arriving qualifying slice is now the FARTHEST one — the metric
reported the old lead-floor number (~218ms) while re-patched content
sat ~120ms from the playhead. Take min over qualifying slices of
max(arrival, playhead-reaches-start); identical for monotonic streams,
and a late-arriving slice cannot game it (arrival is inside the max).
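
A small sketch of the corrected metric next to the old one. Each slice is modeled as a pair (arrival_ms, reach_ms), both measured from the control action, where reach_ms is when the playhead reaches the slice start. The function names are hypothetical, the old metric's exact return value is assumed, and the numbers are illustrative.

```python
def first_arriving_latency_ms(slices):
    """Old behavior (assumed shape): judge only the first slice to arrive."""
    if not slices:
        return None
    arrival, reach = min(slices, key=lambda s: s[0])
    return max(arrival, reach)


def audible_latency_ms(slices):
    """New behavior: earliest moment any qualifying slice is both present and reached."""
    candidates = [max(arrival, reach) for arrival, reach in slices]
    return min(candidates) if candidates else None


# Frontier write arrives first but sits far from the playhead; the re-patch
# arrives a few ms later and sits much closer.
with_repatch = [(20, 218), (25, 58)]
old = first_arriving_latency_ms(with_repatch)  # 218
new = audible_latency_ms(with_repatch)         # 58

# A slice that starts early but arrives late cannot lower the result,
# because its arrival time is inside the max.
late = audible_latency_ms([(20, 218), (300, 10)])  # 218
```

For a stream whose slice starts advance monotonically the two functions agree, so the change only affects runs where a second, closer write exists.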

@ryanontheinside marked this pull request as draft on June 10, 2026 02:50

@BuffMcBigHuge BuffMcBigHuge left a comment

Collaborator


Full review — read against #238 (marco/feat/latency)

Both PRs came out of independent same-week latency reviews of the same stack, so I diffed both branches side by side against the common base dc06b19. Summary: roughly half of each PR is the same three levers implemented twice; the other halves are disjoint and compose. Net of a merge, almost everything in both survives — details and a suggested reconciliation below.

Where we built the same thing twice

1. x0_target_strength gate readback (8b08147 vs #238 §4.4)
Same diagnosis (per-slot-per-step .abs().any().item() fence in the integration loop), same fix (set-time flags). Yours is the better cut: _curve_nonzero is generalized over curve names, the slot-field flag memoizes lazily and costs no tensor op for Python-scalar fields; ours special-cases x0 with an eager slot-init flag. Take yours; ours drops.

2. Shared-curve hot path (e13f03f vs #238 §4.4)
Same target, different halves of the fix:

  • Yours: device-cast at the setter, plus slot.curve_cache for the slot-field fallback path (our version still re-normalizes that path on every read).
  • Ours: setter dedup — re-setting the same scalar / tensor object is a no-op, which matters because ace_backend pushes the full knob state every tick, so without dedup the setter still re-normalizes once per tick — plus a read-side device+dtype cast cache, and an ace_backend SDE-curve rebuild key (amplitude, periodicity, src_T) that makes the dedup actually fire for the tensor curve.
  • Merged shape: your setter device-cast + curve_cache, our setter dedup + SDE-curve key. On dtype: you leave the per-step .to(dtype) at the consumer boundary (still an alloc per step when dtypes differ); we cached that cast too — worth keeping exactly one mechanism for it in the merge.

3. Timestep H2D in _trt_forward (0d225e6 vs #238 §4.5)
Yours is strictly better: pinned staging buffer cached in bufs + non_blocking copy, vs our fresh pageable torch.tensor per forward. (Our own writeup listed pinned+async as further-proposal #5 — you built it.) You also covered the eager path in model_adapter, which we didn't touch. Take yours. One invariant worth recording: reusing the pinned host buffer across forwards is safe because the per-tick emission D2H drains the stream before the next tick writes it again — if emission ever stops being per-tick, this needs double-buffering.

4. Host syncs / timing — two non-overlapping halves of the same problem.
70ebbff replaces the measurement syncs (the two per-tick full-device torch.cuda.synchronize() in diffusion_backend + the decode bracket in ace_backend) with CUDA events, but keeps the TRT enqueue syncs. #238 §4.3 removes the enqueue syncs (_trt_stream.synchronize() in _trt_forward and _trt_vae_decode, same blocking-stream ordering argument your commit message makes) but keeps the measurement syncs. Each PR removed exactly the syncs the other kept. Merged, the tick has no host sync left except the natural emission .cpu() — and your events fix the caveat we documented (with #238 alone, last_tick_ms degrades to enqueue time; with your events it's true GPU time again, which matters because the lead controller and golden trace consume it). Your lazy event resolution assumes the render's D2H already drained the stream — that still holds under #238's sync removals, so the composition is sound.
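
To make the setter dedup from item 2 concrete, here is a rough framework-free sketch. It is not the #238 implementation: CurveStore and its fields are hypothetical, and normalize stands in for normalize_curve plus the device cast.

```python
_MISSING = object()


class CurveStore:
    """Skips re-normalizing when the same knob value is pushed again."""

    def __init__(self, normalize):
        self._normalize = normalize
        self._raw = {}        # last value handed to the setter, per curve name
        self._canonical = {}  # normalized form served to readers

    def set_shared_curve(self, name, value):
        """Returns True if the curve was re-normalized, False if deduplicated."""
        prev = self._raw.get(name, _MISSING)
        is_scalar = isinstance(value, (int, float))
        unchanged = prev is value or (
            is_scalar and isinstance(prev, (int, float)) and prev == value
        )
        if unchanged:
            return False  # same scalar or same object: steady-state push is a no-op
        self._raw[name] = value
        self._canonical[name] = self._normalize(value)
        return True
```

Identity comparison for non-scalars is cheap but would miss an in-place mutation of the same object, which is presumably why the caller-side rebuild key (amplitude, periodicity, src_T) matters for the tensor curve: it decides when a new object is built at all.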

What only #240 has — the half we didn't build

  • Near-playhead repatch (3ba48f6) is the headline. Our review called the playback lead "the dominant audible term" but only proposed lead-floor tuning (with the underrun risk that implies); the repatch is the better idea — keep the floored frontier write for stall safety, opportunistically rewrite close-in, let a late landing lose harmlessly. And the two PRs compose multiplicatively here: #238 shortens knob→fresh-latent (queue_cap=1: −5 ticks measured on per-slot params at d4/s8; same-tick delivery: −1 tick on every generation), this PR shortens fresh-latent→ear (~160 ms of buffer distance per your 218→58 ms numbers). On a merged tip, an ODE denoise change gets both.
  • Encodes off state._lock (0cddfae): our further-proposal #8 territory, done properly. The snapshot/encode/commit + cond_epoch retry design reads correctly to me. Two small notes: (a) set_prompt writes state.time_signature during the snapshot phase, so a 5×-lost prompt still leaves the ts override applied — probably fine, just asymmetric with the rest of the commit; (b) the 5-strikes drop logs and skips PromptApplied, so clients waiting on the ack must tolerate a missing event (rare in practice, worth knowing).
  • Golden metric fix (0539d71): necessary the moment slice starts stop being arrival-monotonic, and min-over-max(arrival, reach) can't be gamed by late arrivals. Good catch — the e2e harness from #238 has the same first-qualifying-slice assumption in its knob→fresh-generation metric and will need the same fix.
  • Eager-path buffer reuse in model_adapter (pinned timestep pair + cached attn-ones).

What only #238 has

queue_cap=1 (the single biggest engine-side win), same-tick latent delivery, the TRT enqueue sync removals, the steering zero-fill dirty flag, the schedule-cache LRU bound, the SDE-curve rebuild cache, and the tests/e2e_latency/ harness (engine-level knob→latent change detection + headless full-stack timing) — which complements your golden-runner work as the build-to-build diffing tool.

Merge mechanics

git merge-tree of the two heads: only acestep/engine/stream.py conflicts — everything else (including ace_backend.py, which both touch) auto-merges. The conflicting hunks are exactly the three duplicated levers above, and the resolution writes itself given the picks: your _curve_nonzero + pinned timestep staging + curve_cache, our setter dedup + queue_cap + same-tick delivery + TRT sync removals + LRU + steering flag.

Suggested order: #238 already has an approval — land it, rebase this stack on it. 01 (events), 05 (session), 06 (repatch), 07 (golden) rebase clean since #238 doesn't touch those files; 02/03/04 shrink to the deltas above. Happy to do the stream.py reconciliation from either side.

One caveat on numbers: both PRs measured against base independently — your 58 ms audible_first doesn't include #238's queue-cap/same-tick wins, our 172 ms knob→fresh-generation doesn't include your repatch. The merged tip deserves a fresh golden + e2e run; the bit-exactness probes from both PRs should still pass, since neither side reorders RNG draws.

Great work — genuinely complementary stacks.

@BuffMcBigHuge

Collaborator

@ryanontheinside — full comparison of this stack against #238 (marco/feat/latency). I diffed both branches against dc06b19 and ran a test merge.

TL;DR

Same week, same stack, two independent reviews. Roughly half overlaps (same diagnosis, different implementations); half is disjoint and composes cleanly. A merged tip keeps almost everything from both PRs. Only acestep/engine/stream.py conflicts — everything else auto-merges.


Duplicated (pick one implementation per lever)

| Lever | #240 | #238 | Recommendation |
|---|---|---|---|
| x0 gate readback | _curve_nonzero: generalized, lazy, no tensor op for scalar fields | x0-only eager slot flag | Take #240 |
| Shared-curve hot path | device-cast at setter + slot.curve_cache | setter dedup (same scalar/tensor object = no-op) + SDE rebuild key in ace_backend + read-side dtype cache | Merge both halves: #240 fixes slot-field re-normalize; #238 fixes per-tick re-push when knobs are steady |
| Timestep H2D | pinned staging in bufs + non_blocking; eager path in model_adapter | pageable single torch.tensor copy | Take #240 (our writeup listed pinned+async as a future proposal; you built it) |
| Host syncs | removes measurement syncs → CUDA events | removes enqueue syncs in TRT forward/VAE | Both: each PR removed exactly what the other kept; merged = zero host sync in the tick loop, and your events restore true GPU timing for the lead controller |

Only in #240 (we didn't build these)

  • Near-playhead repatch — the better answer to the playback-lead audibility floor we identified but only proposed floor-tuning for. Frontier write keeps stall safety; close write is opportunistic. Composes with our engine wins.
  • Prompt/timbre encodes off state._lock: snapshot/encode/commit + cond_epoch retry. Our further-proposal #8 territory, done properly.
  • Golden audible_first fix — necessary once slice starts stop being arrival-monotonic; min-over-max(arrival, reach) is the right metric.
  • Eager-path buffer reuse in model_adapter.

Your numbers (218→58 ms audible_first) were measured without our queue-cap / same-tick wins — merged tip deserves a fresh golden run.


Only in #238 (you didn't build these)

  • queue_cap=1 — biggest per-slot engine win (−5 ticks measured at d4/s8)
  • Same-tick latent delivery — −1 tick on every generation
  • TRT enqueue sync removals (_trt_forward, _trt_vae_decode)
  • Steering zero-fill dirty flag, schedule-cache LRU bound
  • tests/e2e_latency/ harness — engine-level + headless full-stack timing for build-to-build diffing (complements your golden work; note: our first-qualifying-slice metric will need your golden fix once repatch lands)

Suggested landing

  1. Land #238 (Latency Improvement Experimentation) first (already approved).
  2. Rebase this stack on it — commits 01/05/06/07 should rebase clean; 02/03/04 shrink to the deltas above.
  3. Hand-resolve stream.py using the picks in the table.
  4. Fresh golden + e2e on the merged tip.

Happy to do the stream.py reconciliation from either side. Bit-exactness probes from both PRs should still pass — neither side reorders RNG draws.
