Ryanontheinside/perf/throughput/07 bench stream tick#241
Draft
ryanontheinside wants to merge 7 commits into
trt.IProfiler harness over a decoder engine at production stream shapes. Reports unprofiled wall time, per-layer averages, and a category rollup (attention/gemm/norm/pointwise/myelin) so precision and fusion work can be aimed at the actual hot layers instead of inferred from whole-engine timing. Flags opaque Myelin ForeignNode blobs rather than miscounting them.
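The category rollup described above could look roughly like the sketch below. It is not the harness from this PR: the keyword table, `categorize`, and `LayerTimeRollup` are hypothetical names, and the class only duck-types the `report_layer_time(layer_name, ms)` callback of `trt.IProfiler` so that it runs without TensorRT installed. Real layer names depend on the engine build.

```python
from collections import defaultdict

# Hypothetical keyword table; checked in order, so Myelin blobs are flagged
# before any substring inside their fused name can match another category.
CATEGORY_KEYWORDS = (
    ("myelin", ("foreignnode", "myelin")),
    ("attention", ("attn", "attention", "softmax")),
    ("gemm", ("matmul", "gemm", "linear")),
    ("norm", ("norm",)),
)


def categorize(layer_name):
    """Map a layer name to a coarse category; unmatched names are pointwise."""
    lowered = layer_name.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(k in lowered for k in keywords):
            return category
    return "pointwise"


class LayerTimeRollup:
    """Accumulates per-layer times via a trt.IProfiler-style callback."""

    def __init__(self):
        self.layer_ms = defaultdict(float)
        self.calls = defaultdict(int)

    def report_layer_time(self, layer_name, ms):
        self.layer_ms[layer_name] += ms
        self.calls[layer_name] += 1

    def per_layer_average(self):
        return {n: self.layer_ms[n] / self.calls[n] for n in self.layer_ms}

    def by_category(self):
        totals = defaultdict(float)
        for name, avg in self.per_layer_average().items():
            totals[categorize(name)] += avg
        return dict(totals)
```

Checking the Myelin keywords first is what keeps an opaque `ForeignNode` blob from being credited to whichever category its fused name happens to mention.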
The fast-path gate in _tick_pt called eff_strength.abs().any().item() per slot per step — a host-device fence whose only purpose was a yes/no. Compute that bool once where the value is actually set: at set_shared_curve (companion _shared_curves_nonzero dict) and at slot init for the per-request field (immutable after submit). The strength tensor itself is now materialized only inside the blend branch that consumes it. Gate decisions are byte-identical; the per-step readback is gone.
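A minimal sketch of the "compute the bool where the value is set" pattern, using plain Python lists as stand-ins for the strength tensors. Only `set_shared_curve` and `_shared_curves_nonzero` are names from this change; `CurveGate` and `needs_blend` are hypothetical.

```python
class CurveGate:
    """Caches a yes/no flag per shared curve at write time."""

    def __init__(self):
        self._shared_curves = {}
        self._shared_curves_nonzero = {}

    def set_shared_curve(self, name, values):
        values = list(values)
        self._shared_curves[name] = values
        # Evaluated once per write instead of once per slot per step.
        # With a device tensor this is the single place a readback would occur.
        self._shared_curves_nonzero[name] = any(v != 0 for v in values)

    def needs_blend(self, name):
        # Per-step gate: a dict lookup on the host, no device sync.
        return self._shared_curves_nonzero.get(name, False)
```

The per-step check then reads only host state, so the gate decision is unchanged while the fence moves to the rare write.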
Under CFG every tick ran two serialized batched forwards, each with its own enqueue + host sync. When the combined pos+neg pair count fits the engine profile's batch ceiling (snapshotted from the optimization profile, same read as _compute_max_pipeline_depth), the two passes now run as ONE execute and the output rows are split. Two-pass remains as the fallback when the combined batch exceeds the profile (e.g. full depth with full CFG on a b8 engine).

This also fixes silently-unguided full CFG on engines whose output dtype equals the pipeline dtype (the production bf16-hybrid "mixed" turbo build): _trt_forward's post-execute cast is a no-op view of the shared TRT output buffer there, so on the old two-pass path the neg execute overwrote vt_pos_all in place and APG computed guidance from pos == neg. Verified on a 5090: the engine is bit-exact per row at B=8 vs B=4 (zero cross-row contamination), forced two-pass reproduces main bit-exactly, and fused output restores real guidance. The fallback path gets an explicit clone before the second execute for the same reason; "initialize"-mode RCFG clones its vt_neg_cached at the cache write (that cache outlives the tick).

Numerics note: fused neg rows pad to the combined encoder max_L. The decoder discards attention masks and attends zero rows by convention (see DecoderForExport), so this matters only when pos/neg max_L differ; production null_conditioning expands the null embedding to the positive's L, making the fused batch length-uniform. The ace_adapter_parity blessed capture is regenerated for the CPU-tier cfg_full scenario (unequal L there by construction; only that scenario moves, max_diff 4.1e-3, all seven others bit-identical).
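The fuse-or-fallback decision and the shared-buffer hazard can be illustrated with a toy engine. Everything here is a stand-in: `FakeEngine` and `cfg_forward` are hypothetical, rows are floats rather than tensors, and the "forward" is just a doubling. The one property it models is that every execute returns the same reused output buffer.

```python
class FakeEngine:
    """Stand-in for an engine that writes into one reused output buffer."""

    def __init__(self, max_batch):
        self.max_batch = max_batch  # batch ceiling from the profile
        self._out = []              # shared buffer, reused across executes

    def execute(self, rows):
        assert len(rows) <= self.max_batch
        self._out[:] = [r * 2.0 for r in rows]  # overwrite in place
        return self._out                        # caller sees the shared buffer


def cfg_forward(engine, pos_rows, neg_rows):
    n_pos = len(pos_rows)
    if n_pos + len(neg_rows) <= engine.max_batch:
        # Fused: one execute, then split the output rows.
        out = engine.execute(pos_rows + neg_rows)
        return out[:n_pos], out[n_pos:]  # list slices copy (tensor slices would not)
    # Fallback: two passes. Copy the first result before the second execute,
    # otherwise the second pass overwrites it in the shared buffer.
    vt_pos = list(engine.execute(pos_rows))
    vt_neg = list(engine.execute(neg_rows))
    return vt_pos, vt_neg
```

Dropping the `list(...)` copy in the fallback reproduces the failure mode described above: both names end up pointing at the negative pass's output.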
ACEAdapter.batched_forward re-padded and re-concatenated encoder states, masks, and context latents on every forward (twice per tick under unfused CFG) even though they are frozen per (slot, condition) and ticks repeat the same pair-sets for a slot's whole schedule. The pad/cat result is now memoized in a small LRU keyed on the identity of the input tensor lists; entries hold strong refs to their sources so the keyed ids cannot be recycled while cached. Cache hits return the exact tensors the rebuild would have produced, so outputs are bit-identical. Also stops mutating the caller's enc/mask lists in place.
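A sketch of an identity-keyed LRU of this kind, assuming the key is built from the `id()` of each item in the input list (the change may key differently). `IdentityLRU` and `pad_and_stack` are hypothetical names, and lists of floats stand in for tensors.

```python
from collections import OrderedDict


class IdentityLRU:
    """Memoizes build(sources), keyed on the identity of each source object."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self._entries = OrderedDict()  # key -> (sources, value)

    def get_or_build(self, sources, build):
        key = tuple(id(s) for s in sources)
        hit = self._entries.get(key)
        if hit is not None:
            self._entries.move_to_end(key)  # mark as most recently used
            return hit[1]
        value = build(sources)
        # Hold strong refs to the sources: while the entry lives, their ids
        # cannot be reused by a different object, so the key stays valid.
        self._entries[key] = (tuple(sources), value)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return value


def pad_and_stack(rows):
    """Right-pad rows with zeros to a common width without mutating inputs."""
    width = max(len(r) for r in rows)
    return [list(r) + [0.0] * (width - len(r)) for r in rows]
```

A hit returns the very object the first build produced, which is why cached and uncached outputs cannot differ.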
_trt_forward ended every execute with _trt_stream.synchronize(), blocking the host for the full engine execution: once per tick, and twice under unfused CFG. The host join is replaced by an event recorded on the (torch-wrapped) polygraphy stream that torch's current stream waits on, so the output cast and everything the tick enqueues afterwards stays correctly ordered on-device while the CPU overlaps the engine: per-slot velocity assembly, APG, integration, and the next pass's input copies all enqueue under the running forward.

Ordering notes:
- Input-side ordering (copy_ writes vs engine reads) was always implicit: torch's default current stream is the legacy default stream and the polygraphy stream is blocking, so they never run concurrently. Unchanged; now documented at the call site.
- Buffer-cache misses in _ensure_trt_bufs drain the TRT stream before allocating/evicting, since an in-flight forward may still read an evicted entry's memory once freed to the allocator. Shape changes are transition-rare; steady-state stays sync-free.
- Host reads downstream (.item()/.cpu()) sync the current stream, which now waits on the completion event, so they remain correct.
The integration loop launched one kernel chain per slot for the fast Euler path (plus a sentinel randn_like whose product is exactly zero). Rows with sentinel curves and DCW inactive now accumulate during the loop and integrate after it in two multi-tensor launches (torch._foreach_mul with host-side dt scalars, then torch._foreach_add on the per-slot views) — no H2D copy, no cat, no shared output buffer. The arithmetic is the exact fast-path expression: bit-identical to the eager per-slot kernels (parity rail confirms all eight blessed scenarios torch.equal), possibly bf16-LSB different from the inductor-compiled variant where mul+add fused. Skipping the sentinel randn_like also skips its RNG advance; safe because slot noise is drawn per-request from explicit seeds, never from ambient generator state mid-schedule. Slots with velocity_scale / ode_noise_curve / DCW keep the existing per-slot kernels.
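A rough sketch of the deferred fast-path integration, assuming per-slot rows of a single latents tensor and host-side `dt` floats. `integrate_fast_rows` and its arguments are hypothetical, and it uses the in-place `torch._foreach_add_` on the views, which may differ from the exact call in the change. The `torch._foreach_*` functions are private PyTorch APIs.

```python
import torch


def integrate_fast_rows(latents, velocities, dts, fast_rows):
    """Euler-step the selected rows in two multi-tensor calls.

    latents, velocities: tensors indexed by slot along dim 0.
    dts: host-side Python floats, one per slot.
    fast_rows: slot indices collected during the loop (sentinel curves, no DCW).
    """
    if not fast_rows:
        return latents
    views = [latents[i] for i in fast_rows]  # views into latents, no copy
    scaled = torch._foreach_mul(
        [velocities[i] for i in fast_rows],
        [dts[i] for i in fast_rows],         # scalars stay on the host
    )
    torch._foreach_add_(views, scaled)       # writes through to latents
    return latents
```

Rows not listed in `fast_rows` are left untouched, so slots that need the per-slot kernels can still be handled inside the loop.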
Drives Session.stream + StreamDenoise (the production tick path) at a fixed depth and reports ms/tick, finished latents/sec, and audio-x-realtime, with CFG and DCW toggles. Latent-only: isolates the DiT tick loop the stream is gated on. Run identical args on two revisions to compare them.
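The three reported figures could be derived from raw tick timings along these lines. This is only an illustration: `tick_metrics` is a hypothetical helper, and `audio_seconds_per_latent` is an assumed parameter for how much audio one finished latent represents, which the description does not state.

```python
def tick_metrics(tick_seconds, finished_latents, audio_seconds_per_latent):
    """Summarize a benchmark run from per-tick wall times (in seconds)."""
    total = sum(tick_seconds)
    return {
        "ms_per_tick": 1000.0 * total / len(tick_seconds),
        "latents_per_sec": finished_latents / total,
        # Seconds of audio produced per second of wall time.
        "x_realtime": finished_latents * audio_seconds_per_latent / total,
    }
```

Because all three numbers share the same wall-time denominator, comparing two revisions only makes sense when both runs use identical arguments.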
~3.7ms/tick under CFG plus the correctness fix