Ryanontheinside/perf/throughput/07 bench stream tick #241

Draft
ryanontheinside wants to merge 7 commits into main from ryanontheinside/perf/throughput/07-bench-stream-tick

@ryanontheinside

Collaborator

~3.7ms/tick under CFG plus the correctness fix

trt.IProfiler harness over a decoder engine at production stream
shapes. Reports unprofiled wall time, per-layer averages, and a
category rollup (attention/gemm/norm/pointwise/myelin) so precision
and fusion work can be aimed at the actual hot layers instead of
inferred from whole-engine timing. Flags opaque Myelin ForeignNode
blobs rather than miscounting them.

The fast-path gate in _tick_pt called eff_strength.abs().any().item()
per slot per step — a host-device fence whose only purpose was a
yes/no. Compute that bool once where the value is actually set: at
set_shared_curve (companion _shared_curves_nonzero dict) and at slot
init for the per-request field (immutable after submit). The strength
tensor itself is now materialized only inside the blend branch that
consumes it. Gate decisions are byte-identical; the per-step readback
is gone.
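
A minimal sketch of the "compute the bool where the value is set" pattern, in plain Python with lists standing in for device tensors. Only `set_shared_curve` and `_shared_curves_nonzero` are names from the description above; `CurveStore` and `needs_blend` are hypothetical and exist only for illustration.

```python
class CurveStore:
    """Toy stand-in for the object that owns shared curves (hypothetical)."""

    def __init__(self):
        self._shared_curves = {}
        self._shared_curves_nonzero = {}  # companion dict: name -> bool

    def set_shared_curve(self, name, values):
        self._shared_curves[name] = list(values)
        # Evaluate the yes/no once, at the only place the value changes.
        self._shared_curves_nonzero[name] = any(v != 0 for v in values)

    def needs_blend(self, name):
        # Per-step gate: a dict lookup, no reduction over the curve itself.
        return self._shared_curves_nonzero.get(name, False)


store = CurveStore()
store.set_shared_curve("strength", [0.0, 0.0, 0.0])
all_zero_gate = store.needs_blend("strength")
store.set_shared_curve("strength", [0.0, 0.5, 0.0])
nonzero_gate = store.needs_blend("strength")
```

With real tensors the reduction at set time would still cost one readback, but it is paid once per update rather than once per slot per step.
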

Under CFG every tick ran two serialized batched forwards, each with
its own enqueue + host sync. When the combined pos+neg pair count
fits the engine profile's batch ceiling (snapshotted from the
optimization profile, same read as _compute_max_pipeline_depth), the
two passes now run as ONE execute and the output rows are split.
Two-pass remains as the fallback when the combined batch exceeds the
profile (e.g. full depth with full CFG on a b8 engine).
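
The fused/two-pass decision can be sketched as below. This is an illustration only: `plan_cfg_passes` and `split_fused_output` are hypothetical helpers, and placing the positive rows first in the fused batch is an assumption, not something the description states.

```python
def plan_cfg_passes(n_pos, n_neg, max_batch):
    """Return 'fused' when pos+neg rows fit in one execute, else 'two_pass'."""
    return "fused" if n_pos + n_neg <= max_batch else "two_pass"


def split_fused_output(rows, n_pos):
    """Split fused output rows back into (pos, neg); assumes pos rows come first."""
    return rows[:n_pos], rows[n_pos:]


# 4 pos + 4 neg pairs fit a batch ceiling of 8; 8 + 8 do not.
small = plan_cfg_passes(4, 4, max_batch=8)
full = plan_cfg_passes(8, 8, max_batch=8)
pos_rows, neg_rows = split_fused_output(["p0", "p1", "n0", "n1"], n_pos=2)
```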

This also fixes silently-unguided full CFG on engines whose output
dtype equals the pipeline dtype (the production bf16-hybrid "mixed"
turbo build): _trt_forward's post-execute cast is a no-op view of
the shared TRT output buffer there, so on the old two-pass path the
neg execute overwrote vt_pos_all in place and APG computed guidance
from pos == neg. Verified on a 5090: the engine is bit-exact per row
at B=8 vs B=4 (zero cross-row contamination), forced two-pass
reproduces main bit-exactly, and fused output restores real
guidance. The fallback path gets an explicit clone before the second
execute for the same reason; "initialize"-mode RCFG clones its
vt_neg_cached at the cache write (that cache outlives the tick).
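
The aliasing hazard can be reproduced on CPU with a few lines of PyTorch. In this sketch `out_buf` and `fake_forward` are hypothetical stand-ins for the shared engine output buffer and `_trt_forward`; the only behaviour relied on is that `Tensor.to(dtype)` returns the same tensor when the dtype already matches.

```python
import torch

# Stands in for the output buffer that every execute writes into.
out_buf = torch.zeros(2, 3, dtype=torch.bfloat16)


def fake_forward(fill, pipeline_dtype=torch.bfloat16):
    """Hypothetical execute: writes into out_buf, then 'casts' the result."""
    out_buf.fill_(fill)
    # When dtypes already match, .to() returns out_buf itself (no copy).
    return out_buf.to(pipeline_dtype)


# Old two-pass behaviour: the second execute overwrites what vt_pos aliases.
vt_pos = fake_forward(1.0)
vt_neg = fake_forward(-1.0)
aliased_equal = torch.equal(vt_pos, vt_neg)

# Fallback-path fix: clone before the second execute.
vt_pos_safe = fake_forward(1.0).clone()
vt_neg_safe = fake_forward(-1.0)
cloned_equal = torch.equal(vt_pos_safe, vt_neg_safe)
```

Here `aliased_equal` ends up true because both names refer to the same storage, which is the pos == neg condition described above; the cloned variant keeps the two results distinct.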

Numerics note: fused neg rows pad to the combined encoder max_L. The
decoder discards attention masks and attends zero rows by convention
(see DecoderForExport), so this matters only when pos/neg max_L
differ; production null_conditioning expands the null embedding to
the positive's L, making the fused batch length-uniform. The
ace_adapter_parity blessed capture is regenerated for the CPU-tier
cfg_full scenario (unequal L there by construction; only that
scenario moves, max_diff 4.1e-3, all seven others bit-identical).

ACEAdapter.batched_forward re-padded and re-concatenated encoder
states, masks, and context latents on every forward (twice per tick
under unfused CFG) even though they are frozen per (slot, condition)
and ticks repeat the same pair-sets for a slot's whole schedule. The
pad/cat result is now memoized in a small LRU keyed on the identity
of the input tensor lists; entries hold strong refs to their sources
so the keyed ids cannot be recycled while cached. Cache hits return
the exact tensors the rebuild would have produced, so outputs are
bit-identical. Also stops mutating the caller's enc/mask lists in
place.
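
A minimal sketch of an identity-keyed LRU of this kind, using only the standard library. `IdentityLRU` and `get_or_build` are hypothetical names, and keying on the ids of the individual list elements (rather than of the list object) is an assumption made for the example.

```python
from collections import OrderedDict


class IdentityLRU:
    """Small LRU keyed on the identity of the source objects (hypothetical)."""

    def __init__(self, maxsize=8):
        self._maxsize = maxsize
        self._entries = OrderedDict()  # key -> (sources, value)

    def get_or_build(self, sources, build):
        key = tuple(id(s) for s in sources)
        hit = self._entries.get(key)
        if hit is not None:
            self._entries.move_to_end(key)  # mark as most recently used
            return hit[1]
        value = build(sources)
        # Hold strong refs to the sources so their ids cannot be
        # recycled for other objects while the entry is cached.
        self._entries[key] = (tuple(sources), value)
        if len(self._entries) > self._maxsize:
            self._entries.popitem(last=False)  # evict least recently used
        return value


calls = []


def build(srcs):
    """Stand-in for the pad/cat rebuild; counts how often it runs."""
    calls.append(1)
    return [x for s in srcs for x in s]


cache = IdentityLRU(maxsize=2)
a, b = [1, 2], [3]
first = cache.get_or_build([a, b], build)
second = cache.get_or_build([a, b], build)  # hit: same object back
```

Passing a fresh outer list on each call does not defeat the cache here, and the caller's lists are never mutated.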

_trt_forward ended every execute with _trt_stream.synchronize(),
blocking the host for the full engine execution — once per tick, and
twice under unfused CFG. The host join is replaced by an event
recorded on the (torch-wrapped) polygraphy stream that torch's
current stream waits on, so the output cast and everything the tick
enqueues afterwards stays correctly ordered on-device while the CPU
overlaps the engine: per-slot velocity assembly, APG, integration,
and the next pass's input copies all enqueue under the running
forward.

Ordering notes:
- Input-side ordering (copy_ writes vs engine reads) was always
  implicit — torch's default current stream is the legacy default
  stream and the polygraphy stream is blocking, so they never run
  concurrently. Unchanged; now documented at the call site.
- Buffer-cache misses in _ensure_trt_bufs drain the TRT stream
  before allocating/evicting, since an in-flight forward may still
  read an evicted entry's memory once freed to the allocator.
  Shape changes are transition-rare; steady-state stays sync-free.
- Host reads downstream (.item()/.cpu()) sync the current stream,
  which now waits on the completion event, so they remain correct.
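
The event-based hand-off can be sketched with plain torch CUDA streams. This is not the PR's code: a `torch.cuda.Stream` stands in for the torch-wrapped polygraphy stream, an elementwise multiply stands in for the engine execute, and `run_demo` is a hypothetical name. The function does nothing when CUDA is unavailable.

```python
import torch


def run_demo():
    """Return None without CUDA; otherwise True if the ordering held."""
    if not torch.cuda.is_available():
        return None
    engine_stream = torch.cuda.Stream()  # stand-in for the engine's stream
    x = torch.ones(1 << 20, device="cuda")
    # Make sure x is fully written before the side stream reads it.
    engine_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(engine_stream):
        y = x * 2.0  # "engine" work, enqueued on the side stream
    done = torch.cuda.Event()
    done.record(engine_stream)
    # Device-side wait: the host returns immediately and can keep enqueuing.
    torch.cuda.current_stream().wait_event(done)
    z = y + 1.0  # ordered after the engine work on-device
    # .item() syncs the current stream, which already waits on the event.
    return bool((z == 3.0).all().item())


result = run_demo()
```

The host only blocks at the final `.item()`, which mirrors the third ordering note: downstream host reads stay correct because the current stream carries the dependency.
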

The integration loop launched one kernel chain per slot for the fast
Euler path (plus a sentinel randn_like whose product is exactly
zero). Rows with sentinel curves and DCW inactive now accumulate
during the loop and integrate after it in two multi-tensor launches
(torch._foreach_mul with host-side dt scalars, then
torch._foreach_add on the per-slot views) — no H2D copy, no cat, no
shared output buffer.

The arithmetic is the exact fast-path expression: bit-identical to
the eager per-slot kernels (parity rail confirms all eight blessed
scenarios torch.equal), possibly bf16-LSB different from the
inductor-compiled variant where mul+add fused. Skipping the sentinel
randn_like also skips its RNG advance; safe because slot noise is
drawn per-request from explicit seeds, never from ambient generator
state mid-schedule. Slots with velocity_scale / ode_noise_curve /
DCW keep the existing per-slot kernels.
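
A minimal sketch of the two-launch integration, runnable on CPU. The update `x += dt * v` and its sign are assumptions about the fast-path expression, `integrate_fast_rows` is a hypothetical name, and the in-place `torch._foreach_add_` is used here so the per-slot views are updated directly; the `torch._foreach_*` functions are private PyTorch APIs.

```python
import torch


def integrate_fast_rows(xs, vs, dts):
    """Euler step x_i += dt_i * v_i for all rows in two foreach calls."""
    steps = torch._foreach_mul(vs, dts)  # dts: list of Python floats, no H2D copy
    torch._foreach_add_(xs, steps)       # in-place on the per-slot views


latents = torch.arange(6.0).reshape(2, 3)
expected = torch.stack([
    latents[0] + torch.ones(3) * 0.5,
    latents[1] + torch.full((3,), 2.0) * -0.25,
])

xs = [latents[0], latents[1]]              # views into the shared tensor
vs = [torch.ones(3), torch.full((3,), 2.0)]
integrate_fast_rows(xs, vs, [0.5, -0.25])
```

Because `xs` holds views, `latents` itself is updated; no concatenation or shared output buffer is needed.
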

Drives Session.stream + StreamDenoise (the production tick path) at a
fixed depth and reports ms/tick, finished latents/sec, and
audio-x-realtime, with CFG and DCW toggles. Latent-only: isolates the DiT
tick loop the stream is gated on. Run identical args on two revisions
to compare them.