autoresearch: profile run 20260420T010050Z (9a18569) #1
Draft
chokevin wants to merge 2 commits into
source SHA: 9a18569
shapes: voice
impls: fp16,marlin
repeats: 5
GPU: NVIDIA A100-SXM4-80GB
marlin SHA: 1f25790bdd49fba53106164a24666dade68d7c90
Headline (8b-b1 marlin): 0.7 TFLOPS
chokevin added a commit that referenced this pull request on Apr 20, 2026
Iter-5 cluster smoke caught: ncu in PR #1 produced six CSVs containing zero kernel rows because of ERR_NVGPUCTRPERM. NVIDIA gates GPU performance-counter access via the NVreg_RestrictProfilingToAdminUsers kernel-module flag; the supported in-container bypass is SYS_ADMIN.

Without ncu we have wall-clock timing and a Perfetto trace, but no dram__throughput / hmma pct / scheduler stalls. That means we cannot classify Marlin kernels into the bottleneck taxonomy in docs/profiling/marlin-bottlenecks.md, which is the W1 exit criterion.

The capability is gated by run.allowGpuPerfCounters (default true), so clusters where SYS_ADMIN is forbidden by policy can opt out and accept the analysis gap. It is scoped to the profiler container only: the pod has no host mounts, so SYS_ADMIN is contained to the GPU profiling syscalls.
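The "zero kernel rows" failure above is easy to miss because the profiler still writes a file. A minimal sketch of a check the smoke step could run on each CSV, assuming the ncu output interleaves log lines that start with "==" (such as "==PROF==" or "==ERROR==") with a CSV header and one row per kernel metric. The function names and the sample column names are hypothetical, not taken from this repo:

```python
import csv
import io


def count_kernel_rows(csv_text: str) -> int:
    """Count CSV data rows, ignoring profiler log lines.

    Assumption: log lines begin with "==" and the first remaining
    line is the CSV header.
    """
    payload = [
        line for line in csv_text.splitlines()
        if line.strip() and not line.startswith("==")
    ]
    if not payload:
        return 0
    rows = list(csv.reader(io.StringIO("\n".join(payload))))
    return max(len(rows) - 1, 0)  # subtract the header row


def has_counter_permission_error(csv_text: str) -> bool:
    """True if the output mentions the perf-counter permission error."""
    return "ERR_NVGPUCTRPERM" in csv_text


if __name__ == "__main__":
    blocked = (
        "==PROF== Connected to process 1\n"
        "==ERROR== ERR_NVGPUCTRPERM - permission denied\n"
    )
    print(count_kernel_rows(blocked), has_counter_permission_error(blocked))
```

Failing the smoke run when count_kernel_rows returns 0 would surface the permission problem at profile time rather than at analysis time.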
chokevin added a commit that referenced this pull request on Apr 20, 2026
Filled in docs/profiling/marlin-bottlenecks.md from PR #1 wall-clock data and first-principles roofline arithmetic.

Key finding (and a surprise vs the roadmap's original assumption that we'd be HBM-bound): Marlin sits at a flat ~49us floor across all voice-decode shapes (M=1..8, N=4096-8192, K=4096), independent of batch size. That floor is ~12x the HBM lower bound for INT4 weight loads (4.1us at 8b). Marlin is launch / fixed-overhead bound, NOT memory- or compute-bound. At 8b shapes Marlin LOSES to fp16 cuBLAS (x0.62) because cuBLAS hits a lower fixed floor (~31us). At 70b-tp2 shapes both impls tie at the ~49us floor.

This rewrites W2+'s attack surface from 'better in-loop throughput' to 'reduce per-call overhead': persistent kernel + pre-baked workspace + fused scale prologue + CUDA Graph capture of the whole decode call. A larger-batch sweep (M=32..128) is deferred; at those shapes we expect to enter the HBM-bound regime. W1's job was to identify the regime here, and the regime is launch-overhead.

Also: documented the DCGM-vs-ncu constraint on voice-agent-flex. ncu counters were blocked by cluster-wide DCGM monitoring (we tried SYS_ADMIN; it's not the kernel-module gate, it's the runtime exclusivity). Bottleneck classification doesn't depend on ncu data because the dominant signal is unambiguous from timing alone. For W2+ deep-dives we'll need ops to run dcgmi profile --pause for the run window. Tracked as a TODO in the doc.

Updated the ROADMAP W1 checklist to reflect what shipped and the surprise finding. Added a 5th category (launch / fixed-overhead bound) to the bottleneck taxonomy in marlin-bottlenecks.md so future analyses can classify into it.
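The roofline arithmetic behind those numbers can be reproduced in a few lines. This is a sketch under stated assumptions, not the calculation in marlin-bottlenecks.md: it takes the 8b shape to be N=4096, K=4096, uses the A100-SXM4-80GB datasheet peak of 2039 GB/s for HBM bandwidth, and counts only INT4 weight bytes (no scales, activations, or outputs). The function names are illustrative.

```python
# Assumption: A100-SXM4-80GB datasheet peak HBM bandwidth, in bytes/s.
HBM_BW_BYTES_PER_S = 2.039e12


def int4_weight_bytes(n: int, k: int) -> int:
    """Bytes needed to read an N x K INT4 weight matrix (2 weights per byte)."""
    return n * k // 2


def hbm_lower_bound_us(n: int, k: int, bw: float = HBM_BW_BYTES_PER_S) -> float:
    """Minimum time to stream the weights once from HBM, in microseconds."""
    return int4_weight_bytes(n, k) / bw * 1e6


def achieved_tflops(m: int, n: int, k: int, latency_us: float) -> float:
    """GEMM throughput (2*M*N*K FLOPs) at a measured latency, in TFLOPS."""
    return 2 * m * n * k / (latency_us * 1e-6) / 1e12


if __name__ == "__main__":
    n, k = 4096, 4096          # assumed 8b shape
    marlin_floor_us = 49.0     # measured floors quoted in the commit message
    cublas_floor_us = 31.0

    bound = hbm_lower_bound_us(n, k)
    print(f"HBM lower bound: {bound:.1f} us")
    print(f"Marlin floor / HBM bound: {marlin_floor_us / bound:.1f}x")
    print(f"Marlin at M=1: {achieved_tflops(1, n, k, marlin_floor_us):.2f} TFLOPS")
    print(f"cuBLAS floor / Marlin floor: {cublas_floor_us / marlin_floor_us:.2f}")
```

Under these assumptions the weight-streaming bound comes out near the quoted 4.1us and the M=1 throughput near the 0.7 TFLOPS headline, which suggests the measured floor is dominated by something other than weight traffic.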
Autoresearch run
run: 20260420T010050Z
source SHA: 9a18569
shapes: voice
impls: fp16,marlin
repeats: 5
marlin SHA: 1f25790bdd49fba53106164a24666dade68d7c90
Results