autoresearch: profile run 20260420T010050Z (9a18569) #1

Draft
chokevin wants to merge 2 commits into main from autoresearch/profile-20260420T010050Z-9a18569

@chokevin
Owner

Autoresearch run 20260420T010050Z

  • source SHA: 9a18569
  • GPU: NVIDIA A100-SXM4-80GB (cc 8.0, 79.3 GB)
  • CUDA / torch / triton: 12.4 / 2.4.0a0+07cecf4168.nv24.05 / 3.0.0
  • shapes: voice
  • impls: fp16,marlin
  • repeats: 5
  • marlin SHA: 1f25790bdd49fba53106164a24666dade68d7c90

Results

| shape | impl | ms_mean | ms_p95 | TFLOPS | speedup vs fp16 | error |
|---|---|---|---|---|---|---|
| 8b-b1 | fp16 | 0.031 | 0.033 | 1.1 | x1.00 | |
| 8b-b1 | marlin | 0.049 | 0.050 | 0.7 | x0.64 | |
| 8b-b4 | fp16 | 0.031 | 0.032 | 4.3 | x1.00 | |
| 8b-b4 | marlin | 0.049 | 0.050 | 2.7 | x0.63 | |
| 8b-b8 | fp16 | 0.031 | 0.032 | 8.6 | x1.00 | |
| 8b-b8 | marlin | 0.050 | 0.052 | 5.4 | x0.62 | |
| 70b-tp2-b1 | fp16 | 0.050 | 0.055 | 1.3 | x1.00 | |
| 70b-tp2-b1 | marlin | 0.050 | 0.050 | 1.4 | x1.01 | |
| 70b-tp2-b4 | fp16 | 0.049 | 0.049 | 5.5 | x1.00 | |
| 70b-tp2-b4 | marlin | 0.049 | 0.050 | 5.4 | x0.99 | |
| 70b-tp2-b8 | fp16 | 0.049 | 0.049 | 10.9 | x1.00 | |
| 70b-tp2-b8 | marlin | 0.049 | 0.049 | 10.9 | x1.00 | |
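
For readers checking the table, here is a minimal sketch of how the TFLOPS and speedup columns can be recomputed from ms_mean. It assumes the usual 2·M·N·K FLOP count for a GEMM and assumes N = K = 4096 for the 8b shapes; those dimensions are not stated in the PR body, but they reproduce the reported numbers. The function names are hypothetical, not taken from the benchmark harness. Because it starts from the rounded ms_mean values, the speedup can differ from the table in the last digit (0.63 here versus x0.64 reported for 8b-b1).

```python
def gemm_tflops(m: int, n: int, k: int, ms: float) -> float:
    """TFLOPS for an (M x K) @ (K x N) GEMM that took `ms` milliseconds."""
    flops = 2 * m * n * k          # one multiply and one add per MAC
    seconds = ms * 1e-3
    return flops / seconds / 1e12


def speedup_vs_baseline(baseline_ms: float, impl_ms: float) -> float:
    """Values above 1.0 mean the implementation is faster than the baseline."""
    return baseline_ms / impl_ms


# 8b-b1 row: M=1, assumed N=K=4096, ms_mean taken from the table.
fp16_ms, marlin_ms = 0.031, 0.049
print(round(gemm_tflops(1, 4096, 4096, fp16_ms), 1))      # 1.1
print(round(gemm_tflops(1, 4096, 4096, marlin_ms), 1))    # 0.7
print(round(speedup_vs_baseline(fp16_ms, marlin_ms), 2))  # 0.63
```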

roofline

source SHA: 9a18569
shapes:     voice
impls:      fp16,marlin
repeats:    5
GPU:        NVIDIA A100-SXM4-80GB
marlin SHA: 1f25790bdd49fba53106164a24666dade68d7c90

Headline (8b-b1 marlin): 0.7 TFLOPS

chokevin added the autoresearch label ("Created by swordfish-autoresearch chart runs") Apr 20, 2026
chokevin added a commit that referenced this pull request Apr 20, 2026
The iter-5 cluster smoke test caught a failure: ncu in PR #1 produced
six CSVs containing zero kernel rows because of ERR_NVGPUCTRPERM.
NVIDIA gates GPU performance-counter access via the
NVreg_RestrictProfilingToAdminUsers kernel-module option; the supported
in-container bypass is the SYS_ADMIN capability (CAP_SYS_ADMIN).

Without ncu we have wall-clock timing and a Perfetto trace, but no
dram__throughput / hmma pct / scheduler stalls. That means we cannot
classify Marlin kernels into the bottleneck taxonomy in
docs/profiling/marlin-bottlenecks.md, which is the W1 exit criterion.

The added capability is gated by run.allowGpuPerfCounters (default
true) so clusters where SYS_ADMIN is forbidden by policy can opt out
and accept the analysis gap. It is scoped to the profiler container
only; the pod has no host mounts, which limits what SYS_ADMIN can
reach beyond the GPU profiling syscalls.
chokevin added a commit that referenced this pull request Apr 20, 2026
Filled in docs/profiling/marlin-bottlenecks.md from PR #1 wall-clock
data and first-principles roofline arithmetic. Key finding (and a
surprise vs the roadmap's original assumption that we'd be HBM-bound):

  Marlin sits at a flat ~49us floor across all voice-decode shapes
  (M=1..8, N=4096-8192, K=4096), independent of batch size. That floor
  is ~12x the HBM lower bound for INT4 weight loads (4.1us at 8b).
  Marlin is launch / fixed-overhead bound, NOT memory- or compute-bound.
  At 8b shapes Marlin LOSES to fp16 cuBLAS (x0.62 to x0.64) because
  cuBLAS hits a lower fixed floor (~31us). At 70b-tp2 shapes both impls
  tie at the ~49us floor.
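
The HBM lower bound quoted above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions, not the calculation from marlin-bottlenecks.md: it counts only the INT4 weight bytes (0.5 byte per weight, ignoring scales, activations and output), assumes N = K = 4096 for the 8b shapes, and uses 2039 GB/s as the A100-SXM4-80GB peak HBM bandwidth, which is the datasheet figure rather than a number measured in this run. The function name is hypothetical.

```python
# Assumed datasheet peak for A100-SXM4-80GB; not measured in this PR.
A100_80GB_HBM_BYTES_PER_S = 2039e9


def int4_weight_load_floor_us(n: int, k: int,
                              bandwidth: float = A100_80GB_HBM_BYTES_PER_S) -> float:
    """Minimum time (microseconds) to stream an N x K INT4 weight matrix from HBM."""
    weight_bytes = n * k * 0.5     # 4 bits per weight
    return weight_bytes / bandwidth * 1e6


floor_us = int4_weight_load_floor_us(4096, 4096)
measured_us = 49.0                 # Marlin floor from the results table
print(f"HBM floor: {floor_us:.1f} us")                  # HBM floor: 4.1 us
print(f"measured / floor: {measured_us / floor_us:.0f}x")  # measured / floor: 12x
```

A measured time an order of magnitude above this bound, and flat in M, is what points at fixed per-call overhead rather than memory traffic.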

This shifts the W2+ attack surface from 'better in-loop throughput' to
'reduce per-call overhead': persistent kernel + pre-baked workspace +
fused scale prologue + CUDA Graph capture of the whole decode call.
A larger-batch sweep (M=32..128) is deferred. At those shapes we
expect to enter the HBM-bound regime; W1's job was to identify the
regime for the voice shapes, and that regime is launch-overhead.

Also: documented the DCGM-vs-ncu constraint on voice-agent-flex. ncu
counters were blocked by cluster-wide DCGM monitoring. Adding
SYS_ADMIN did not help: the blocker is not the kernel-module gate but
DCGM's exclusive hold on the counters at runtime. Bottleneck
classification doesn't depend on ncu data because the dominant signal
is unambiguous from timing alone. For W2+ deep-dives we'll need ops to
run dcgmi profile --pause for the run window. Tracked as a TODO in the
doc.
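
One way the pause-for-the-run-window step could be scripted is sketched below. It is an illustration only: the helper name dcgm_profiling_paused is hypothetical, and it assumes dcgmi is on PATH and can reach the DCGM host engine that holds the counters. The matching dcgmi profile --resume runs in a finally block so monitoring comes back even if the profiled command fails. The command runner is injectable so the ordering can be checked without a GPU.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def dcgm_profiling_paused(run=subprocess.run):
    """Pause DCGM profiling metrics for the duration of the with-block."""
    run(["dcgmi", "profile", "--pause"], check=True)
    try:
        yield
    finally:
        # Always hand the counters back to DCGM, even on failure.
        run(["dcgmi", "profile", "--resume"], check=True)


# Usage sketch (bench.py and marlin.csv are placeholder names):
#
# with dcgm_profiling_paused():
#     subprocess.run(
#         ["ncu", "--csv", "--log-file", "marlin.csv", "python", "bench.py"],
#         check=True,
#     )
```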

Updated ROADMAP W1 checklist to reflect what shipped and the surprise
finding. Added a 5th category (launch / fixed-overhead bound) to the
bottleneck taxonomy in marlin-bottlenecks.md so future analyses can
classify into it.