autoresearch: profile run 20260420T010050Z (9a18569) by chokevin · Pull Request #1 · chokevin/swordfish

chokevin · 2026-04-20T01:01:41Z

Autoresearch run `20260420T010050Z`

source SHA: 9a18569
GPU: NVIDIA A100-SXM4-80GB (cc 8.0, 79.3 GB)
CUDA / torch / triton: 12.4 / 2.4.0a0+07cecf4168.nv24.05 / 3.0.0
shapes: voice impls: fp16,marlin repeats: 5
marlin SHA: 1f25790bdd49fba53106164a24666dade68d7c90

Results

shape	impl	ms_mean	ms_p95	TFLOPS	speedup vs fp16
8b-b1	fp16	0.031	0.033	1.1	x1.00
8b-b1	marlin	0.049	0.050	0.7	x0.64
8b-b4	fp16	0.031	0.032	4.3	x1.00
8b-b4	marlin	0.049	0.050	2.7	x0.63
8b-b8	fp16	0.031	0.032	8.6	x1.00
8b-b8	marlin	0.050	0.052	5.4	x0.62
70b-tp2-b1	fp16	0.050	0.055	1.3	x1.00
70b-tp2-b1	marlin	0.050	0.050	1.4	x1.01
70b-tp2-b4	fp16	0.049	0.049	5.5	x1.00
70b-tp2-b4	marlin	0.049	0.050	5.4	x0.99
70b-tp2-b8	fp16	0.049	0.049	10.9	x1.00
70b-tp2-b8	marlin	0.049	0.049	10.9	x1.00

source SHA: 9a18569 shapes: voice impls: fp16,marlin repeats: 5 GPU: NVIDIA A100-SXM4-80GB marlin SHA: 1f25790bdd49fba53106164a24666dade68d7c90 Headline (8b-b1 marlin): 0.7 TFLOPS

Iter-5 cluster smoke caught: ncu in PR #1 produced six CSVs containing zero kernel rows because of ERR_NVGPUCTRPERM. NVIDIA gates GPU performance-counter access via the NVreg_RestrictProfilingToAdminUsers kernel-module flag; the supported in-container bypass is SYS_ADMIN. Without ncu we have wall-clock timing and a Perfetto trace, but no dram__throughput / hmma pct / scheduler stalls — i.e. we cannot classify Marlin kernels into the bottleneck taxonomy in docs/profiling/marlin-bottlenecks.md, which is the W1 exit criterion. Gated by run.allowGpuPerfCounters (default true) so clusters where SYS_ADMIN is forbidden by policy can opt out and accept the analysis gap. Scoped to the profiler container only — the pod has no host mounts so SYS_ADMIN is contained to the GPU profiling syscalls.

Filled in docs/profiling/marlin-bottlenecks.md from PR #1 wall-clock data and first-principles roofline arithmetic. Key finding (and a surprise vs the roadmap's original assumption that we'd be HBM-bound): Marlin sits at a flat ~49us floor across all voice-decode shapes (M=1..8, N=4096-8192, K=4096), independent of batch size. That floor is ~12x the HBM lower bound for INT4 weight loads (4.1us at 8b). Marlin is launch / fixed-overhead bound, NOT memory- or compute-bound. At 8b shapes Marlin LOSES to fp16 cuBLAS (x0.62) because cuBLAS hits a lower fixed floor (~31us). At 70b-tp2 shapes both impls tie at the ~49us floor. This rewrites W2+'s attack surface from 'better in-loop throughput' to 'reduce per-call overhead': persistent kernel + pre-baked workspace + fused scale prologue + CUDA Graph capture of the whole decode call. A larger-batch sweep (M=32..128) is deferred — at those shapes we expect to enter the HBM-bound regime; W1's job was to identify the regime here, and the regime is launch-overhead. Also: documented the DCGM-vs-ncu constraint on voice-agent-flex. ncu counters were blocked by cluster-wide DCGM monitoring (we tried SYS_ADMIN; it's not the kernel-module gate, it's the runtime exclusivity). Bottleneck classification doesn't depend on ncu data because the dominant signal is unambiguous from timing alone. For W2+ deep-dives we'll need ops to dcgmi profile --pause for the run window. Tracked as a TODO in the doc. Updated ROADMAP W1 checklist to reflect what shipped and the surprise finding. Added a 5th category (launch / fixed-overhead bound) to the bottleneck taxonomy in marlin-bottlenecks.md so future analyses can classify into it.

autoresearch: profile run 20260420T010050Z

ac70ba2

source SHA: 9a18569 shapes: voice impls: fp16,marlin repeats: 5 GPU: NVIDIA A100-SXM4-80GB marlin SHA: 1f25790bdd49fba53106164a24666dade68d7c90 Headline (8b-b1 marlin): 0.7 TFLOPS

chokevin added the autoresearch Created by swordfish-autoresearch chart runs label Apr 20, 2026

autoresearch: link PR in INDEX.md

0911b53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

autoresearch: profile run 20260420T010050Z (9a18569)#1

autoresearch: profile run 20260420T010050Z (9a18569)#1
chokevin wants to merge 2 commits into
mainfrom
autoresearch/profile-20260420T010050Z-9a18569

chokevin commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

chokevin commented Apr 20, 2026

Autoresearch run 20260420T010050Z

Results

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Autoresearch run `20260420T010050Z`