8 changes: 8 additions & 0 deletions docs/profiling/INDEX.md
@@ -0,0 +1,8 @@
# Autoresearch run index

One row per profiling run produced by the swordfish-autoresearch chart.
Newest first. PR column links to the draft PR carrying the artifacts.

| timestamp (UTC) | source SHA | shapes | impls | GPU | 8b-b1 marlin TFLOPS | run dir | PR |
|---|---|---|---|---|---|---|---|
| 20260420T014943Z | `20ab7f3` | voice | fp16,marlin | NVIDIA A100-SXM4-80GB | 0.7 | [`20260420T014943Z/`](./marlin/20260420T014943Z/) | [link](https://github.com/chokevin/swordfish/pull/2) |
10 changes: 10 additions & 0 deletions docs/profiling/marlin/20260420T014943Z/70b-tp2-b1.ncu.csv
@@ -0,0 +1,10 @@
==PROF== Connected to process 906 (/usr/bin/python3.10)

==ERROR== An error was reported by the driver

==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.
==ERROR== Failed to profile "distribution_elementwise_grid..." in process 906
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
10 changes: 10 additions & 0 deletions docs/profiling/marlin/20260420T014943Z/70b-tp2-b4.ncu.csv
@@ -0,0 +1,10 @@
==PROF== Connected to process 995 (/usr/bin/python3.10)

==ERROR== An error was reported by the driver

==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.
==ERROR== Failed to profile "distribution_elementwise_grid..." in process 995
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
10 changes: 10 additions & 0 deletions docs/profiling/marlin/20260420T014943Z/70b-tp2-b8.ncu.csv
@@ -0,0 +1,10 @@
==PROF== Connected to process 1084 (/usr/bin/python3.10)

==ERROR== An error was reported by the driver

==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.
==ERROR== Failed to profile "distribution_elementwise_grid..." in process 1084
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
10 changes: 10 additions & 0 deletions docs/profiling/marlin/20260420T014943Z/8b-b1.ncu.csv
@@ -0,0 +1,10 @@
==PROF== Connected to process 639 (/usr/bin/python3.10)

==ERROR== An error was reported by the driver

==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.
==ERROR== Failed to profile "distribution_elementwise_grid..." in process 639
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
10 changes: 10 additions & 0 deletions docs/profiling/marlin/20260420T014943Z/8b-b4.ncu.csv
@@ -0,0 +1,10 @@
==PROF== Connected to process 728 (/usr/bin/python3.10)

==ERROR== An error was reported by the driver

==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.
==ERROR== Failed to profile "distribution_elementwise_grid..." in process 728
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
10 changes: 10 additions & 0 deletions docs/profiling/marlin/20260420T014943Z/8b-b8.ncu.csv
@@ -0,0 +1,10 @@
==PROF== Connected to process 817 (/usr/bin/python3.10)

==ERROR== An error was reported by the driver

==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.
==ERROR== Failed to profile "distribution_elementwise_grid..." in process 817
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
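
Review note: all six `*.ncu.csv` files above contain Nsight Compute error output rather than CSV rows, so any downstream parser would read them as empty or malformed. A minimal sketch of a check that could flag this before artifacts are committed is shown here. The helper names `ncu_status` and `scan_run_dir` are hypothetical and not part of the swordfish repo; the only assumption about `ncu` is the `==ERROR==` and `==WARNING==` line prefixes visible in the files above.

```python
from pathlib import Path


def ncu_status(text: str) -> dict:
    """Classify the text of one ncu output file (hypothetical helper)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Lines that ncu itself marked as errors.
    errors = [ln for ln in lines if ln.startswith("==ERROR==")]
    # ncu prints this warning when the capture produced no kernel data.
    no_kernels = any("No kernels were profiled" in ln for ln in lines)
    return {
        "ok": not errors and not no_kernels,
        "error_count": len(errors),
        "no_kernels": no_kernels,
        "first_error": errors[0] if errors else None,
    }


def scan_run_dir(run_dir: str) -> dict:
    """Map each *.ncu.csv file name in run_dir to its status."""
    return {
        p.name: ncu_status(p.read_text(errors="replace"))
        for p in sorted(Path(run_dir).glob("*.ncu.csv"))
    }


# Excerpt of the 8b-b1 output shown in this diff.
SAMPLE = """\
==PROF== Connected to process 639 (/usr/bin/python3.10)
==ERROR== An error was reported by the driver
==ERROR== Failed to profile "distribution_elementwise_grid..." in process 639
==WARNING== No kernels were profiled.
"""

status = ncu_status(SAMPLE)
print(status["ok"], status["error_count"], status["first_error"])
```

Running `scan_run_dir` on the run directory would then give one entry per shape. The error text itself points at a concurrent profiling tool (such as DCGM) holding the driver resource, which this sketch does not attempt to diagnose.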
26 changes: 26 additions & 0 deletions docs/profiling/marlin/20260420T014943Z/SUMMARY.md
@@ -0,0 +1,26 @@
# Autoresearch run `20260420T014943Z`

- **source SHA:** `20ab7f3`
- **GPU:** NVIDIA A100-SXM4-80GB (cc 8.0, 79.3 GB)
- **CUDA / torch / triton:** 12.4 / 2.4.0a0+07cecf4168.nv24.05 / 3.0.0
- **shapes:** `voice` **impls:** `fp16,marlin` **repeats:** 5
- **marlin SHA:** `1f25790bdd49fba53106164a24666dade68d7c90`

## Results

| shape | impl | ms_mean | ms_p95 | TFLOPS | speedup vs fp16 | error |
|---|---|---|---|---|---|---|
| 8b-b1 | fp16 | 0.031 | 0.033 | 1.1 | x1.00 | |
| 8b-b1 | marlin | 0.049 | 0.051 | 0.7 | x0.64 | |
| 8b-b4 | fp16 | 0.031 | 0.031 | 4.4 | x1.00 | |
| 8b-b4 | marlin | 0.050 | 0.050 | 2.7 | x0.61 | |
| 8b-b8 | fp16 | 0.031 | 0.032 | 8.6 | x1.00 | |
| 8b-b8 | marlin | 0.050 | 0.050 | 5.4 | x0.63 | |
| 70b-tp2-b1 | fp16 | 0.051 | 0.056 | 1.3 | x1.00 | |
| 70b-tp2-b1 | marlin | 0.049 | 0.050 | 1.4 | x1.02 | |
| 70b-tp2-b4 | fp16 | 0.049 | 0.050 | 5.4 | x1.00 | |
| 70b-tp2-b4 | marlin | 0.066 | 0.133 | 4.1 | x0.75 | |
| 70b-tp2-b8 | fp16 | 0.049 | 0.050 | 10.8 | x1.00 | |
| 70b-tp2-b8 | marlin | 0.049 | 0.049 | 10.9 | x1.01 | |

![roofline](./roofline.png)
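
Review note: the TFLOPS and speedup columns above appear consistent with the usual GEMM convention of `2 * M * N * K` floating-point operations divided by the mean latency, and with speedup taken as the ratio of mean latencies. That is an inference from the numbers, not something the summary states. A short sketch that reproduces the `8b-b1` rows under that assumption, using `M`, `N`, `K` and `ms_mean` from `manifest.json` (the function name `gemm_tflops` is hypothetical):

```python
def gemm_tflops(m: int, n: int, k: int, ms: float) -> float:
    """TFLOPS for an (m x k) @ (k x n) matmul, assuming 2*m*n*k FLOPs."""
    flops = 2.0 * m * n * k
    seconds = ms * 1e-3
    return flops / seconds / 1e12


# ms_mean values for shape 8b-b1 (M=1, N=4096, K=4096) from manifest.json.
fp16_ms = 0.03149350357055664
marlin_ms = 0.049491456031799316

fp16_tflops = gemm_tflops(1, 4096, 4096, fp16_ms)
marlin_tflops = gemm_tflops(1, 4096, 4096, marlin_ms)

# Assumed definition: speedup = fp16 mean latency / marlin mean latency.
speedup = fp16_ms / marlin_ms

print(f"fp16   {fp16_tflops:.1f} TFLOPS")
print(f"marlin {marlin_tflops:.1f} TFLOPS  x{speedup:.2f}")
```

The rounded values agree with the table rows for `8b-b1` (1.1, 0.7, x0.64).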
43 changes: 43 additions & 0 deletions docs/profiling/marlin/20260420T014943Z/env.txt
@@ -0,0 +1,43 @@
=== profile_marlin.sh @ 20260420T014943Z ===
--- host ---
Linux swordfish-profile-sf-prof-260420-014607-r5hlj 6.6.126.1-1.azl3 #1 SMP PREEMPT_DYNAMIC Wed Mar 4 05:04:40 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
--- nvidia-smi ---
Mon Apr 20 01:49:43 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 0000000E:00:00.0 Off | 0 |
| N/A 34C P0 70W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
--- nvcc ---
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
--- nsys ---
NVIDIA Nsight Systems version 2024.2.1.106-242134037904v0
--- ncu ---
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2024 NVIDIA Corporation
Version 2024.1.1.0 (build 33998838) (public-release)
--- python / torch / triton / marlin ---
python 3.10.12
torch 2.4.0a0+07cecf4168.nv24.05 cuda 12.4
triton 3.0.0
marlin unknown
--- repo SHA ---
20ab7f3e9e603f905be9e51b853e15d628b64241
223 changes: 223 additions & 0 deletions docs/profiling/marlin/20260420T014943Z/manifest.json
@@ -0,0 +1,223 @@
{
"env": {
"host": "swordfish-profile-sf-prof-260420-014607-r5hlj",
"os": "Linux 6.6.126.1-1.azl3",
"python": "3.10.12",
"torch": "2.4.0a0+07cecf4168.nv24.05",
"cuda_available": true,
"timestamp": "2026-04-20T01:49:49+0000",
"gpu_name": "NVIDIA A100-SXM4-80GB",
"gpu_cc": "8.0",
"gpu_mem_gb": 79.3,
"gpu_sm_count": 108,
"torch_cuda": "12.4",
"cudnn": 90100,
"triton": "3.0.0"
},
"rows": [
{
"name": "8b-b1",
"M": 1,
"N": 4096,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-8b",
"impl": "fp16",
"ms_mean": 0.03149350357055664,
"ms_p50": 0.031086080074310303,
"ms_p95": 0.03273663997650147,
"ms_min": 0.030670719146728517,
"tflops_mean": 1.0654397953795818,
"error": null,
"speedup_vs_fp16": 1.0
},
{
"name": "8b-b1",
"M": 1,
"N": 4096,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-8b",
"impl": "marlin",
"ms_mean": 0.049491456031799316,
"ms_p50": 0.049548802375793455,
"ms_p95": 0.05056960105895996,
"ms_min": 0.04887360095977783,
"tflops_mean": 0.6779843369013141,
"error": null,
"speedup_vs_fp16": 0.6363422314817611
},
{
"name": "8b-b4",
"M": 4,
"N": 4096,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-8b",
"impl": "fp16",
"ms_mean": 0.03054387187957764,
"ms_p50": 0.030387840270996093,
"ms_p95": 0.030972800254821777,
"ms_min": 0.03024319887161255,
"tflops_mean": 4.394260443769775,
"error": null,
"speedup_vs_fp16": 1.0
},
{
"name": "8b-b4",
"M": 4,
"N": 4096,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-8b",
"impl": "marlin",
"ms_mean": 0.04976857566833496,
"ms_p50": 0.04982399940490723,
"ms_p95": 0.05031424045562744,
"ms_min": 0.049259519577026366,
"tflops_mean": 2.6968368332348205,
"error": null,
"speedup_vs_fp16": 0.6137180232588221
},
{
"name": "8b-b8",
"M": 8,
"N": 4096,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-8b",
"impl": "fp16",
"ms_mean": 0.031084928035736082,
"ms_p50": 0.0308134388923645,
"ms_p95": 0.03185983896255493,
"ms_min": 0.03056960105895996,
"tflops_mean": 8.635550183400756,
"error": null,
"speedup_vs_fp16": 1.0
},
{
"name": "8b-b8",
"M": 8,
"N": 4096,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-8b",
"impl": "marlin",
"ms_mean": 0.04965529632568359,
"ms_p50": 0.04962175846099853,
"ms_p95": 0.05010816097259521,
"ms_min": 0.049258241653442385,
"tflops_mean": 5.405978331885517,
"error": null,
"speedup_vs_fp16": 0.626014349644668
},
{
"name": "70b-tp2-b1",
"M": 1,
"N": 8192,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-70b",
"impl": "fp16",
"ms_mean": 0.050512895584106446,
"ms_p50": 0.04921984195709229,
"ms_p95": 0.05594431877136231,
"ms_min": 0.04904191970825195,
"tflops_mean": 1.3285491402539071,
"error": null,
"speedup_vs_fp16": 1.0
},
{
"name": "70b-tp2-b1",
"M": 1,
"N": 8192,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-70b",
"impl": "marlin",
"ms_mean": 0.04932083320617676,
"ms_p50": 0.04902592182159424,
"ms_p95": 0.05009727954864502,
"ms_min": 0.048891520500183104,
"tflops_mean": 1.360659576034809,
"error": null,
"speedup_vs_fp16": 1.0241695506835111
},
{
"name": "70b-tp2-b4",
"M": 4,
"N": 8192,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-70b",
"impl": "fp16",
"ms_mean": 0.049335807800292966,
"ms_p50": 0.04932799816131592,
"ms_p95": 0.04952127933502197,
"ms_min": 0.049187841415405276,
"tflops_mean": 5.440986333630194,
"error": null,
"speedup_vs_fp16": 1.0
},
{
"name": "70b-tp2-b4",
"M": 4,
"N": 8192,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-70b",
"impl": "marlin",
"ms_mean": 0.06605465698242188,
"ms_p50": 0.04948544025421143,
"ms_p95": 0.1330508804321289,
"ms_min": 0.04902656078338623,
"tflops_mean": 4.063838467459374,
"error": null,
"speedup_vs_fp16": 0.7468937097564817
},
{
"name": "70b-tp2-b8",
"M": 8,
"N": 8192,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-70b",
"impl": "fp16",
"ms_mean": 0.049490047454833985,
"ms_p50": 0.04956799983978272,
"ms_p95": 0.0496127986907959,
"ms_min": 0.049237117767333985,
"tflops_mean": 10.848058137142898,
"error": null,
"speedup_vs_fp16": 1.0
},
{
"name": "70b-tp2-b8",
"M": 8,
"N": 8192,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-70b",
"impl": "marlin",
"ms_mean": 0.04924096012115479,
"ms_p50": 0.04936704158782959,
"ms_p95": 0.049479680061340334,
"ms_min": 0.04883967876434326,
"tflops_mean": 10.902933465940905,
"error": null,
"speedup_vs_fp16": 1.0050585393352673
}
]
}
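
Review note: since the summary links a roofline plot, it may help to see where these shapes sit on it. The sketch below estimates arithmetic intensity and effective memory traffic for the fp16 rows only. It assumes every operand is fp16 (2 bytes per element) and that A, B and C each move through memory exactly once; neither assumption is stated in the manifest, and the marlin rows would need a different byte count for their quantized weights and scales. The function name `fp16_gemm_traffic` is hypothetical.

```python
def fp16_gemm_traffic(m: int, n: int, k: int, ms: float) -> dict:
    """Rough roofline coordinates for an fp16 (m x k) @ (k x n) matmul.

    Assumes 2 bytes per element and one pass over A, B and C.
    """
    flops = 2.0 * m * n * k
    bytes_moved = 2.0 * (m * k + k * n + m * n)
    seconds = ms * 1e-3
    return {
        "intensity_flop_per_byte": flops / bytes_moved,
        "effective_gb_per_s": bytes_moved / seconds / 1e9,
    }


# 8b-b1 fp16 row from the manifest: M=1, N=4096, K=4096.
row = fp16_gemm_traffic(1, 4096, 4096, 0.03149350357055664)
print(f"{row['intensity_flop_per_byte']:.3f} FLOP/byte, "
      f"{row['effective_gb_per_s']:.0f} GB/s")
```

At roughly 1 FLOP/byte, the batch-1 shape sits far on the memory-bound side of the plot under these assumptions, so latency rather than TFLOPS is the more informative column for these rows.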
13 changes: 13 additions & 0 deletions docs/profiling/marlin/20260420T014943Z/results.csv
@@ -0,0 +1,13 @@
name,impl,M,N,K,group_size,priority,tag,ms_mean,ms_p50,ms_p95,ms_min,tflops_mean,speedup_vs_fp16,error
8b-b1,fp16,1,4096,4096,128,0,llama-3-8b,0.03149350357055664,0.031086080074310303,0.03273663997650147,0.030670719146728517,1.0654397953795818,1.0,
8b-b1,marlin,1,4096,4096,128,0,llama-3-8b,0.049491456031799316,0.049548802375793455,0.05056960105895996,0.04887360095977783,0.6779843369013141,0.6363422314817611,
8b-b4,fp16,4,4096,4096,128,0,llama-3-8b,0.03054387187957764,0.030387840270996093,0.030972800254821777,0.03024319887161255,4.394260443769775,1.0,
8b-b4,marlin,4,4096,4096,128,0,llama-3-8b,0.04976857566833496,0.04982399940490723,0.05031424045562744,0.049259519577026366,2.6968368332348205,0.6137180232588221,
8b-b8,fp16,8,4096,4096,128,0,llama-3-8b,0.031084928035736082,0.0308134388923645,0.03185983896255493,0.03056960105895996,8.635550183400756,1.0,
8b-b8,marlin,8,4096,4096,128,0,llama-3-8b,0.04965529632568359,0.04962175846099853,0.05010816097259521,0.049258241653442385,5.405978331885517,0.626014349644668,
70b-tp2-b1,fp16,1,8192,4096,128,0,llama-3-70b,0.050512895584106446,0.04921984195709229,0.05594431877136231,0.04904191970825195,1.3285491402539071,1.0,
70b-tp2-b1,marlin,1,8192,4096,128,0,llama-3-70b,0.04932083320617676,0.04902592182159424,0.05009727954864502,0.048891520500183104,1.360659576034809,1.0241695506835111,
70b-tp2-b4,fp16,4,8192,4096,128,0,llama-3-70b,0.049335807800292966,0.04932799816131592,0.04952127933502197,0.049187841415405276,5.440986333630194,1.0,
70b-tp2-b4,marlin,4,8192,4096,128,0,llama-3-70b,0.06605465698242188,0.04948544025421143,0.1330508804321289,0.04902656078338623,4.063838467459374,0.7468937097564817,
70b-tp2-b8,fp16,8,8192,4096,128,0,llama-3-70b,0.049490047454833985,0.04956799983978272,0.0496127986907959,0.049237117767333985,10.848058137142898,1.0,
70b-tp2-b8,marlin,8,8192,4096,128,0,llama-3-70b,0.04924096012115479,0.04936704158782959,0.049479680061340334,0.04883967876434326,10.902933465940905,1.0050585393352673,
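
Review note: the `70b-tp2-b4` marlin row has `ms_p95` well above `ms_p50` (about 0.133 ms against 0.049 ms), which with 5 repeats suggests a single slow iteration pulling up `ms_mean` and depressing the reported x0.75 speedup. A sketch of how such rows could be flagged from `results.csv`, with a median-based speedup alongside. The 1.5 threshold and the variable names are arbitrary choices for illustration, not values used by the chart.

```python
import csv
import io

# Header plus the two 70b-tp2-b4 rows, copied from results.csv above.
RESULTS_CSV = """\
name,impl,M,N,K,group_size,priority,tag,ms_mean,ms_p50,ms_p95,ms_min,tflops_mean,speedup_vs_fp16,error
70b-tp2-b4,fp16,4,8192,4096,128,0,llama-3-70b,0.049335807800292966,0.04932799816131592,0.04952127933502197,0.049187841415405276,5.440986333630194,1.0,
70b-tp2-b4,marlin,4,8192,4096,128,0,llama-3-70b,0.06605465698242188,0.04948544025421143,0.1330508804321289,0.04902656078338623,4.063838467459374,0.7468937097564817,
"""

P95_OVER_P50_LIMIT = 1.5  # hypothetical threshold for "noisy" rows

rows = list(csv.DictReader(io.StringIO(RESULTS_CSV)))

# Flag rows whose tail latency is far from the median.
flagged = [
    f"{r['name']}/{r['impl']}"
    for r in rows
    if float(r["ms_p95"]) / float(r["ms_p50"]) > P95_OVER_P50_LIMIT
]

# Median-based speedup is less sensitive to a single slow iteration.
p50 = {(r["name"], r["impl"]): float(r["ms_p50"]) for r in rows}
median_speedup = p50[("70b-tp2-b4", "fp16")] / p50[("70b-tp2-b4", "marlin")]

print(flagged)
print(f"median-based speedup: x{median_speedup:.2f}")
```

On medians the two implementations are close to parity for this shape, which is more in line with the neighbouring `70b-tp2-b1` and `70b-tp2-b8` rows.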