8 changes: 8 additions & 0 deletions docs/profiling/INDEX.md
@@ -0,0 +1,8 @@
# Autoresearch run index

One row per profiling run produced by the swordfish-autoresearch chart.
Newest first. PR column links to the draft PR carrying the artifacts.

| timestamp (UTC) | source SHA | shapes | impls | GPU | 8b-b1 marlin TFLOPS | run dir | PR |
|---|---|---|---|---|---|---|---|
| 20260420T015907Z | `d212cd6` | voice | fp16,marlin | NVIDIA A100-SXM4-80GB | 0.7 | [`20260420T015907Z/`](./marlin/20260420T015907Z/) | [link](https://github.com/chokevin/swordfish/pull/3) |
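Review note: a minimal sketch of how a row like the one above could be produced and kept newest-first. The helper names `index_row` and `prepend_row` are hypothetical and not taken from the swordfish repository; only the column order and cell formatting are copied from the table in this file.

```python
def index_row(timestamp, sha, shapes, impls, gpu, tflops, pr_url):
    """Format one INDEX.md row in the column order used by the table."""
    run_dir = f"[`{timestamp}/`](./marlin/{timestamp}/)"
    cells = [
        timestamp,
        f"`{sha[:7]}`",          # short SHA, as shown in the index
        shapes,
        ",".join(impls),
        gpu,
        f"{tflops:.1f}",         # 8b-b1 marlin TFLOPS, one decimal
        run_dir,
        f"[link]({pr_url})",
    ]
    return "| " + " | ".join(cells) + " |"


def prepend_row(index_text, row):
    """Insert `row` directly under the header separator (newest first)."""
    lines = index_text.splitlines()
    sep = next(i for i, line in enumerate(lines) if line.startswith("|---"))
    lines.insert(sep + 1, row)
    return "\n".join(lines) + "\n"
```

Inserting under the separator line, rather than appending, is what keeps the newest run at the top without re-sorting the file.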
Empty file.
10 changes: 10 additions & 0 deletions docs/profiling/marlin/20260420T015907Z/70b-tp2-b1.ncu.csv
@@ -0,0 +1,10 @@
==PROF== Connected to process 907 (/usr/bin/python3.10)

==ERROR== An error was reported by the driver

==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.
==ERROR== Failed to profile "distribution_elementwise_grid..." in process 907
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
2 changes: 2 additions & 0 deletions docs/profiling/marlin/20260420T015907Z/70b-tp2-b1.ncu.log
@@ -0,0 +1,2 @@
==ERROR== Option '--section SpeedOfLight --section SpeedOfLight_RooflineChart --section MemoryWorkloadAnalysis --section SchedulerStats --section WarpStateStats --section ComputeWorkloadAnalysis' did not match any section.
==ERROR== Use --list-sections to see the list of available sections.
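Review note: the error quotes all six `--section` flags as a single option, which may mean the whole string reached `ncu` as one argument (for example via a quoted shell variable). That is a guess from the log text, not something the artifacts confirm. A minimal sketch of building the command as a list, so each `--section NAME` pair is its own pair of arguments; the target command `python3 bench.py` is a hypothetical placeholder:

```python
import shlex

# Section names copied from the error message in the log above.
SECTIONS = [
    "SpeedOfLight",
    "SpeedOfLight_RooflineChart",
    "MemoryWorkloadAnalysis",
    "SchedulerStats",
    "WarpStateStats",
    "ComputeWorkloadAnalysis",
]


def ncu_argv(target_cmd, sections=SECTIONS, log_file=None):
    """Build an ncu argument list with one --section flag per section."""
    argv = ["ncu", "--csv"]
    for name in sections:
        argv += ["--section", name]   # two separate tokens per section
    if log_file is not None:
        argv += ["--log-file", log_file]
    return argv + list(target_cmd)


if __name__ == "__main__":
    # Hypothetical target; print the command instead of running it.
    print(shlex.join(ncu_argv(["python3", "bench.py"])))
```

Passing the resulting list to `subprocess.run` without `shell=True` avoids any re-quoting. Running `ncu --list-sections` on the profiling host, as the log suggests, would confirm which section identifiers this ncu version accepts.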
Empty file.
10 changes: 10 additions & 0 deletions docs/profiling/marlin/20260420T015907Z/70b-tp2-b4.ncu.csv
@@ -0,0 +1,10 @@
==PROF== Connected to process 996 (/usr/bin/python3.10)

==ERROR== An error was reported by the driver

==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.
==ERROR== Failed to profile "distribution_elementwise_grid..." in process 996
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
2 changes: 2 additions & 0 deletions docs/profiling/marlin/20260420T015907Z/70b-tp2-b4.ncu.log
@@ -0,0 +1,2 @@
==ERROR== Option '--section SpeedOfLight --section SpeedOfLight_RooflineChart --section MemoryWorkloadAnalysis --section SchedulerStats --section WarpStateStats --section ComputeWorkloadAnalysis' did not match any section.
==ERROR== Use --list-sections to see the list of available sections.
Empty file.
10 changes: 10 additions & 0 deletions docs/profiling/marlin/20260420T015907Z/70b-tp2-b8.ncu.csv
@@ -0,0 +1,10 @@
==PROF== Connected to process 1085 (/usr/bin/python3.10)

==ERROR== An error was reported by the driver

==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.
==ERROR== Failed to profile "distribution_elementwise_grid..." in process 1085
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
2 changes: 2 additions & 0 deletions docs/profiling/marlin/20260420T015907Z/70b-tp2-b8.ncu.log
@@ -0,0 +1,2 @@
==ERROR== Option '--section SpeedOfLight --section SpeedOfLight_RooflineChart --section MemoryWorkloadAnalysis --section SchedulerStats --section WarpStateStats --section ComputeWorkloadAnalysis' did not match any section.
==ERROR== Use --list-sections to see the list of available sections.
Empty file.
10 changes: 10 additions & 0 deletions docs/profiling/marlin/20260420T015907Z/8b-b1.ncu.csv
@@ -0,0 +1,10 @@
==PROF== Connected to process 640 (/usr/bin/python3.10)

==ERROR== An error was reported by the driver

==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.
==ERROR== Failed to profile "distribution_elementwise_grid..." in process 640
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
2 changes: 2 additions & 0 deletions docs/profiling/marlin/20260420T015907Z/8b-b1.ncu.log
@@ -0,0 +1,2 @@
==ERROR== Option '--section SpeedOfLight --section SpeedOfLight_RooflineChart --section MemoryWorkloadAnalysis --section SchedulerStats --section WarpStateStats --section ComputeWorkloadAnalysis' did not match any section.
==ERROR== Use --list-sections to see the list of available sections.
Empty file.
10 changes: 10 additions & 0 deletions docs/profiling/marlin/20260420T015907Z/8b-b4.ncu.csv
@@ -0,0 +1,10 @@
==PROF== Connected to process 729 (/usr/bin/python3.10)

==ERROR== An error was reported by the driver

==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.
==ERROR== Failed to profile "distribution_elementwise_grid..." in process 729
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
2 changes: 2 additions & 0 deletions docs/profiling/marlin/20260420T015907Z/8b-b4.ncu.log
@@ -0,0 +1,2 @@
==ERROR== Option '--section SpeedOfLight --section SpeedOfLight_RooflineChart --section MemoryWorkloadAnalysis --section SchedulerStats --section WarpStateStats --section ComputeWorkloadAnalysis' did not match any section.
==ERROR== Use --list-sections to see the list of available sections.
Empty file.
10 changes: 10 additions & 0 deletions docs/profiling/marlin/20260420T015907Z/8b-b8.ncu.csv
@@ -0,0 +1,10 @@
==PROF== Connected to process 818 (/usr/bin/python3.10)

==ERROR== An error was reported by the driver

==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.
==ERROR== Failed to profile "distribution_elementwise_grid..." in process 818
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
2 changes: 2 additions & 0 deletions docs/profiling/marlin/20260420T015907Z/8b-b8.ncu.log
@@ -0,0 +1,2 @@
==ERROR== Option '--section SpeedOfLight --section SpeedOfLight_RooflineChart --section MemoryWorkloadAnalysis --section SchedulerStats --section WarpStateStats --section ComputeWorkloadAnalysis' did not match any section.
==ERROR== Use --list-sections to see the list of available sections.
26 changes: 26 additions & 0 deletions docs/profiling/marlin/20260420T015907Z/SUMMARY.md
@@ -0,0 +1,26 @@
# Autoresearch run `20260420T015907Z`

- **source SHA:** `d212cd6`
- **GPU:** NVIDIA A100-SXM4-80GB (cc 8.0, 79.3 GB)
- **CUDA / torch / triton:** 12.4 / 2.4.0a0+07cecf4168.nv24.05 / 3.0.0
- **shapes:** `voice` **impls:** `fp16,marlin` **repeats:** 5
- **marlin SHA:** `1f25790bdd49fba53106164a24666dade68d7c90`

## Results

| shape | impl | ms_mean | ms_p95 | TFLOPS | speedup vs fp16 | error |
|---|---|---|---|---|---|---|
| 8b-b1 | fp16 | 0.032 | 0.034 | 1.1 | x1.00 | |
| 8b-b1 | marlin | 0.049 | 0.050 | 0.7 | x0.65 | |
| 8b-b4 | fp16 | 0.031 | 0.032 | 4.3 | x1.00 | |
| 8b-b4 | marlin | 0.049 | 0.050 | 2.7 | x0.64 | |
| 8b-b8 | fp16 | 0.032 | 0.032 | 8.5 | x1.00 | |
| 8b-b8 | marlin | 0.049 | 0.050 | 5.4 | x0.64 | |
| 70b-tp2-b1 | fp16 | 0.051 | 0.059 | 1.3 | x1.00 | |
| 70b-tp2-b1 | marlin | 0.050 | 0.051 | 1.4 | x1.03 | |
| 70b-tp2-b4 | fp16 | 0.049 | 0.050 | 5.4 | x1.00 | |
| 70b-tp2-b4 | marlin | 0.049 | 0.049 | 5.5 | x1.01 | |
| 70b-tp2-b8 | fp16 | 0.050 | 0.050 | 10.8 | x1.00 | |
| 70b-tp2-b8 | marlin | 0.048 | 0.049 | 11.1 | x1.02 | |

![roofline](./roofline.png)
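Review note: the TFLOPS and speedup columns appear consistent with the usual GEMM accounting of 2·M·N·K floating-point operations divided by the mean latency, and with speedup defined as fp16 latency over marlin latency. The sketch below reproduces the `8b-b1` values from `manifest.json` under that assumption; the function names are illustrative and not taken from the benchmark script.

```python
def gemm_tflops(M, N, K, ms):
    """TFLOPS for an M x K by K x N matmul that took `ms` milliseconds."""
    flops = 2 * M * N * K            # one multiply and one add per MAC
    return flops / (ms * 1e-3) / 1e12


def speedup(baseline_ms, ms):
    """Values above 1 mean faster than the baseline."""
    return baseline_ms / ms


# 8b-b1 latencies from manifest.json (M=1, N=4096, K=4096).
fp16_ms = 0.03189593601226807
marlin_ms = 0.049192959785461426

print(round(gemm_tflops(1, 4096, 4096, fp16_ms), 3))    # fp16 TFLOPS
print(round(gemm_tflops(1, 4096, 4096, marlin_ms), 3))  # marlin TFLOPS
print(round(speedup(fp16_ms, marlin_ms), 3))            # marlin vs fp16
```

Because the FLOP count is the same for both implementations, the speedup column is simply the ratio of the two `ms_mean` values.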
43 changes: 43 additions & 0 deletions docs/profiling/marlin/20260420T015907Z/env.txt
@@ -0,0 +1,43 @@
=== profile_marlin.sh @ 20260420T015907Z ===
--- host ---
Linux swordfish-profile-sf-prof-260420-015138-7xlss 6.6.126.1-1.azl3 #1 SMP PREEMPT_DYNAMIC Wed Mar 4 05:04:40 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
--- nvidia-smi ---
Mon Apr 20 01:59:07 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   0000000E:00:00.0 Off |                    0 |
| N/A   34C    P0             69W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
--- nvcc ---
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
--- nsys ---
NVIDIA Nsight Systems version 2024.2.1.106-242134037904v0
--- ncu ---
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2024 NVIDIA Corporation
Version 2024.1.1.0 (build 33998838) (public-release)
--- python / torch / triton / marlin ---
python 3.10.12
torch 2.4.0a0+07cecf4168.nv24.05 cuda 12.4
triton 3.0.0
marlin unknown
--- repo SHA ---
d212cd669477b14cb33a352af64ae27dee669634
223 changes: 223 additions & 0 deletions docs/profiling/marlin/20260420T015907Z/manifest.json
@@ -0,0 +1,223 @@
{
"env": {
"host": "swordfish-profile-sf-prof-260420-015138-7xlss",
"os": "Linux 6.6.126.1-1.azl3",
"python": "3.10.12",
"torch": "2.4.0a0+07cecf4168.nv24.05",
"cuda_available": true,
"timestamp": "2026-04-20T01:59:13+0000",
"gpu_name": "NVIDIA A100-SXM4-80GB",
"gpu_cc": "8.0",
"gpu_mem_gb": 79.3,
"gpu_sm_count": 108,
"torch_cuda": "12.4",
"cudnn": 90100,
"triton": "3.0.0"
},
"rows": [
{
"name": "8b-b1",
"M": 1,
"N": 4096,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-8b",
"impl": "fp16",
"ms_mean": 0.03189593601226807,
"ms_p50": 0.03144511938095093,
"ms_p95": 0.033960320949554444,
"ms_min": 0.030894720554351808,
"tflops_mean": 1.051997094146854,
"error": null,
"speedup_vs_fp16": 1.0
},
{
"name": "8b-b1",
"M": 1,
"N": 4096,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-8b",
"impl": "marlin",
"ms_mean": 0.049192959785461426,
"ms_p50": 0.04912576198577881,
"ms_p95": 0.04963967800140381,
"ms_min": 0.04897664070129395,
"tflops_mean": 0.6820982544318616,
"error": null,
"speedup_vs_fp16": 0.6483841621112346
},
{
"name": "8b-b4",
"M": 4,
"N": 4096,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-8b",
"impl": "fp16",
"ms_mean": 0.03144793605804443,
"ms_p50": 0.03157952070236206,
"ms_p95": 0.03182080030441284,
"ms_min": 0.031113600730895995,
"tflops_mean": 4.267934396466279,
"error": null,
"speedup_vs_fp16": 1.0
},
{
"name": "8b-b4",
"M": 4,
"N": 4096,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-8b",
"impl": "marlin",
"ms_mean": 0.04927884769439697,
"ms_p50": 0.04931519985198975,
"ms_p95": 0.04962495803833008,
"ms_min": 0.048726401329040527,
"tflops_mean": 2.7236377123172995,
"error": null,
"speedup_vs_fp16": 0.6381629751789037
},
{
"name": "8b-b8",
"M": 8,
"N": 4096,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-8b",
"impl": "fp16",
"ms_mean": 0.03154764842987061,
"ms_p50": 0.031646080017089843,
"ms_p95": 0.03203840017318726,
"ms_min": 0.03096640110015869,
"tflops_mean": 8.508889548351702,
"error": null,
"speedup_vs_fp16": 1.0
},
{
"name": "8b-b8",
"M": 8,
"N": 4096,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-8b",
"impl": "marlin",
"ms_mean": 0.04929036808013916,
"ms_p50": 0.0494700813293457,
"ms_p95": 0.04981311798095703,
"ms_min": 0.04864255905151367,
"tflops_mean": 5.44600226079793,
"error": null,
"speedup_vs_fp16": 0.6400367791650207
},
{
"name": "70b-tp2-b1",
"M": 1,
"N": 8192,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-70b",
"impl": "fp16",
"ms_mean": 0.051108479499816895,
"ms_p50": 0.049178881645202635,
"ms_p95": 0.059057278633117674,
"ms_min": 0.04889984130859375,
"tflops_mean": 1.3130671203051625,
"error": null,
"speedup_vs_fp16": 1.0
},
{
"name": "70b-tp2-b1",
"M": 1,
"N": 8192,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-70b",
"impl": "marlin",
"ms_mean": 0.04959603214263916,
"ms_p50": 0.049487361907958986,
"ms_p95": 0.05057280063629151,
"ms_min": 0.04831935882568359,
"tflops_mean": 1.353109535194138,
"error": null,
"speedup_vs_fp16": 1.0304953298043662
},
{
"name": "70b-tp2-b4",
"M": 4,
"N": 8192,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-70b",
"impl": "fp16",
"ms_mean": 0.04933452701568604,
"ms_p50": 0.04925183773040771,
"ms_p95": 0.04956480026245117,
"ms_min": 0.04922239780426026,
"tflops_mean": 5.4411275882841705,
"error": null,
"speedup_vs_fp16": 1.0
},
{
"name": "70b-tp2-b4",
"M": 4,
"N": 8192,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-70b",
"impl": "marlin",
"ms_mean": 0.04871065616607666,
"ms_p50": 0.04873280048370361,
"ms_p95": 0.04931136131286621,
"ms_min": 0.04817728042602539,
"tflops_mean": 5.510815848687854,
"error": null,
"speedup_vs_fp16": 1.0128076872436766
},
{
"name": "70b-tp2-b8",
"M": 8,
"N": 8192,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-70b",
"impl": "fp16",
"ms_mean": 0.04958604812622071,
"ms_p50": 0.04956863880157471,
"ms_p95": 0.049699201583862304,
"ms_min": 0.04951680183410644,
"tflops_mean": 10.827055841058382,
"error": null,
"speedup_vs_fp16": 1.0
},
{
"name": "70b-tp2-b8",
"M": 8,
"N": 8192,
"K": 4096,
"group_size": 128,
"priority": 0,
"tag": "llama-3-70b",
"impl": "marlin",
"ms_mean": 0.0484769287109375,
"ms_p50": 0.048438401222229005,
"ms_p95": 0.048990721702575686,
"ms_min": 0.04785344123840332,
"tflops_mean": 11.074771572293722,
"error": null,
"speedup_vs_fp16": 1.0228793251712947
}
]
}
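Review note: the Results table in `SUMMARY.md` looks derivable from the `rows` array above. A minimal sketch of that rendering, assuming three decimals for latencies, one for TFLOPS, and an `x`-prefixed two-decimal speedup (formats inferred from the table, not from the generator's source); `render_results` is a hypothetical name, and the embedded sample holds only two rows trimmed to the fields used.

```python
import json

# Two rows copied from manifest.json, reduced to the fields the table needs.
SAMPLE = json.loads("""
{"rows": [
  {"name": "8b-b1", "impl": "fp16", "ms_mean": 0.03189593601226807,
   "ms_p95": 0.033960320949554444, "tflops_mean": 1.051997094146854,
   "error": null, "speedup_vs_fp16": 1.0},
  {"name": "8b-b1", "impl": "marlin", "ms_mean": 0.049192959785461426,
   "ms_p95": 0.04963967800140381, "tflops_mean": 0.6820982544318616,
   "error": null, "speedup_vs_fp16": 0.6483841621112346}
]}
""")


def render_results(manifest):
    """Return the markdown lines of the Results table for a manifest dict."""
    lines = [
        "| shape | impl | ms_mean | ms_p95 | TFLOPS | speedup vs fp16 | error |",
        "|---|---|---|---|---|---|---|",
    ]
    for r in manifest["rows"]:
        err = r["error"] or ""       # null in JSON becomes an empty cell
        lines.append(
            f"| {r['name']} | {r['impl']} | {r['ms_mean']:.3f} "
            f"| {r['ms_p95']:.3f} | {r['tflops_mean']:.1f} "
            f"| x{r['speedup_vs_fp16']:.2f} | {err} |"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(render_results(SAMPLE)))
```

To check the full run, load the real file with `json.load(open("manifest.json"))` and compare the output against the committed table.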