Merged
12 changes: 12 additions & 0 deletions .gitignore
@@ -13,3 +13,15 @@
# OS noise
Thumbs.db
.DS_Store

# Internal tools (not shipped with the compiler; these are surprises for later ;-) )
/tools/

# AI assistants — keep your configs to yourself
.claude/
.cursorrules
.cursor/
.gemini/
.github/copilot/
CLAUDE.md
.aider*
20 changes: 17 additions & 3 deletions CONTRIBUTING.md
@@ -84,8 +84,14 @@ When I was a kid learning Lua on Roblox, I would actually copy and paste scripts

**What's not acceptable**

- **Generating code you don't understand** - When I was writing Callout, my Call and Dispatch engine (it's what emergency services use when dispatching a firetruck because you burnt toast and now the alarm is going off), I hit a wall. I know systems but had no idea how to properly add a button or a UI element. I found myself relying on my Ollama model too much and eventually couldn't understand what I was making. BarraCUDA requires bit-level precision as it emits machine code. If you want to submit a PR but don't understand a section of the codebase, that is fine; that's being human. You are more than welcome to submit a PR, even an incomplete one, and we can discuss tradeoffs and implementations. We are all learning. Learning is what makes us, us.
- **Wholly synthetic undeclared code** is something we'll have to send back or rework together. If you've used an LLM, just say so — declared LLM-assisted code that you understand and can defend is absolutely fine. The copyright picture is genuinely unsettled though, so occasionally I might ask you to rewrite a section from scratch. Here's why there's caution:
- **Licence contamination** — LLM training data can include proprietary or incompatibly-licensed code. If it leaks into a PR, it poisons the Apache 2.0 licence for the whole project.
- **Copyright** — the wonderful folks in that small outfit known as "The United States Federal Government" have ruled that a human has to substantially author or alter code for it to be copyrightable. Unaltered LLM output may not be copyrightable at all, which means it can't actually be licensed under Apache 2.0. Now I'm not in the US, I'm in New Zealand, and our laws are actually more reasonable, but US lawyers aren't exactly well known for their geography knowledge.
- **Quality** — this is a compiler that emits GPU machine code. One wrong bit is silent data corruption. You need to understand what you're shipping.

The short version: declare your tools, understand your code, and if I ask you to rewrite something it's not personal — it's just the reality of shipping code in a world where the law hasn't caught up yet. Treat LLM output like a snippet from Stack Overflow — make sure you can explain it. And if you're struggling, that's okay. Mark a PR as a draft, raise a discussion, and I'll help when I can.
- **Architecture** - As above, please don't make architectural decisions using a chatbot. If you're making a big change anyway, feel free to contact me; I'm always happy to chat and open to new ideas.

## On the Mighty Emdash
You'll notice emdashes everywhere — in comments, in commit messages, in this document. I've been drawing hyphens a bit too long since primary school. I have my old books from when I was seven, and there they are — emdashes. Or hyphens. Or maybe I just didn't know what I was writing. 7-year-old me didn't leave a comment.
@@ -94,6 +100,14 @@ The point is: use them. They're better than parentheses for asides, better than

The hate for emdashes is superficial and weird. I understand it's because of LLMs. But I'm not going to let some robot dictate how I share my own thoughts in my own language.

## Indigenous and Endangered Languages

The first non-English language in BarraCUDA's error messages was te reo Māori. Not because it was strategic, but because I live here, these are my neighbours, and this is one of the three official languages of Aotearoa New Zealand. You can run `barracuda --lang lang/mi.txt` right now and get your errors in te reo. Kia ora, GPU.

There are roughly 7,000 languages spoken on Earth. About 40% of them are endangered. When those languages disappear from technology — when every error message, every man page, every compiler diagnostic is English-only — it sends a quiet message: this isn't for you. That matters more than most developers realise. If your tools don't speak your language, you're not just reading code, you're translating infrastructure. The research on cognitive load is absolutely clear: that translation costs time, costs accuracy, and costs people who would have been brilliant engineers. When you're reading one of my glorious Abend dumps because you decided it would be a great idea to chuck your MEGAGPT10 onto your RDNA 2 stick, you should at least be able to read the consequences of your bad decisions in your own tongue.

Indigenous and endangered languages are especially welcome here. Te reo Māori, Welsh, Hawaiian, Navajo, Scots Gaelic, Samoan, any of the hundreds of languages that technology has quietly decided don't matter — if you want to see your language in a compiler diagnostic, this is that project. The translation format is dead simple and needs zero compiler knowledge, which also makes it a neat entry point into compiler development! See the Where to Help section below for how.
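To show how low the barrier is, here's an invented illustration of what a message-catalogue file could look like. This is a hypothetical sketch, not the actual schema: check the real `lang/mi.txt` in the repo before writing one, since the keys and layout below are made up for this example.

```text
# Hypothetical sketch only; the real format is whatever lang/mi.txt uses.
# One message per line: key = translated text, %s filled in by the compiler.
err.parse.unexpected = Kupu ohorere: '%s'
err.type.mismatch    = Kāore ngā momo e ōrite: '%s' me '%s'
```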

## Where to Help

Check `Issues` for current priorities. In general the most impactful areas are:
@@ -143,7 +157,7 @@ llvm-objdump -d --mcpu=gfx1100 output.hsaco

## License

BarraCUDA is Apache 2.0. By submitting a PR, you agree your contribution is licensed under the same terms.
BarraCUDA is Apache 2.0. By submitting a PR, you agree your contribution is licensed under the same terms and you represent that you have the right to do so — meaning the code is your own work, or derived from compatibly-licensed sources, and not copied from proprietary material.

---

4 changes: 2 additions & 2 deletions Makefile
@@ -14,7 +14,7 @@ LIBS = -lm
SOURCES = src/main.c \
src/fe/bc_err.c src/fe/preproc.c src/fe/lexer.c src/fe/parser.c src/fe/sema.c \
src/ir/bir.c src/ir/bir_print.c src/ir/bir_lower.c src/ir/bir_mem2reg.c src/ir/bir_cfold.c src/ir/bir_dce.c \
-	src/amdgpu/isel.c src/amdgpu/emit.c src/amdgpu/encode.c src/amdgpu/enc_tab.c src/amdgpu/sched.c src/amdgpu/verify.c \
+	src/amdgpu/amd_rplan.c src/amdgpu/isel.c src/amdgpu/emit.c src/amdgpu/ra_ssa.c src/amdgpu/encode.c src/amdgpu/enc_tab.c src/amdgpu/sched.c src/amdgpu/verify.c \
src/tensix/isel.c src/tensix/emit.c src/tensix/coarsen.c src/tensix/datamov.c
OBJECTS = $(SOURCES:.c=.o)
TARGET = barracuda
@@ -39,7 +39,7 @@ TSRC = tests/tmain.c tests/tsmoke.c tests/tcomp.c tests/tenc.c \
tests/tregalloc.c
TOBJS = $(TSRC:.c=.o)
COBJS = src/ir/bir.o src/ir/bir_print.o src/ir/bir_lower.o src/ir/bir_mem2reg.o src/ir/bir_cfold.o src/ir/bir_dce.o \
-	src/amdgpu/encode.o src/amdgpu/enc_tab.o src/amdgpu/isel.o src/amdgpu/emit.o src/amdgpu/sched.o src/amdgpu/verify.o \
+	src/amdgpu/amd_rplan.o src/amdgpu/encode.o src/amdgpu/enc_tab.o src/amdgpu/isel.o src/amdgpu/emit.o src/amdgpu/ra_ssa.o src/amdgpu/sched.o src/amdgpu/verify.o \
src/fe/bc_err.o src/fe/lexer.o src/fe/parser.o src/fe/preproc.o src/fe/sema.o \
src/runtime/bc_abend.o

103 changes: 103 additions & 0 deletions spill_analysis.txt
@@ -0,0 +1,103 @@
Divergence-Aware SSA Register Allocator — Spill Analysis
=========================================================

Date: 2026-03-14
Kernel: Moa transport kernel (gpu/tp_kern.cu)
Monte Carlo neutron transport, 654 source lines
253 MIR blocks, 4035 virtual registers (3933V 102S)
3854 divergent VGPRs, 79 uniform VGPRs
Target: GFX942 (CDNA3, MI300X), Wave64

The new SSA register allocator (ra_ssa) eliminates ALL 186 VGPR spills
on the Moa transport kernel. Total scratch traffic drops by 78%.
Total emitted instruction count drops from 9,448 to 6,761 (28.4%).

On Wave64 hardware, spilling a divergent VGPR costs 64 dwords of scratch per wave
(one dword per lane). Spilling a uniform VGPR costs 1 dword via v_readfirstlane.
The old allocator treated all spills equally; the new one exploits the 64:1 cost ratio.
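The arithmetic behind that ratio can be sketched in a few lines (illustrative Python, not part of the toolchain; the helper name is invented):

```python
# Scratch dwords needed to spill one 32-bit virtual register.
# A divergent VGPR holds a different value in every lane, so all 64
# Wave64 lanes must store their own dword. A uniform VGPR carries the
# same value in every lane, so v_readfirstlane collapses it to one
# scalar store.
def spill_dwords(divergent, wave_size=64):
    return wave_size if divergent else 1

assert spill_dwords(True) == 64    # 256 bytes of scratch per spill
assert spill_dwords(False) == 1    # 4 bytes of scratch per spill
```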

Old (ra_gc) SSA (ra_ssa) Change
----------- ----------- -------
VGPR spills 186 0 -100% (HOLY!)
SGPR spills 21 29 +38%
Total spills 207 29 -86%
Scratch ops (store+load) 1,754 392 -78%
Scratch bytes 1,396 272 -81%
VGPRs used 250 237 -5%
SGPRs used 102 102 0%
Total emitted instructions 9,448 6,761 -28%
v_readfirstlane (SGPR path) 0 39 new

The 29 SGPR spills are cheap (4 bytes each via VGPR relay to scratch).
The 39 v_readfirstlane instructions are the scalar extraction path
for SGPR spill/reload; each replaces what would have been a 256-byte
per-lane scratch store on the old allocator.

Algorithm
---------
1. CFG + Cooper et al. (2001) iterative dominator tree
2. Loop nesting depth (exponential weighting, Braun & Hack 2009)
3. SSA liveness with PHI-aware dataflow + exec-mask region extension
4. Divergence-aware spill cost: cost(v) = Σ depth_weight × div_weight
- div_weight = 64 for divergent VGPRs (Wave64 scratch cost)
- div_weight = 1 for uniform VGPRs (readfirstlane to scalar)
- div_weight = 1 for SGPRs (already scalar)
5. Rematerialisation detection (immediate loads → cost 0)
6. SSA coloring: domtree preorder, backward scan, greedy lowest-color
- Precoloring for intra-block interference resolution
- Divergence-weighted spill victim selection on pressure overflow
7. Spill codegen with 4 paths:
A. Remat (0 bytes scratch, 1 instruction)
B. Uniform VGPR: v_readfirstlane → scratch (4 bytes)
C. Divergent VGPR: full per-lane scratch (wave_width × 4 bytes)
D. SGPR: v_mov to relay → scratch (4 bytes)
8. Post-RA phi elimination with free coalescing (same color = no copy)
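Under those definitions, step 4 can be sketched as follows (illustrative Python; the 10**depth loop weight is an assumed Braun & Hack-style constant, not necessarily the one ra_ssa uses):

```python
# Divergence-aware spill cost: sum over a vreg's uses of
# loop-depth weight times divergence weight (the Wave64 64:1 ratio).
def spill_cost(use_loop_depths, divergent_vgpr):
    div_weight = 64 if divergent_vgpr else 1
    return sum((10 ** d) * div_weight for d in use_loop_depths)

# A uniform value used twice at loop depth 2 is still far cheaper to
# spill than a divergent value used once at the same depth:
assert spill_cost([2, 2], divergent_vgpr=False) < spill_cost([2], divergent_vgpr=True)
```

On pressure overflow, step 6's victim selection presumably evicts the live value with the lowest such cost.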

All static memory (~30 MB), no malloc. ~1,300 lines of C99.
Operates on SSA form before phi elimination — free PHI coalescing.
Fallback to ra_gc/ra_lin for functions exceeding pool limits.

Enabled via: barracuda --ssa-ra

Files
-----
src/amdgpu/ra_ssa.c — allocator implementation (approx. 1,300 lines)
src/amdgpu/ra_ssa.h — public interface
src/amdgpu/amdgpu.h — vr_divg[] bitvector, shared helpers
src/amdgpu/isel.c — divergence propagation to per-vreg bitvector
src/amdgpu/emit.c — SSA dispatch, un-static shared helpers
src/main.c — --ssa-ra flag

References
----------
Sampaio, D., Souza, R. M. de, Collange, S., & Pereira, F. M. Q. (2013).
Divergence analysis. ACM TOPLAS 35(4), Article 13, 1-36.
https://doi.org/10.1145/2523815

Cooper, K. D., Harvey, T. J., & Kennedy, K. (2001).
A simple, fast dominance algorithm.
Software Practice and Experience, 4, 1-10.

Braun, M., & Hack, S. (2009).
Register spilling and live-range splitting for SSA-form programs.
CC 2009, LNCS 5501, pp. 174-189.
https://doi.org/10.1007/978-3-642-00722-4_13

Yes, I used Zotero because I always seem to miss something in APA 7th lol.

Next Steps
----------
- Run on MI300X hardware
barracuda --ssa-ra --amdgpu-bin --gfx942 gpu/tp_kern.cu
then: kahu a.hsaco --all
then: run Godiva benchmark, verify k_eff matches CPU (0.995 ± 0.001)
- If kernel runs correctly, benchmark against ra_gc binary
- Expected: significant speedup from 78% scratch reduction
(Sampaio et al. report 26.21% speedup on 395 CUDA kernels)

Test Status
-----------
- 90/91 tests pass (1 skipped, same as before so no change there)
- vector_add: 6 VGPRs, 0 spills, 0 scratch (both RDNA3 and CDNA3)
- Moa kernel: binary generation succeeds (a.hsaco, 56 KB)
- No verifier errors
144 changes: 144 additions & 0 deletions src/amdgpu/amd_rplan.c
@@ -0,0 +1,144 @@
#include "amdgpu.h"
#include <string.h>
#include <stdio.h>

/*
* Resource planning pass for AMDGPU kernels.
*
* Scans BIR, stamps target-specific constants onto mfunc_t.
* Downstream passes read decisions, never ask questions.
* Modelled on tensix/coarsen.c — same philosophy, different planet.
*
* The GPU hardware has strong opinions and no sense of humour.
* Our comments compensate.
*/

/* ---- Scan Statistics ---- */

typedef struct {
    uint32_t tid[3];   /* thread ID uses per dim */
    uint32_t bid[3];   /* block ID uses per dim */
    uint32_t n_alloca; /* scratch allocations */
    uint32_t n_loads;
    uint32_t n_stores;
    uint32_t n_barr;   /* barriers */
    uint32_t n_atom;   /* atomics */
    uint32_t n_shfl;   /* warp shuffles */
    uint8_t  max_dim;  /* highest dim used */
    uint8_t  has_disp; /* uses blockDim/gridDim */
} rp_stat_t;

/* ---- BIR Scan ---- */

/* Walk the BIR for one kernel, count what it uses.
* Like a building inspector, except the building is made of
* register pressure and false confidence. */
static void rp_scan(const bir_module_t *bir, const bir_func_t *F,
                    rp_stat_t *st)
{
    memset(st, 0, sizeof(*st));
    int guard = 262144;

    for (uint32_t bi = 0; bi < F->num_blocks && guard > 0; bi++, guard--) {
        const bir_block_t *B = &bir->blocks[F->first_block + bi];

        for (uint32_t ii = 0; ii < B->num_insts && guard > 0; ii++, guard--) {
            const bir_inst_t *I = &bir->insts[B->first_inst + ii];
            uint32_t dim = I->subop < 3 ? I->subop : 0;

            switch (I->op) {
            case BIR_THREAD_ID:
                st->tid[dim]++;
                if (dim > st->max_dim) st->max_dim = (uint8_t)dim;
                break;
            case BIR_BLOCK_ID:
                st->bid[dim]++;
                if (dim > st->max_dim) st->max_dim = (uint8_t)dim;
                break;
            case BIR_BLOCK_DIM:
            case BIR_GRID_DIM:
                st->has_disp = 1;
                if (dim > st->max_dim) st->max_dim = (uint8_t)dim;
                break;
            case BIR_ALLOCA: st->n_alloca++; break;
            case BIR_LOAD:   st->n_loads++;  break;
            case BIR_STORE:  st->n_stores++; break;
            case BIR_BARRIER:
            case BIR_BARRIER_GROUP: st->n_barr++; break;

            case BIR_ATOMIC_ADD: case BIR_ATOMIC_SUB:
            case BIR_ATOMIC_AND: case BIR_ATOMIC_OR: case BIR_ATOMIC_XOR:
            case BIR_ATOMIC_MIN: case BIR_ATOMIC_MAX:
            case BIR_ATOMIC_XCHG: case BIR_ATOMIC_CAS:
            case BIR_ATOMIC_LOAD: case BIR_ATOMIC_STORE:
                st->n_atom++; break;

            case BIR_SHFL: case BIR_SHFL_UP:
            case BIR_SHFL_DOWN: case BIR_SHFL_XOR:
            case BIR_BALLOT: case BIR_VOTE_ANY: case BIR_VOTE_ALL:
                st->n_shfl++; break;

            default: break;
            }
        }
    }
    if (st->max_dim > 2) st->max_dim = 2;
}

/* ---- Target Allocation ---- */

/* Stamp target-specific constants. Every field here is a decision
* that used to be an is_cdna() call scattered across isel and emit.
* Now the argument happens once, up front, and everyone else just
* reads the memo. */
static void rp_alloc(mfunc_t *MF, amd_target_t tgt, const rp_stat_t *st)
{
    int cdna = (tgt <= AMD_TARGET_GFX942);

    MF->exec_w  = cdna ? 1 : 0;
    MF->smem_hz = cdna ? 1 : 0;
    MF->scr_afs = cdna ? 1 : 0;
    MF->rp_pad  = 0;
    MF->imp_sgp = cdna ? 6 : 0;
    MF->sgp_min = cdna ? 2 : 0;
    MF->wavefront_size = cdna ? AMD_WAVE64 : AMD_WAVE_SIZE;

    /* RSRC1 static mode bits — computed once, OR'd into rsrc1 by emit.
     * The per-kernel VGPR/SGPR block fields come from regalloc and
     * get combined separately. Target constants and per-kernel
     * arithmetic stay in their proper lanes, like civilised traffic. */
    MF->r1_mode = (3u << 16) | /* FLOAT_DENORM_32 = preserve */
                  (3u << 18) | /* FLOAT_DENORM_16_64 = preserve */
                  (1u << 21) | /* DX10_CLAMP */
                  (1u << 23);  /* IEEE_MODE */
    if (!cdna) {
        MF->r1_mode |= (1u << 26) | /* WGP_MODE */
                       (1u << 27);  /* MEM_ORDERED */
    }

    /* Stamp dispatch/dim info from scan */
    MF->needs_dispatch = st->has_disp;
    MF->max_dim = st->max_dim;
}

/* ---- Public Entry ---- */

void amd_rplan(amd_module_t *A)
{
    int guard = 8192;
    for (uint32_t fi = 0; fi < A->num_mfuncs && guard > 0; fi++, guard--) {
        mfunc_t *MF = &A->mfuncs[fi];
        if (!MF->is_kernel) continue;

        rp_stat_t st;
        rp_scan(A->bir, &A->bir->funcs[MF->bir_func], &st);
        rp_alloc(MF, A->target, &st);

        const char *name = A->bir->strings + MF->name;
        printf("  rplan %s: wave%u, sgp_imp=%u, scratch=%s, "
               "%u loads, %u stores\n",
               name, MF->wavefront_size, MF->imp_sgp,
               st.n_alloca ? "yes" : "no",
               st.n_loads, st.n_stores);
    }
}