- **S**treamlined
- **T**ensor
- **E**xecution
- **E**ngine
- **L**ibrary
STEEL is a fun experiment in writing essentially pure C++ to build the most elegant tensor/ML library I can, while understanding how such libraries are implemented: from the math, to cache-aware efficient algorithms, to good ML practices.
```
cmake -S . -B build
cmake --build build -j$(nproc)
```

- `build/matrix_bench`: low-level matrix multiplication benchmark
- `build/qwen_infer`: interactive Qwen2 inference
- `build/steel_bench`: inference benchmark (prefill / decode / end-to-end)
- `build/steel_tests`: unit tests
```
./build/steel_bench --model qwen2.5-0.5b-instruct-fp16.gguf --threads 8
```

Options:
| Flag | Default | Description |
|---|---|---|
| `--model` | `qwen2.5-0.5b-instruct-fp16.gguf` | Path to GGUF model |
| `--threads` | auto | CPU threads |
| `--decode-tokens` | 64 | Tokens to generate per decode test |
| `--warmup` | 1 | Warmup iterations |
| `--iters` | 3 | Benchmark iterations (use 5 to match llama-bench) |
Reports mean ± stddev and best tok/s for prefill, decode, and end-to-end generation.
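
For readers curious how a "mean ± stddev and best tok/s" summary can be derived from per-iteration timings, here is a minimal sketch. It is not taken from the STEEL sources: the `ThroughputStats` struct and `summarize` function are hypothetical names, and the choice of the sample standard deviation (n - 1 denominator) is an assumption, since the benchmark may well use the population form instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical summary type for illustration; not part of STEEL.
struct ThroughputStats {
    double mean;    // average tok/s across timed iterations
    double stddev;  // sample standard deviation of tok/s (assumed n - 1 denominator)
    double best;    // highest tok/s observed in any single iteration
};

// tokens:  number of tokens processed in each timed iteration
// seconds: wall-clock duration of each timed iteration (warmup runs excluded)
inline ThroughputStats summarize(int tokens, const std::vector<double>& seconds) {
    assert(tokens > 0 && !seconds.empty());

    // Convert each timing into a throughput sample.
    std::vector<double> rates;
    rates.reserve(seconds.size());
    for (double s : seconds) {
        assert(s > 0.0);
        rates.push_back(tokens / s);
    }

    double sum = 0.0;
    for (double r : rates) sum += r;
    const double mean = sum / static_cast<double>(rates.size());

    // Sum of squared deviations from the mean.
    double sq = 0.0;
    for (double r : rates) sq += (r - mean) * (r - mean);

    // With a single iteration there is no spread to report.
    const double stddev =
        rates.size() > 1 ? std::sqrt(sq / static_cast<double>(rates.size() - 1)) : 0.0;

    const double best = *std::max_element(rates.begin(), rates.end());
    return {mean, stddev, best};
}
```

Computing the statistics over per-iteration throughput (rather than over raw times) keeps the reported mean and best value in the same unit, tok/s. With the default `--iters 3` the standard deviation rests on only three samples, so more iterations give a steadier estimate.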