nn2 Architecture
Source · Apache 2.0 · 520 KB binary · zero dependencies
nn2 runs YOLOv8 detection and MiniFASNet anti-spoof inference 1.5–2×
faster than ONNX Runtime on the same x86 CPU, in 520 KB of pure C with
no Python, no shared libraries, and no GPU. The same source compiles for
ARM NEON (Apple Silicon, Raspberry Pi); ESP32-P4 and iMX NPU backends are
in PR #3 from @navado.
nn2/src/
gemm.c — column-panel SGEMM, MR=6 NR=32 inner kernel
gemm_avx512.S — hand-written AVX-512 6×32 microkernel
gemm_int8.c — INT8 GEMM with vpmaddubsw / vpdpbusd (VNNI)
gemm_neon.h — AArch64 NEON fallback
conv.c — top-level conv dispatcher
conv_implicit.c — implicit im2col for 3×3 s=1 (zero buffer)
conv_tiled.c — KC-blocking for cache locality
conv_fused.c — fused conv + bias + activation (saves a pass)
winograd.c — F(2,3) Winograd for 3×3 convs
ops.c — depthwise conv, SiLU, sigmoid, maxpool, upsample
antispoof_ops.c — PReLU, GAP, channel mul (added for MiniFASNet)
decode.c — DFL decode + greedy NMS for YOLO outputs
pool.c — lock-free thread pool (WaitOnAddress / futex)
net.c — YOLOv8n forward pass (zero-copy C2f)
minifasnet.c — MiniFASNet forward (1×1 expand / 3×3 DW / SE)
nvr_prod.c — production NVR server with embedded web UI
| Step | Latency | Speedup (vs. previous step; last row is cumulative) |
|---|---|---|
| Naive C loop | 917.0 ms | 1× |
| AVX2 GEMM | 120.0 ms | 7.6× |
| AVX-512 6×32 | 44.0 ms | 2.7× |
| Fused bias + SiLU | 26.0 ms | 1.7× |
| Parallel im2col | 16.0 ms | 1.6× |
| A-packing | 13.0 ms | 1.2× |
| Implicit im2col | 12.6 ms | 1.03× |
| AVX-512 maxpool | 8.9 ms | 1.4× |
| KC-blocking + 2D tiling | 8.5 ms | 1.05× |
| Fused residual | 8.1 ms | 1.05× |
| Final | 8.1 ms | 113× |
vs ONNX Runtime 1.23 on the same machine: 12.7 ms → 8.5 ms = 1.5× faster.
Smart NVR pipeline (motion gate + Kalman tracking): 0.56 ms average per frame across cameras, so one Intel i5-11500 core decodes + detects on ~70 IP cameras simultaneously.
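The camera count follows from simple throughput arithmetic. The sketch below reproduces it under one assumption that is not stated in the source: each camera delivers 25 frames per second. The function name is hypothetical and only illustrates the calculation.

```c
/* Number of cameras one core can sustain, given the average processing
 * cost per frame (in milliseconds) and the per-camera frame rate.
 * The 25 fps used in the note below is an assumption, not a measured value. */
int max_cameras_per_core(double avg_ms_per_frame, double fps)
{
    double frames_per_second = 1000.0 / avg_ms_per_frame; /* core throughput */
    return (int)(frames_per_second / fps);                /* round down */
}
```

With 0.56 ms per frame and 25 fps, the core handles about 1785 frames per second, which is roughly 71 cameras and consistent with the ~70 figure above.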
Same approach, different model. We added these AVX-512 ops to support the MobileFaceNet-style architecture of MiniFASNet:
- `nn2_prelu`: per-channel parametric ReLU
- `nn2_global_avg_pool`: for the SE squeeze step
- `nn2_channel_mul`: for SE excitation
- `nn2_add_inplace`: for residual connections
- `nn2_linear`: fully-connected layer with bias
- `nn2_softmax`
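For readers unfamiliar with these ops, here is a minimal scalar reference of three of them. It is only a sketch: the `_ref` names, the signatures, and the CHW float layout are assumptions made for illustration, and the actual nn2 functions are AVX-512 implementations whose interfaces may differ.

```c
#include <stddef.h>

/* Per-channel PReLU: y = x if x > 0, otherwise slope[c] * x.
 * Assumed layout: CHW, hw = height * width elements per channel. */
void prelu_ref(float *x, const float *slope, size_t channels, size_t hw)
{
    for (size_t c = 0; c < channels; ++c)
        for (size_t i = 0; i < hw; ++i) {
            float v = x[c * hw + i];
            x[c * hw + i] = v > 0.0f ? v : slope[c] * v;
        }
}

/* Global average pool: one mean per channel (the SE "squeeze"). */
void global_avg_pool_ref(const float *x, float *out, size_t channels, size_t hw)
{
    for (size_t c = 0; c < channels; ++c) {
        float sum = 0.0f;
        for (size_t i = 0; i < hw; ++i)
            sum += x[c * hw + i];
        out[c] = sum / (float)hw;
    }
}

/* Channel multiply: scale every element of channel c by gate[c]
 * (the SE "excitation"). */
void channel_mul_ref(float *x, const float *gate, size_t channels, size_t hw)
{
    for (size_t c = 0; c < channels; ++c)
        for (size_t i = 0; i < hw; ++i)
            x[c * hw + i] *= gate[c];
}
```

In a real SE block the pooled vector would pass through two small fully-connected layers and a sigmoid before being used as the gate; that step is omitted here to keep the sketch short.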
Result on the V2 + V1SE ensemble (the production setup):
| Engine | Single model | Ensemble | nn2 speedup over engine (ensemble) |
|---|---|---|---|
| nn2 | 0.70 ms | 1.43 ms | – |
| ONNX Runtime 1.23 | 1.33 ms | 2.92 ms | 2.03× |
Predictions are byte-identical to ONNX and PyTorch on sample images. The 2× gap on MiniFASNet is wider than on YOLO because MiniFASNet has more small-matrix ops, where ORT's per-op dispatch overhead dominates.
- MR=6 / NR=32 microkernel: the inner GEMM block is 6 rows of A times 32 columns of B, accumulated in 12 ZMM registers (6 rows × 2 registers of 16 floats each). That occupies 12 of the 32 AVX-512 registers; the remaining registers hold B loads and prefetch.
- A-packing at load time: convolution weights are pre-packed into the BLIS-style A panel layout once, at engine init. Inference reads sequentially without any stride/lda gymnastics.
- Implicit im2col for 3×3 s=1: instead of materializing the im2col matrix, the conv kernel loads 9 strided columns directly from the input feature map. Saves the im2col buffer pass entirely on the most common conv shape.
- Fused bias+SiLU+residual: the final accumulator pass writes the conv output through bias, activation, and a residual add in a single memory traversal. Cuts L2 round-trips by 3×.
- Lock-free thread pool: futex on Linux, WaitOnAddress on Windows. ~10 ns wake latency, scales to all cores cleanly. Each conv channel range is one job.
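To make the MR=6 / NR=32 register blocking concrete, here is a portable scalar sketch of such a microkernel. It is not the hand-written AVX-512 kernel from `gemm_avx512.S`; the function name and the exact packed layouts of `a` and `b` are assumptions chosen for illustration. The 6×32 accumulator tile is 192 floats, which is what fits in 12 ZMM registers of 16 floats each.

```c
#include <stddef.h>

enum { MR = 6, NR = 32 };

/* C[MR x NR] += A_panel * B_panel, accumulated over k steps.
 * a: packed so the MR values for step p are at a[p*MR .. p*MR + MR-1]
 * b: packed so the NR values for step p are at b[p*NR .. p*NR + NR-1]
 * c: row-major output tile with leading dimension ldc.
 * Both packed layouts are assumptions for this sketch. */
void microkernel_6x32_ref(size_t k, const float *a, const float *b,
                          float *c, size_t ldc)
{
    float acc[MR][NR] = {{0}};  /* 192 floats: 12 ZMM registers' worth */

    for (size_t p = 0; p < k; ++p)
        for (int i = 0; i < MR; ++i) {
            float ai = a[p * MR + i];           /* one A element, "broadcast" */
            for (int j = 0; j < NR; ++j)
                acc[i][j] += ai * b[p * NR + j]; /* rank-1 update of the tile */
        }

    /* Write the tile back once, after all k steps. */
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            c[i * ldc + j] += acc[i][j];
}
```

Because both panels are read strictly sequentially, this loop structure also shows why pre-packing A at load time pays off: the inner loop needs no stride arithmetic at all.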
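The implicit im2col idea can also be shown in a few lines. The sketch below is a scalar reference under stated assumptions (CHW layout, no padding, hypothetical function name), not the code in `conv_implicit.c`. The point is that each entry an im2col matrix would hold is just the input shifted by one of the 9 kernel offsets, so it can be read in place instead of being copied to a buffer first.

```c
#include <stddef.h>

/* 3x3, stride-1, no-padding convolution without an im2col buffer.
 * in:  [cin][h][w]
 * wgt: [cout][cin][3][3]
 * out: [cout][h-2][w-2]
 * Layouts and the lack of padding are assumptions for this sketch. */
void conv3x3_s1_implicit_ref(const float *in, const float *wgt, float *out,
                             size_t cin, size_t cout, size_t h, size_t w)
{
    size_t oh = h - 2, ow = w - 2;

    for (size_t oc = 0; oc < cout; ++oc)
        for (size_t y = 0; y < oh; ++y)
            for (size_t x = 0; x < ow; ++x) {
                float sum = 0.0f;
                for (size_t ic = 0; ic < cin; ++ic)
                    for (size_t ky = 0; ky < 3; ++ky)
                        for (size_t kx = 0; kx < 3; ++kx)
                            /* The im2col entry (ic, ky, kx) for output pixel
                             * (y, x) is the input at (ic, y+ky, x+kx). */
                            sum += wgt[((oc * cin + ic) * 3 + ky) * 3 + kx]
                                 * in[(ic * h + y + ky) * w + x + kx];
                out[(oc * oh + y) * ow + x] = sum;
            }
}
```

A production kernel would feed these shifted reads into the GEMM microkernel rather than a scalar sum, but the indexing is the same.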
| Scenario | nn2? |
|---|---|
| Server-side, x86 CPU only | Yes — straight 1.5–2× over ORT |
| Browser | Not directly (use onnxruntime-web); a WASM port is on the roadmap |
| Mobile ARM | Yes once the NEON build script lands |
| ESP32 / iMX edge | In progress via PR #3 |
| NVR / surveillance product | Yes — entire decode + detect + recognize + spoof pipeline in <2 MB |
cd nn2
bash build.sh # YOLOv8 path
bash build_antispoof.sh # adds MiniFASNet port
Requires GCC 11+ with AVX2 (AVX-512 auto-detected at runtime).
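Runtime detection of this kind is commonly done with the GCC/Clang CPU feature builtins. The sketch below shows one way it could look; the function name and the returned kernel labels are hypothetical, and nn2's actual dispatch code may be structured differently.

```c
/* Pick the widest kernel family the running CPU supports.
 * The returned labels are placeholders for this sketch. On non-x86
 * targets, or with compilers lacking the builtins, it falls back to
 * "generic". */
const char *pick_gemm_kernel(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();                      /* populate CPU feature data */
    if (__builtin_cpu_supports("avx512f"))
        return "avx512";
    if (__builtin_cpu_supports("avx2"))
        return "avx2";
#endif
    return "generic";
}
```

Calling this once at engine init and storing a function pointer to the selected kernel keeps the check out of the inference hot path.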