nn2 Architecture

nn2 — Pure-C Inference Engine

Source · Apache 2.0 · 520 KB binary · zero dependencies

nn2 runs YOLOv8 detection and MiniFASNet anti-spoof inference 1.5–2× faster than ONNX Runtime on the same x86 CPU, in a 520 KB binary of pure C with no Python, no shared libraries, and no GPU. The same source compiles for ARM NEON (Apple Silicon, Raspberry Pi); ESP32-P4 and iMX NPU backends are in PR #3 from @navado.

What's inside

nn2/src/
  gemm.c                — column-panel SGEMM, MR=6 NR=32 inner kernel
  gemm_avx512.S         — hand-written AVX-512 6×32 microkernel
  gemm_int8.c           — INT8 GEMM with vpmaddubsw / vpdpbusd (VNNI)
  gemm_neon.h           — AArch64 NEON fallback
  conv.c                — top-level conv dispatcher
  conv_implicit.c       — implicit im2col for 3×3 s=1 (zero buffer)
  conv_tiled.c          — KC-blocking for cache locality
  conv_fused.c          — fused conv + bias + activation (saves a pass)
  winograd.c            — F(2,3) Winograd for 3×3 convs
  ops.c                 — depthwise conv, SiLU, sigmoid, maxpool, upsample
  antispoof_ops.c       — PReLU, GAP, channel mul (added for MiniFASNet)
  decode.c              — DFL decode + greedy NMS for YOLO outputs
  pool.c                — lock-free thread pool (WaitOnAddress / futex)
  net.c                 — YOLOv8n forward pass (zero-copy C2f)
  minifasnet.c          — MiniFASNet forward (1×1 expand / 3×3 DW / SE)
  nvr_prod.c            — production NVR server with embedded web UI

Performance progression on YOLOv8n @ 320×320

Step	Latency	Speedup vs previous step
Naive C loop	917.0 ms	1×
AVX2 GEMM	120.0 ms	7.6×
AVX-512 6×32	44.0 ms	2.7×
Fused bias + SiLU	26.0 ms	1.7×
Parallel im2col	16.0 ms	1.6×
A-packing	13.0 ms	1.2×
Implicit im2col	12.6 ms	1.03×
AVX-512 maxpool	8.9 ms	1.4×
KC-blocking + 2D tiling	8.5 ms	1.05×
Fused residual	8.1 ms	1.05×
Final	8.1 ms	113× (cumulative, vs naive C loop)

Compared with ONNX Runtime 1.23 on the same machine: 12.7 ms (ORT) vs 8.5 ms (nn2), about 1.5× faster.

Smart NVR pipeline (motion gate + Kalman tracking): 0.56 ms average per frame across cameras, so one Intel i5-11500 core handles decode and detection for ~70 IP cameras simultaneously.
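The camera count follows from simple throughput arithmetic. The sketch below is only an illustration: `cameras_per_core` is a hypothetical helper, and the 25 fps per-camera rate is an assumption (the frame rate is not stated above). Only the 0.56 ms figure comes from the measurement.

```c
/* Hypothetical helper, not part of nn2: how many camera streams fit on one
 * core, given the average per-frame cost and an assumed per-camera rate. */
int cameras_per_core(double ms_per_frame, double fps_per_camera)
{
    double frames_per_second = 1000.0 / ms_per_frame;  /* core throughput */
    return (int)(frames_per_second / fps_per_camera);  /* round down */
}
```

Calling `cameras_per_core(0.56, 25.0)` gives 71 under that assumption, which is in line with the ~70 cameras quoted. A lower per-camera frame rate would raise the count proportionally.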

MiniFASNet port

Same approach, different model. We added these AVX-512 ops to support the MobileFaceNet-style architecture of MiniFASNet:

nn2_prelu — per-channel parametric ReLU
nn2_global_avg_pool — for the SE squeeze step
nn2_channel_mul — for SE excitation
nn2_add_inplace — for residual connections
nn2_linear — fully-connected layer with bias
nn2_softmax
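For readers unfamiliar with these ops, here is a scalar reference sketch of three of them. It is not the nn2 source: the `ref_` names, the signatures, and the CHW float layout are assumptions made for illustration, and the real ops are AVX-512 implementations.

```c
#include <stddef.h>

/* Per-channel parametric ReLU on a CHW tensor, in place:
 * y = x if x > 0, otherwise slope[c] * x. */
void ref_prelu(float *x, const float *slope, size_t channels, size_t hw)
{
    for (size_t c = 0; c < channels; c++)
        for (size_t i = 0; i < hw; i++) {
            float v = x[c * hw + i];
            x[c * hw + i] = v > 0.0f ? v : slope[c] * v;
        }
}

/* Global average pool: one mean per channel (the SE squeeze step). */
void ref_global_avg_pool(const float *x, float *out, size_t channels, size_t hw)
{
    for (size_t c = 0; c < channels; c++) {
        float sum = 0.0f;
        for (size_t i = 0; i < hw; i++)
            sum += x[c * hw + i];
        out[c] = sum / (float)hw;
    }
}

/* Channel multiply, in place: scale every element of channel c by gate[c]
 * (the SE excitation step). */
void ref_channel_mul(float *x, const float *gate, size_t channels, size_t hw)
{
    for (size_t c = 0; c < channels; c++)
        for (size_t i = 0; i < hw; i++)
            x[c * hw + i] *= gate[c];
}
```

In an SE block the squeeze output would pass through small fully-connected layers and a sigmoid before being used as `gate`; those steps are omitted here to keep the sketch short.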

Result on the V2 + V1SE ensemble (the production setup):

Engine	Single model	Ensemble	nn2 speedup (ensemble)
nn2	0.70 ms	1.43 ms	–
ONNX Runtime 1.23	1.33 ms	2.92 ms	2.03×

Predictions are byte-identical to ONNX Runtime and PyTorch on sample images. The 2× gap on MiniFASNet is wider than the 1.5× gap on YOLOv8n because MiniFASNet has more small-matrix ops, where ORT's per-op dispatch overhead dominates.

Key SIMD tricks (worth reading the code for)

MR=6 / NR=32 microkernel: the inner GEMM block multiplies 6 rows of A by 32 columns of B and accumulates the 6×32 tile in 12 ZMM registers (6 rows × 2 registers of 16 floats each). That is 12 of the 32 AVX-512 registers; the remaining 20 are available for B loads and prefetch.
A-packing at load time: convolution weights are pre-packed into the BLIS-style A panel layout once, at engine init. Inference then reads them sequentially, without any stride/lda gymnastics.
Implicit im2col for 3×3 s=1: instead of materializing the im2col matrix, the conv kernel loads 9 strided columns directly from the input feature map. This skips the im2col buffer pass entirely on the most common conv shape.
Fused bias+SiLU+residual: the final accumulator pass applies bias, activation, and a residual add while writing the conv output, all in a single memory traversal. Cuts L2 round-trips by 3×.
Lock-free thread pool: futex on Linux, WaitOnAddress on Windows. ~10 ns wake latency, scales cleanly to all cores. Each conv channel range is one job.
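To make the tile shape, the packed A layout, and the fused epilogue concrete, here is a portable scalar sketch. It is not the nn2 kernel (that one is hand-written AVX-512 assembly). The function names, signatures, panel layouts, and the choice to apply the residual after the activation are all assumptions. The activation is passed as a callback only to keep the sketch free of a math-library dependency; nn2 fuses SiLU directly.

```c
#include <stddef.h>

#define MR 6    /* rows of A per tile */
#define NR 32   /* columns of B per tile */

typedef float (*activation_fn)(float);

/* Pack up to MR rows of row-major A (leading dimension lda) into a panel
 * where the MR values for step k are contiguous: panel[k * MR + i].
 * Missing rows are zero-padded. Intended to run once, at load time. */
void pack_a_panel(const float *a, size_t lda, int rows, size_t kc, float *panel)
{
    for (size_t k = 0; k < kc; k++)
        for (int i = 0; i < MR; i++)
            panel[k * MR + i] = (i < rows) ? a[(size_t)i * lda + k] : 0.0f;
}

/* C tile (MR x NR) = act(A_panel * B_panel + bias) + residual.
 * b_panel[k * NR + j] holds the NR values for step k.
 * bias has one value per row (output channel).
 * act and residual may be NULL; residual uses the same stride ldc as c. */
void microkernel_6x32(size_t kc, const float *a_panel, const float *b_panel,
                      const float *bias, activation_fn act,
                      const float *residual, float *c, size_t ldc)
{
    /* 6 x 32 floats = 192 floats, i.e. 12 registers of 16 floats each. */
    float acc[MR][NR] = {{0.0f}};

    for (size_t k = 0; k < kc; k++)
        for (int i = 0; i < MR; i++) {
            float a = a_panel[k * MR + i];  /* a broadcast in a SIMD kernel */
            for (int j = 0; j < NR; j++)
                acc[i][j] += a * b_panel[k * NR + j];
        }

    /* Fused epilogue: bias, activation, and residual in one pass, so the
     * tile is written to memory exactly once. */
    for (int i = 0; i < MR; i++)
        for (int j = 0; j < NR; j++) {
            float v = acc[i][j] + bias[i];
            if (act)
                v = act(v);
            if (residual)
                v += residual[(size_t)i * ldc + j];
            c[(size_t)i * ldc + j] = v;
        }
}
```

Because both panels are read with unit stride in the inner loops, the kernel needs no `lda` arithmetic at inference time, which is the point of packing weights at init.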

When to use

Scenario	nn2?
Server-side, x86 CPU only	Yes — straight 1.5–2× over ORT
Browser	Not directly (use `onnxruntime-web`); a WASM port is on the roadmap
Mobile ARM	Yes once the NEON build script lands
ESP32 / iMX edge	In progress via PR #3
NVR / surveillance product	Yes — entire decode + detect + recognize + spoof pipeline in <2 MB

Build

cd nn2
bash build.sh                # YOLOv8 path
bash build_antispoof.sh      # adds MiniFASNet port

Requires GCC 11+ and a CPU with AVX2; AVX-512 is auto-detected at runtime.
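One common way to do this kind of runtime detection with GCC is the `__builtin_cpu_supports` builtin. The sketch below shows that approach; whether nn2 uses this builtin or raw CPUID is not stated here, and `isa_level`, `detect_isa`, and `isa_name` are hypothetical names.

```c
typedef enum { ISA_SCALAR, ISA_AVX2, ISA_AVX512 } isa_level;

/* Pick the widest supported kernel family at runtime.
 * On non-x86 targets or other compilers this falls back to ISA_SCALAR. */
isa_level detect_isa(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  /* populate the CPU feature data */
    if (__builtin_cpu_supports("avx512f"))
        return ISA_AVX512;
    if (__builtin_cpu_supports("avx2"))
        return ISA_AVX2;
#endif
    return ISA_SCALAR;
}

/* Human-readable name, e.g. for a startup log line. */
const char *isa_name(isa_level isa)
{
    switch (isa) {
    case ISA_AVX512: return "avx512";
    case ISA_AVX2:   return "avx2";
    default:         return "scalar";
    }
}
```

An engine would typically call `detect_isa()` once at init and store function pointers to the matching GEMM and pooling kernels, so the hot path carries no per-call feature checks.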
