nn2 Architecture

Baurzhan Atinov edited this page May 14, 2026 · 1 revision

nn2 — Pure-C Inference Engine

Source · Apache 2.0 · 520 KB binary · zero dependencies

nn2 runs YOLOv8 detection and MiniFASNet anti-spoof inference 1.5–2× faster than ONNX Runtime on the same x86 CPU, in 520 KB of pure C with no Python, no shared libraries, and no GPU. The same source compiles for ARM NEON (Apple Silicon, Raspberry Pi); ESP32-P4 and iMX NPU backends are in PR #3 from @navado.


What's inside

nn2/src/
  gemm.c                — column-panel SGEMM, MR=6 NR=32 inner kernel
  gemm_avx512.S         — hand-written AVX-512 6×32 microkernel
  gemm_int8.c           — INT8 GEMM with vpmaddubsw / vpdpbusd (VNNI)
  gemm_neon.h           — AArch64 NEON fallback
  conv.c                — top-level conv dispatcher
  conv_implicit.c       — implicit im2col for 3×3 s=1 (zero buffer)
  conv_tiled.c          — KC-blocking for cache locality
  conv_fused.c          — fused conv + bias + activation (saves a pass)
  winograd.c            — F(2,3) Winograd for 3×3 convs
  ops.c                 — depthwise conv, SiLU, sigmoid, maxpool, upsample
  antispoof_ops.c       — PReLU, GAP, channel mul (added for MiniFASNet)
  decode.c              — DFL decode + greedy NMS for YOLO outputs
  pool.c                — lock-free thread pool (WaitOnAddress / futex)
  net.c                 — YOLOv8n forward pass (zero-copy C2f)
  minifasnet.c          — MiniFASNet forward (1×1 expand / 3×3 DW / SE)
  nvr_prod.c            — production NVR server with embedded web UI

Performance progression on YOLOv8n @ 320×320

Step                       Latency     Speedup over previous step
Naive C loop               917.0 ms    (baseline)
AVX2 GEMM                  120.0 ms    7.6×
AVX-512 6×32                44.0 ms    2.7×
Fused bias + SiLU           26.0 ms    1.7×
Parallel im2col             16.0 ms    1.6×
A-packing                   13.0 ms    1.2×
Implicit im2col             12.6 ms    1.03×
AVX-512 maxpool              8.9 ms    1.4×
KC-blocking + 2D tiling      8.5 ms    1.05×
Fused residual               8.1 ms    1.05×
Final                        8.1 ms    113× over naive

Against ONNX Runtime 1.23 on the same machine: 12.7 ms (ORT) vs 8.5 ms (nn2), about 1.5× faster.

Smart NVR pipeline (motion gate + Kalman tracking): 0.56 ms average per frame across cameras, so one Intel i5-11500 core decodes + detects on ~70 IP cameras simultaneously.
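The camera count follows from the per-frame budget. The sketch below is a back-of-envelope check, assuming each camera streams at 25 fps; the frame rate is not stated here, and `cameras_per_core` is a hypothetical helper, not part of nn2.

```c
/* Cameras one core can serve, given the average per-frame cost in
 * milliseconds and the per-camera frame rate (an assumed input). */
int cameras_per_core(double ms_per_frame, double fps_per_camera)
{
    double frames_per_second = 1000.0 / ms_per_frame; /* core throughput */
    return (int)(frames_per_second / fps_per_camera); /* round down */
}
```

At 0.56 ms per frame one core processes about 1786 frames per second, so `cameras_per_core(0.56, 25.0)` gives 71, in line with the ~70 figure.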


MiniFASNet port

Same approach, different model. We added these AVX-512 ops to support the MobileFaceNet-style architecture of MiniFASNet:

  • nn2_prelu — per-channel parametric ReLU
  • nn2_global_avg_pool — for the SE squeeze step
  • nn2_channel_mul — for SE excitation
  • nn2_add_inplace — for residual connections
  • nn2_linear — fully-connected layer with bias
  • nn2_softmax
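For orientation, here is a scalar reference sketch of three of these ops on a float tensor in channel-major [C][H*W] layout. The `ref_` names, signatures, and layout are assumptions for illustration; the actual nn2 functions are AVX-512 kernels and may take different arguments.

```c
#include <stddef.h>

/* Per-channel PReLU, in place: y = x if x > 0, else slope[c] * x. */
void ref_prelu(float *x, const float *slope, size_t channels, size_t hw)
{
    for (size_t c = 0; c < channels; c++)
        for (size_t i = 0; i < hw; i++) {
            float v = x[c * hw + i];
            x[c * hw + i] = v > 0.0f ? v : slope[c] * v;
        }
}

/* Global average pool: one mean per channel (the SE squeeze step). */
void ref_global_avg_pool(const float *x, float *out, size_t channels, size_t hw)
{
    for (size_t c = 0; c < channels; c++) {
        float sum = 0.0f;
        for (size_t i = 0; i < hw; i++)
            sum += x[c * hw + i];
        out[c] = sum / (float)hw;
    }
}

/* Channel multiply, in place: scale every element of channel c by
 * scale[c] (the SE excitation applied back to the feature map). */
void ref_channel_mul(float *x, const float *scale, size_t channels, size_t hw)
{
    for (size_t c = 0; c < channels; c++)
        for (size_t i = 0; i < hw; i++)
            x[c * hw + i] *= scale[c];
}
```

In an SE block the pooled vector would pass through two small fully-connected layers and a sigmoid before being used as `scale`; that part is omitted here.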

Result on the V2 + V1SE ensemble (the production setup):

Engine               Single model   Ensemble   Speedup
nn2                  0.70 ms        1.43 ms
ONNX Runtime 1.23    1.33 ms        2.92 ms    2.03×

Predictions are byte-identical to ONNX Runtime and PyTorch on sample images. The 2× gap on MiniFASNet is wider than the gap on YOLO because MiniFASNet has more small-matrix ops, where ORT's per-op dispatch overhead dominates.


Key SIMD tricks (worth reading the code for)

  1. MR=6 / NR=32 microkernel: the inner GEMM block is 6 rows of A times 32 columns of B, accumulated in 12 ZMM registers (6 × 32 floats at 16 floats per register). That is 12 of the 32 registers in the AVX-512 file, which leaves the rest for B loads and A broadcasts.

  2. A-packing at load time — convolution weights are pre-packed into the BLIS-style A panel layout once, at engine init. Inference reads sequentially without any stride/lda gymnastics.

  3. Implicit im2col for 3×3 s=1 — instead of materializing the im2col matrix, the conv kernel loads 9 strided columns directly from the input feature map. Saves the im2col buffer pass entirely on the most common conv shape.

  4. Fused bias+SiLU+residual — the final accumulator pass writes the conv output through bias, activation, and a residual add in a single memory traversal. Cuts L2 round-trips by 3×.

  5. Lock-free thread pool — futex on Linux, WaitOnAddress on Windows. ~10 ns wake latency, scales to all cores cleanly. Each conv channel range is one job.
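Items 1 and 2 can be illustrated with a portable scalar sketch: pack an MR-row slice of A so that the kernel reads it sequentially, then accumulate a 6×32 tile of C. The function names, the packed layouts, and the zero-padding of short panels are assumptions for illustration; the real kernel in gemm_avx512.S is hand-written assembly.

```c
#include <stddef.h>

enum { MR = 6, NR = 32 };

/* Pack `rows` (<= MR) rows of row-major A into a k x MR panel so the
 * kernel reads MR values per k-step contiguously. Missing rows are
 * zero-padded. */
void pack_a_panel(const float *a, size_t lda, size_t rows, size_t k,
                  float *packed)
{
    for (size_t p = 0; p < k; p++)
        for (size_t i = 0; i < MR; i++)
            packed[p * MR + i] = i < rows ? a[i * lda + p] : 0.0f;
}

/* C[6x32] += A_panel[6xk] * B_panel[kx32]. b_packed stores each k-step
 * as 32 contiguous floats. In an AVX-512 version, acc would live in 12
 * ZMM registers (6 rows x 2 vectors of 16 floats). */
void microkernel_6x32(size_t k, const float *a_packed, const float *b_packed,
                      float *c, size_t ldc)
{
    float acc[MR][NR] = {{0}};
    for (size_t p = 0; p < k; p++) {
        const float *b_row = b_packed + p * NR;
        for (size_t i = 0; i < MR; i++) {
            float a_val = a_packed[p * MR + i]; /* a broadcast in SIMD */
            for (size_t j = 0; j < NR; j++)
                acc[i][j] += a_val * b_row[j];  /* an FMA in SIMD */
        }
    }
    for (size_t i = 0; i < MR; i++)
        for (size_t j = 0; j < NR; j++)
            c[i * ldc + j] += acc[i][j];
}
```

Because the weights never change, `pack_a_panel` would run once at init, and every inference call would go straight to the microkernel.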
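The implicit im2col idea in item 3 can be sketched in scalar C. An explicit im2col would first copy the input into a (C·9) × (OH·OW) matrix; here the same K index is mapped back to an input address on the fly, so no buffer is built. The function name, the [C][H][W] layout, and the lack of padding are assumptions for illustration, not the conv_implicit.c interface.

```c
#include <stddef.h>

/* 3x3 convolution, stride 1, no padding.
 * in:  [channels][height][width]
 * wt:  [out_channels][channels * 9]
 * out: [out_channels][height - 2][width - 2] */
void conv3x3_implicit(const float *in, const float *wt, float *out,
                      size_t channels, size_t height, size_t width,
                      size_t out_channels)
{
    size_t oh = height - 2, ow = width - 2;
    for (size_t oc = 0; oc < out_channels; oc++)
        for (size_t y = 0; y < oh; y++)
            for (size_t x = 0; x < ow; x++) {
                float acc = 0.0f;
                /* k walks the GEMM K dimension; decode it into
                 * (channel, ky, kx) instead of reading an im2col row. */
                for (size_t k = 0; k < channels * 9; k++) {
                    size_t c = k / 9, ky = (k % 9) / 3, kx = k % 3;
                    acc += wt[oc * channels * 9 + k] *
                           in[(c * height + y + ky) * width + x + kx];
                }
                out[(oc * oh + y) * ow + x] = acc;
            }
}
```

A vectorized version would load runs of adjacent output pixels for each of the 9 taps rather than one scalar at a time, but the address mapping is the same.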


When to use

Scenario                      nn2?
Server-side, x86 CPU only     Yes: straight 1.5–2× over ORT
Browser                       Not directly (use onnxruntime-web); a WASM port is on the roadmap
Mobile ARM                    Yes, once the NEON build script lands
ESP32 / iMX edge              In progress via PR #3
NVR / surveillance product    Yes: entire decode + detect + recognize + spoof pipeline in <2 MB

Build

cd nn2
bash build.sh                # YOLOv8 path
bash build_antispoof.sh      # adds MiniFASNet port

Requires GCC 11+ and an x86 CPU with AVX2; AVX-512 support is detected at runtime.
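How nn2 performs this detection is not described here. One common approach with GCC, shown as a sketch, uses the compiler's CPU-feature builtins; `pick_simd_width` is a hypothetical name.

```c
/* Returns 512 if the CPU reports AVX-512F, 256 if it reports AVX2,
 * and 0 otherwise (including non-x86 targets). */
int pick_simd_width(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                  /* populate the feature cache */
    if (__builtin_cpu_supports("avx512f"))
        return 512;
    if (__builtin_cpu_supports("avx2"))
        return 256;
#endif
    return 0;
}
```

A dispatcher would call this once at startup and store a function pointer to the matching GEMM kernel, so the check stays out of the hot path.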