Self-trained Apache-licensed face recognition weights for the FaceX runtime. Four size points to fit different hardware budgets, all sharing the same parametric C engine.
| file | params | INT8 size | FP32 size | MS1M train acc |
|---|---|---|---|---|
| weights/facex_nano.bin | 199K | ~200 KB | 827 KB | 15.4% |
| weights/facex_tiny.bin | 452K | ~450 KB | 1.86 MB | 25.1% |
| weights/facex_standard.bin | 968K | ~1 MB | 3.95 MB | 38.2% |
| weights/facex_xs.bin | 2.08M | ~2 MB | 8.42 MB | 50.9% |
All trained from scratch on MS1M-RefineV2 (5.82M images, 85,742 IDs) with ArcFace (s=64, m=0.5), AdamW lr=1e-3 cosine, fp32. No upstream pretrained weights — fully Apache-licensed.
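To make the loss settings concrete, here is a minimal sketch of the ArcFace additive angular margin with s=64 and m=0.5. It illustrates the published formula only and is not taken from this repo's training code; the function name `arcface_logit` is hypothetical, and the sketch ignores the special handling some implementations apply when theta + m exceeds pi.

```python
import math

def arcface_logit(cos_theta: float, is_target: bool,
                  s: float = 64.0, m: float = 0.5) -> float:
    """Scaled logit for one class, given cos(theta) between the
    L2-normalized embedding and the L2-normalized class center."""
    # Clamp to the valid acos domain to guard against rounding error.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    if is_target:
        # The margin m (radians) is added to the ground-truth class angle only.
        theta = math.acos(cos_theta)
        return s * math.cos(theta + m)
    return s * cos_theta

# Same cosine, with and without the margin: the target logit is penalized,
# which forces the network to pull same-identity embeddings closer together.
print(arcface_logit(0.8, is_target=True))
print(arcface_logit(0.8, is_target=False))
```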
Each is a MobileFaceNet (Chen et al. 2018) variant scaled by a width multiplier. Topology:
Input 3x112x112
-> Stem Conv 3x3 s=2 + BN + PReLU (-> 56x56)
-> DW Conv 3x3 s=1 + BN + PReLU
-> Stage 1: 5x InvertedResidual t=2, s=2 first (-> 28x28)
-> Stage 2: 1x InvertedResidual t=4 s=2 (-> 14x14)
-> Stage 3: 6x InvertedResidual t=2 s=1
-> Stage 4: 1x InvertedResidual t=4 s=2 (-> 7x7)
-> Stage 5: 2x InvertedResidual t=2 s=1
-> Conv 1x1 + BN + PReLU (-> final_c)
-> DW Conv 7x7 + BN (linear GDConv) (-> 1x1)
-> Conv 1x1 + BN (-> emb_dim)
-> L2 normalize
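The spatial sizes in the listing can be checked with the standard convolution output-size formula. The sketch below assumes 3x3 kernels with padding 1 for the strided layers and no padding for the 7x7 GDConv; those padding values are assumptions, not read from the source, and the names `conv_out`, `STRIDES`, and `trace` are hypothetical.

```python
def conv_out(size: int, kernel: int, stride: int, pad: int) -> int:
    """Standard conv output-size formula for one spatial dimension."""
    return (size + 2 * pad - kernel) // stride + 1

# (layer name, stride applied at that point in the listing)
STRIDES = [
    ("stem", 2), ("dw", 1), ("stage1", 2), ("stage2", 2),
    ("stage3", 1), ("stage4", 2), ("stage5", 1),
]

def trace(size: int = 112) -> list[tuple[str, int]]:
    """Walk the strides and record the feature-map size after each layer."""
    out = []
    for name, stride in STRIDES:
        size = conv_out(size, kernel=3, stride=stride, pad=1)
        out.append((name, size))
    return out

for name, size in trace():
    print(f"{name:7s} -> {size}x{size}")

# The 7x7 depthwise GDConv then collapses the 7x7 map to a single position.
print("gdconv  ->", conv_out(7, kernel=7, stride=1, pad=0))
```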
Width multipliers used: nano=0.36, tiny=0.55, standard=0.90, xs=1.35. Embedding dim: 256 for nano, 512 for the rest.
The .bin format is self-describing: a binary header (~80 bytes) names the stage shapes, and a JSON copy follows it for debugging. The engine reads only the binary header.
"EFM3" (4 bytes)
version u32 = 3
arch_header (80 bytes — see binformat.py)
json_len u32 + JSON
n_tensors u32
[u32 size + FP32 bytes] x n_tensors
Tensor order is fixed by binformat.tensor_layout(arch) and is the
contract between export_bin.py and the C engine.
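A reader for this container layout might look like the sketch below. It is not the repo's `binformat.py`: it treats the 80-byte arch header as an opaque blob, and it assumes little-endian integers, UTF-8 JSON, and that each tensor's `size` field is an element count rather than a byte count. The names `read_u32` and `read_efm3` are hypothetical.

```python
import io
import json
import struct

ARCH_HEADER_BYTES = 80  # opaque here; see binformat.py for the real fields

def read_u32(f) -> int:
    data = f.read(4)
    if len(data) != 4:
        raise EOFError("truncated file")
    return struct.unpack("<I", data)[0]  # little-endian is an assumption

def read_efm3(f):
    """Parse the container; returns (arch_header_bytes, json_dict, tensors)."""
    if f.read(4) != b"EFM3":
        raise ValueError("bad magic")
    version = read_u32(f)
    if version != 3:
        raise ValueError(f"unsupported version {version}")
    arch_header = f.read(ARCH_HEADER_BYTES)
    if len(arch_header) != ARCH_HEADER_BYTES:
        raise EOFError("truncated arch header")
    meta = json.loads(f.read(read_u32(f)).decode("utf-8"))
    tensors = []
    for _ in range(read_u32(f)):
        n = read_u32(f)  # assumed: number of fp32 elements, not bytes
        tensors.append(struct.unpack(f"<{n}f", f.read(4 * n)))
    return arch_header, meta, tensors

# Round-trip a tiny synthetic file (one tensor with two values).
blob = (b"EFM3" + struct.pack("<I", 3) + bytes(ARCH_HEADER_BYTES)
        + struct.pack("<I", 2) + b"{}"
        + struct.pack("<I", 1) + struct.pack("<I", 2)
        + struct.pack("<2f", 1.0, -0.5))
header, meta, tensors = read_efm3(io.BytesIO(blob))
print(len(header), meta, tensors)
```

Because the tensors carry no names, a reader like this only becomes useful once the list is zipped against the order given by `binformat.tensor_layout(arch)`.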
src/facex_mfn.c — single-file parametric engine.
- Loads any of the 4 .bin files based on the embedded arch header.
- BatchNorm folded into the preceding conv at load time.
- AVX2 fast paths for 1x1 conv (the bulk of MFN compute), 3x3 DW, GDConv, PReLU, residual add. Plain-C fallback for stem.
- Single-threaded.
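The BatchNorm folding mentioned above follows the usual identity: with scale = gamma / sqrt(var + eps), the folded weights are w * scale and the folded bias is (b - mean) * scale + beta. The sketch below shows this for flat per-channel weight lists in plain Python. It is not the C engine's code; `fold_bn` is a hypothetical name, and the default eps of 1e-5 is an assumption (it matches PyTorch's BatchNorm default).

```python
import math

def fold_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel BatchNorm into the preceding conv.

    weights: one flat list of weights per output channel.
    bias, gamma, beta, mean, var: one value per output channel.
    Returns (folded_weights, folded_bias).
    """
    folded_w, folded_b = [], []
    for w_c, b_c, g, b, mu, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([w * scale for w in w_c])
        folded_b.append((b_c - mu) * scale + b)
    return folded_w, folded_b

# One output channel with two weights; eps=0 keeps the numbers exact.
w, b = fold_bn([[2.0, 4.0]], [1.0], gamma=[0.5], beta=[0.1],
               mean=[1.0], var=[4.0], eps=0.0)
print(w, b)
```

After folding, inference needs one multiply-add pass per conv instead of a conv followed by a separate normalization pass.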
make mfn-cli # standalone diagnostic CLI
make mfn-example # tiny "embed + similarity" demo

#include "facex_mfn.h"
MfnEngine engine;
mfn_engine_init("weights/facex_standard.bin", &engine);
int D = mfn_embedding_dim(&engine); // 256 (nano) or 512 (others)
float emb[512];
mfn_engine_forward(&engine, input_chw, emb);
// input_chw: [3 * 112 * 112] fp32, values in [-1, 1], CHW layout
float sim = mfn_similarity(emb_a, emb_b, D);
// > 0.3 typically = same person
mfn_engine_free(&engine);

Preprocessing is the same as InsightFace: 112×112 RGB, aligned (5-point), (pixel - 127.5) / 128,
CHW layout. You need an external detector (e.g. YuNet, bundled in
weights/yunet_*.onnx) to detect and align faces before feeding them in.
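As an illustration of the normalization and layout step only (alignment is out of scope here), the sketch below converts an already aligned, interleaved 8-bit RGB buffer into the planar float layout described above. The function name `preprocess` is hypothetical; a C caller would fill `input_chw` with the same index arithmetic.

```python
def preprocess(rgb_hwc: bytes, height: int = 112, width: int = 112) -> list[float]:
    """Convert interleaved 8-bit RGB (HWC) to normalized planar floats (CHW)."""
    if len(rgb_hwc) != height * width * 3:
        raise ValueError("unexpected buffer size")
    plane = height * width
    out = [0.0] * (3 * plane)
    for i in range(plane):      # i = y * width + x
        for c in range(3):      # 0 = R, 1 = G, 2 = B
            out[c * plane + i] = (rgb_hwc[3 * i + c] - 127.5) / 128.0
    return out

# A 1x2 test image: pixel 0 = (0, 128, 255), pixel 1 = (255, 0, 128).
chw = preprocess(bytes([0, 128, 255, 255, 0, 128]), height=1, width=2)
print(chw)  # R plane first, then G, then B
```

With this scaling, pixel values 0 and 255 map to -0.99609375 and 0.99609375, so the result stays inside the [-1, 1] range the engine expects.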
cd training/scripts
python verify_bin.py --arch standard \
--bin ../../weights/facex_standard.bin \
--ckpt ../runs/standard/last.pt

Runs the .bin file through a NumPy reference implementation of the same op-graph as the C engine and compares the output to the PyTorch model. Expected max error is ~1e-5 (round-trip from the fp32 file).
All four shipped models passed verification on commit:
nano: max_err = 1.28e-05
tiny: max_err = 1.64e-06
standard: max_err = 6.38e-06
xs: max_err = 3.04e-06
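The `max_err` figure is presumably the largest absolute element-wise difference between the two embeddings. A minimal sketch of that metric follows; `max_abs_err` is a hypothetical helper and may not match how `verify_bin.py` computes or names it.

```python
def max_abs_err(a, b) -> float:
    """Largest absolute element-wise difference between two vectors."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    return max(abs(x - y) for x, y in zip(a, b))

# Toy example: a reference embedding vs. a slightly perturbed copy.
reference = [0.25, -0.5, 0.125]
candidate = [0.25, -0.5 + 2e-6, 0.125 - 1e-6]
print(f"max_err = {max_abs_err(reference, candidate):.2e}")
```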
See training/README.md and training/RESUME.md for the dataset prep + training pipeline. Realistic per-arch wall-time on a single RTX 5060 Ti (16 GB), fp32 training, ArcFace + MS1M:
| arch | epochs | per epoch | total |
|---|---|---|---|
| nano | 40 | ~37 min | ~25 h |
| tiny | 35 | ~50 min | ~29 h |
| standard | 30 | ~80 min | ~40 h |
| xs | 30 | ~130 min | ~64 h |