Training Recipes

Baurzhan Atinov edited this page May 14, 2026 · 1 revision

Every model shipped in the demo was trained on a single RTX 5060 Ti (16 GB) in pure PyTorch with bf16 autocast. The recipes below are the exact commands used to produce the production checkpoints.

All scripts live under training/.


1. Recognition — MobileFaceNet × 4 variants

The backbone is the standard MobileFaceNet topology (Chen et al. 2018), width-scaled to four sizes. The ArcFace head applies its margin through the numerically stable angle-addition form.
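
The angle-addition form can be sketched for a single cosine as follows. This is a minimal scalar illustration, not the code in train.py: the function name, the margin of 0.5 rad and scale of 64 (the ArcFace paper defaults), and the linear fallback once θ + m passes π are all assumptions.

```python
import math

def arcface_target_logit(cos_theta: float, margin: float = 0.5, scale: float = 64.0) -> float:
    """Return the margin-adjusted logit for the ground-truth class.

    Hypothetical helper; margin and scale are assumed defaults, not values
    read from the training scripts.
    """
    # Clamp so rounding error in the cosine cannot make 1 - cos^2 negative.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    sin_theta = math.sqrt(1.0 - cos_theta * cos_theta)

    # Angle-addition identity: cos(t + m) = cos t * cos m - sin t * sin m.
    # This avoids acos(), whose gradient is unbounded near +/-1.
    phi = cos_theta * math.cos(margin) - sin_theta * math.sin(margin)

    # Once theta + m would exceed pi, cos(theta + m) stops decreasing, so
    # switch to a linear penalty that keeps the logit monotonic in theta.
    if cos_theta <= math.cos(math.pi - margin):
        phi = cos_theta - margin * math.sin(math.pi - margin)

    return scale * phi
```

In a real training loop the same expression would be applied element-wise to the target-class column of the cosine logits before the softmax cross-entropy.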

cd training/scripts

# Assumes MS1M-RefineV2 is already pre-decoded into a raw blob
# (see prepare_lfw.py and the dataset.py loader)
python train.py --arch nano     --epochs 25 --batch 512 --lr 1e-3
python train.py --arch tiny     --epochs 25 --batch 384 --lr 1e-3
python train.py --arch standard --epochs 25 --batch 256 --lr 1e-3
python train.py --arch xs       --epochs 25 --batch 192 --lr 1e-3
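
| Variant  | Params | LFW accuracy (after YuNet 5-pt alignment) | Training time |
|----------|--------|-------------------------------------------|---------------|
| nano     | 0.20 M | 95.62%                                    | ~6 h          |
| tiny     | 0.45 M | 96.85%                                    | ~8 h          |
| standard | 0.93 M | 98.25%                                    | ~10 h         |
| xs       | 2.07 M | 99.07%                                    | ~14 h         |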

Evaluate against the LFW 6,000-pair protocol after each epoch:

python lfw_eval.py --ckpt runs/xs/best.pt --pairs lfw_pairs.txt
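
For readers unfamiliar with the protocol: the 6,000 pairs are split into 10 folds, a similarity threshold is chosen on nine folds, accuracy is measured on the held-out fold, and the ten accuracies are averaged. The sketch below assumes cosine similarity scores have already been computed for each pair. The function names and the 0.01 threshold grid are hypothetical, and lfw_eval.py may be structured differently.

```python
from statistics import mean

def accuracy(scores, labels, threshold):
    """Fraction of pairs where (score >= threshold) matches the same-person label."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(scores)

def kfold_verification_accuracy(scores, labels, folds=10):
    """Mean held-out accuracy over contiguous folds.

    scores and labels are lists; len(scores) is assumed to be divisible by folds.
    """
    thresholds = [t / 100 for t in range(-100, 101)]  # cosine range in 0.01 steps
    fold_size = len(scores) // folds
    fold_accs = []
    for k in range(folds):
        lo, hi = k * fold_size, (k + 1) * fold_size
        train_s = scores[:lo] + scores[hi:]
        train_y = labels[:lo] + labels[hi:]
        # Pick the threshold that maximises accuracy on the nine training folds.
        best_thr = max(thresholds, key=lambda t: accuracy(train_s, train_y, t))
        # Score the held-out fold with that threshold only.
        fold_accs.append(accuracy(scores[lo:hi], labels[lo:hi], best_thr))
    return mean(fold_accs)
```

Choosing the threshold on the training folds keeps the reported number from being tuned on the pairs it is measured on.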

2. Face detector — own FCOS on WIDER FACE

cd training/face_detect/scripts

python prepare_wider.py    # downloads WIDER_train + WIDER_val (2.5 GB)
python train.py --epochs 80 --batch 32 --input-size 320
python export.py           # → wasm/facex_detect.onnx (~400 KB)

The detector has about 100 K params and FCOS-style anchor-free heads at strides 8, 16, and 32. The best production checkpoint reaches a recall of 0.275 on the full WIDER FACE val set. That figure is low because the set includes faces as small as 4 pixels, which a 320×320 input cannot resolve; in webcam use the practical recall is ~95%.
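
The arithmetic behind the small-face limit can be sketched as follows. With a 320×320 input, the three strides give 40×40, 20×20, and 10×10 grids, so the finest head places one prediction point every 8 pixels. The helper names and the 1024-pixel source size in the usage lines are illustrative assumptions.

```python
def fcos_grid_sizes(input_size=320, strides=(8, 16, 32)):
    """Map each stride to the side length of its square feature map."""
    return {s: input_size // s for s in strides}

def total_locations(input_size=320, strides=(8, 16, 32)):
    """Number of anchor-free prediction points summed over all heads."""
    return sum(n * n for n in fcos_grid_sizes(input_size, strides).values())

def resized_face_px(face_px, source_side, input_size=320):
    """Face size in pixels after the source's longer side is scaled to input_size."""
    return face_px * input_size / source_side

print(fcos_grid_sizes())         # {8: 40, 16: 20, 32: 10}
print(total_locations())         # 2100
print(resized_face_px(4, 1024))  # 1.25
```

A 4-pixel face in a 1024-pixel image shrinks to about 1.25 px at this input size, well below the 8-pixel spacing of the finest head.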


3. 98-point landmarks — WFLW

cd training/landmark/scripts
python prepare_wflw.py
python train.py --epochs 60 --batch 64
python export_lm.py

1.15 M params, MobileFaceNet-style backbone with a dense head. NME ~4.85% on the WFLW test split.
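
For reference, NME is the mean point-to-point landmark error divided by a normalizing distance. The sketch below assumes the inter-ocular normalization commonly used for WFLW (distance between the outer eye corners, taken here as indices 60 and 72 of the 98-point layout). The function name is hypothetical, and the project's evaluation code may normalize differently.

```python
import math

def nme(pred, target, left_idx=60, right_idx=72):
    """Normalized mean error for one face.

    pred and target are sequences of (x, y) landmarks in the same order.
    left_idx and right_idx select the ground-truth points whose distance
    is used as the normalizer (assumed: outer eye corners).
    """
    norm = math.dist(target[left_idx], target[right_idx])
    mean_err = sum(math.dist(p, t) for p, t in zip(pred, target)) / len(target)
    return mean_err / norm
```

Averaging this value over the test split and multiplying by 100 gives a percentage comparable to the figure quoted above.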


4. 576-point 3D mesh — MediaPipe distillation

cd training/landmark3d/scripts
python pre_decode.py        # pre-render labels via mediapipe
python train.py --epochs 40 --batch 128

Final error on the held-out validation set: xy 0.54 px, z 0.51 (normalized). For rendering, the WFLW 98-point model drives a thin-plate spline (TPS) warp over the 478 MediaPipe points, which yields the 576 visible mesh points in the demo.


5. Anti-spoof — port MiniFASNet to nn2

We don't retrain the anti-spoof model. We convert the upstream MinivisionAI weights (Apache 2.0) and run them through our own nn2 inference path, which is about 2× faster than ONNX Runtime.

cd training/Silent-Face-Anti-Spoofing
python convert_to_onnx.py   # → wasm/minifasnet_v2_27.onnx + v1se_40.onnx

cd ../nn2/tools
python export_minifasnet.py --variant v2   --ckpt ...  --output ../weights/minifasnet_v2_27.bin
python export_minifasnet.py --variant v1se --ckpt ...  --output ../weights/minifasnet_v1se_40.bin

# Build the nn2 antispoof binary and benchmark
cd .. && bash build_antispoof.sh
./nn2_antispoof.exe v2 weights/minifasnet_v2_27.bin tests/test_27.bin 200

6. Smile classifier

The recipe also applies to any tiny binary face attribute: smile, glasses, hat, etc.

cd training/smile/scripts

# Scrape ~500-1500 positives (Bing/DDGS image search via ddgs)
python scrape_positives.py

# Build dataset: positives + 3000 MS1M as negatives
python build_dataset.py

# Train TongueNet (47 K params, MobileNetV2-lite)
python train.py --epochs 30 --batch 128 --lr 1e-3
python export.py           # → wasm/facex_smile.onnx (187 KB)

Expected: F1 0.93–0.96 after 20 epochs (~10 min on the 5060 Ti). Don't use class weights: they bias the model toward the rare class and produce constant-100% predictions on neutral webcam frames.
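
F1 is a more useful target than accuracy here because the classes are imbalanced (roughly 500 to 1,500 positives against 3,000 negatives). A minimal sketch of the metric for binary predictions follows; the function name is hypothetical and not taken from the training scripts.

```python
def f1_score(preds, labels):
    """F1 for the positive class, given 0/1 predictions and 0/1 labels."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0  # without true positives, F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

A model that predicts "positive" for every frame gets perfect recall but poor precision: with one positive among four samples, `f1_score([1, 1, 1, 1], [1, 0, 0, 0])` is 0.4, which is how a constant-prediction failure shows up in this metric.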


Encrypt + ship to demo

cd wasm
python tools/encrypt_models.py    # encrypt every .onnx with AES-256-GCM → .enc
cp *.enc ../docs/demo/            # GitHub Pages serves from /docs

The key is written to .model_key at the repo root (gitignored). To rotate it, delete the file and re-encrypt; clients will then need the new key bytes (see Encrypted Weights).
