Add A100 PyTorch/CUDA YOLO-NAS benchmark submissions (3 variants) #5
Merged
3 new submissions for YOLO-NAS on NVIDIA A100-PCIE-40GB (MIG 7g.40gb
slice) with torch 2.6.0+cu124. Variants: s, m, l.
Provenance / libreyolo_commit
-----------------------------
These runs were executed against libreyolo at commit
3383a8f142a5decc735f362258f6851d3f026fa3 (the 106-add-d-fine-model-family
feature branch, which carries the YOLO-NAS port) plus a single small
local patch to libreyolo/models/yolonas/utils.py::postprocess. To stay
honest about provenance, libreyolo_commit on the 3 submission JSONs is
"unknown".
The local patch replaces a per-class Python NMS loop (one
torchvision.ops.nms call per surviving class) with a single
torchvision.ops.batched_nms call, plus a top-k=1000 pre-NMS filter that
mirrors super_gradients' YoloNASPostPredictionCallback default
(num_pre_nms_predictions=1000). Without this patch, YOLO-NAS at
conf=0.001 (COCO eval default) keeps ~all 8400 anchors past the conf
gate and dispatches one small NMS kernel per surviving class, which on
A100 MIG measured at ~700 ms/image. Patched: ~60 ms/image, over 10x faster.
Verified mAP-neutral: same images, same seeds, mAP_50_95 differs by
0.0004 between patched and unpatched (numerical reordering only).
batched_nms with the per-class idxs argument is mathematically
identical to the per-class loop, and top-k=1000 matches super_gradients'
COCO eval default.
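The equivalence claimed above can be illustrated without any GPU dependency. The following is a minimal pure-Python sketch, not the actual patch: the function names (`per_class_nms_loop`, `batched_nms_sketch`) are hypothetical, and the coordinate-offset trick used here is one common way to get class-aware suppression from a single class-agnostic NMS pass. It is assumed for illustration only and is not taken from the libreyolo or torchvision source.

```python
from collections import defaultdict


def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr):
    """Greedy class-agnostic NMS. Returns kept indices, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


def per_class_nms_loop(boxes, scores, labels, iou_thr):
    """Unpatched shape: one NMS call per class that still has candidates."""
    by_class = defaultdict(list)
    for i, c in enumerate(labels):
        by_class[c].append(i)
    keep = []
    for idxs in by_class.values():
        local = nms([boxes[i] for i in idxs], [scores[i] for i in idxs], iou_thr)
        keep.extend(idxs[k] for k in local)
    return sorted(keep, key=lambda i: -scores[i])


def batched_nms_sketch(boxes, scores, labels, iou_thr, top_k=1000):
    """Patched shape: top-k pre-filter, then a single NMS pass over all classes."""
    cand = sorted(range(len(boxes)), key=lambda i: -scores[i])[:top_k]
    if not cand:
        return []
    lo = min(min(boxes[i]) for i in cand)
    hi = max(max(boxes[i]) for i in cand)
    unit = (hi - lo) + 1.0  # shift large enough that classes can never overlap
    shifted = [[v + labels[i] * unit for v in boxes[i]] for i in cand]
    local = nms(shifted, [scores[i] for i in cand], iou_thr)
    return [cand[k] for k in local]


# Toy input: boxes 0 and 1 overlap and share a class; box 2 overlaps them
# but belongs to another class; box 3 is far away.
BOXES = [[0, 0, 10, 10], [1, 1, 11, 11], [1, 1, 11, 11], [50, 50, 60, 60]]
SCORES = [0.9, 0.8, 0.7, 0.6]
LABELS = [0, 0, 1, 0]
```

On this toy input both functions keep the same detections, because shifting each class into its own coordinate range means boxes from different classes never overlap. The `top_k` cut is the only step that can change the result, and only when more candidates than `top_k` pass the confidence gate.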
A LibreYOLO upstream issue is open to track moving the per-class loop
in libreyolo/utils/general.py::postprocess_detections to batched_nms
across the board (helps YOLOX / YOLOv9 / RT-DETR / D-FINE too, just
less catastrophically). When that lands, these submissions can be
backfilled with the merged libreyolo_commit.
Metadata changes
----------------
- support-matrix.json: append the 3 yolonas-{s,m,l} model ids. No SHA
bump (libreyolo_commit is "unknown" so the matrix doesn't gate this).
- website/src/data/metadata/families.json: add yolonas family (Deci,
acquired by NVIDIA; 2023).
- website/src/data/metadata/models.json: add 3 YOLO-NAS variant entries.
Measured COCO val2017 mAP@50-95 (paper reference in parens):
yolonas-s 0.4645 (~0.475)
yolonas-m 0.5053 (~0.516)
yolonas-l 0.5119 (~0.522)
Local: scripts/validate_submission.py and scripts/build_verified_results.py
both exit 0. generated/verified-results.v1.json regenerated and committed.
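For reference, the shortfall against the paper figures in the table above can be worked out with a few lines of Python. This is only an arithmetic sketch: the values are copied from the table, the paper numbers are the approximate ones quoted there, and the helper name `shortfall` is hypothetical.

```python
# Measured and (approximate) paper mAP@50-95 values copied from the table above.
MEASURED = {"yolonas-s": 0.4645, "yolonas-m": 0.5053, "yolonas-l": 0.5119}
PAPER = {"yolonas-s": 0.475, "yolonas-m": 0.516, "yolonas-l": 0.522}


def shortfall(measured, reference):
    """Absolute mAP gap (reference minus measured), rounded to 4 decimals."""
    return {name: round(reference[name] - measured[name], 4) for name in measured}


GAPS = shortfall(MEASURED, PAPER)
for name, gap in GAPS.items():
    relative = 100 * gap / PAPER[name]
    print(f"{name}: {gap:.4f} below paper ({relative:.1f}% relative)")
```

Each variant lands roughly one mAP point (about 0.010 to 0.011) below its paper reference.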
EHxuban11 added a commit that referenced this pull request on Apr 25, 2026
The first YOLO-NAS submission set (merged in #5) used LibreYOLO's existing
shared letterbox preprocessing, which differs from super_gradients' YOLO-NAS
COCO validation pipeline in two ways that the file's own docstring already
flagged ("A later parity pass can tighten this toward the exact SG processing
pipeline"):
- it resizes the longest side to 640, where super_gradients resizes to 636
- it pads at the top-left, where super_gradients center-pads to 640x640
Closing those two gaps (plus moving the harness's NMS IoU from 0.6 to 0.7 to
match super_gradients' default) recovers ~60% of the roughly 1-point mAP gap
to Deci's published numbers.
A LibreYOLO upstream issue and working branch
(113-yolo-nas-validation-preprocessing-diverges-from-super_gradients-map-loss)
cover the actual code change. libreyolo_commit on these submissions stays
"unknown" until the fix is merged into LibreYOLO main and a pinned commit is
added to support-matrix.json.
Numbers (COCO val2017, A100 PyTorch FP32):
yolonas-s 0.4711 (prev 0.4645, paper 0.475)
yolonas-m 0.5111 (prev 0.5053, paper 0.516)
yolonas-l 0.5184 (prev 0.5119, paper 0.522)
The remaining gap of ~0.004-0.005 per variant is consistent across s/m/l and
likely reflects FP16 vs FP32 plus minor cv2 vs PIL interpolation differences.
Local: scripts/validate_submission.py and scripts/build_verified_results.py
both exit 0. generated/verified-results.v1.json regenerated.
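The two preprocessing differences described in this commit come down to simple geometry. The sketch below is illustrative only: `letterbox_geometry` is a hypothetical helper, not LibreYOLO or super_gradients code, and the rounding mode (`round`) and the way an odd padding remainder is split are assumptions that may not match either library exactly.

```python
def letterbox_geometry(width, height, resize_longest=640, canvas=640, center=False):
    """Return (new_w, new_h, pad_left, pad_top) for a letterboxed image.

    resize_longest: target length of the longer image side after resizing.
    canvas:         side length of the square output canvas.
    center:         False pads only on the right/bottom (image stays top-left),
                    True splits the padding so the image is centered.
    """
    longest = max(width, height)
    # Multiply before dividing so exactly divisible cases stay exact.
    new_w = round(width * resize_longest / longest)
    new_h = round(height * resize_longest / longest)
    if center:
        pad_left = (canvas - new_w) // 2
        pad_top = (canvas - new_h) // 2
    else:
        pad_left, pad_top = 0, 0
    return new_w, new_h, pad_left, pad_top


# A 640x480 image under the two configurations described above.
SHARED_LETTERBOX = letterbox_geometry(640, 480)
SG_STYLE = letterbox_geometry(640, 480, resize_longest=636, center=True)
print("resize to 640, top-left pad:", SHARED_LETTERBOX)
print("resize to 636, center pad:  ", SG_STYLE)
```

For a 640x480 input, the first configuration leaves the image unscaled at the canvas origin, while the second shrinks it slightly and shifts it by the padding offsets. Predicted boxes therefore have to be mapped back with a different scale and offset in each case, so evaluating weights with a pipeline other than the one their reference numbers were measured with can cost some accuracy.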