Add A100 PyTorch/CUDA YOLO-NAS benchmark submissions (3 variants)#5

Merged: EHxuban11 merged 1 commit into main from benchmark/a100-pytorch-cuda-yolonas on Apr 25, 2026
Conversation

@EHxuban11
Contributor

3 new submissions for YOLO-NAS on NVIDIA A100-PCIE-40GB (MIG 7g.40gb slice) with torch 2.6.0+cu124. Variants: s, m, l.

Provenance / libreyolo_commit

These runs were executed against libreyolo at commit 3383a8f142a5decc735f362258f6851d3f026fa3 (the 106-add-d-fine-model-family feature branch, which carries the YOLO-NAS port) plus a single small local patch to libreyolo/models/yolonas/utils.py::postprocess. To stay honest about provenance, libreyolo_commit on the 3 submission JSONs is "unknown".

The local patch replaces a per-class Python NMS loop (one torchvision.ops.nms call per surviving class) with a single torchvision.ops.batched_nms call, plus a top-k=1000 pre-NMS filter that mirrors super_gradients' YoloNASPostPredictionCallback default (num_pre_nms_predictions=1000). Without this patch, YOLO-NAS at conf=0.001 (COCO eval default) keeps nearly all 8400 anchors past the conf gate and dispatches one small NMS kernel per surviving class, which measured ~700 ms/image on the A100 MIG slice. With the patch it measures ~60 ms/image, more than 10x faster.
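To make the equivalence concrete, here is a minimal pure-Python sketch (not the actual patch, which uses torchvision.ops.batched_nms on GPU tensors). It contrasts a per-class NMS loop with a single class-aware pass that shifts each class's boxes by a class-dependent offset, one of the strategies torchvision uses internally for batched_nms. The function names `per_class_nms` and `batched_nms_offset` are hypothetical, and the offset trick as written assumes non-negative box coordinates.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float], iou_thr: float) -> List[int]:
    """Greedy class-agnostic NMS; returns kept indices by descending score."""
    keep: List[int] = []
    for i in sorted(range(len(boxes)), key=lambda k: -scores[k]):
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


def per_class_nms(boxes, scores, labels, iou_thr) -> List[int]:
    """Slow path: one NMS call per class that has surviving boxes."""
    keep: List[int] = []
    for c in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        kept = nms([boxes[i] for i in idx], [scores[i] for i in idx], iou_thr)
        keep.extend(idx[k] for k in kept)
    return sorted(keep, key=lambda k: -scores[k])


def batched_nms_offset(boxes, scores, labels, iou_thr, top_k=1000) -> List[int]:
    """Single pass: pre-NMS top-k filter, then NMS on class-shifted boxes."""
    cand = sorted(range(len(boxes)), key=lambda k: -scores[k])[:top_k]
    if not cand:
        return []
    # Shift class c by c * offset so boxes of different classes never overlap
    # (assumes coordinates are non-negative).
    offset = max(max(boxes[i]) for i in cand) + 1.0
    shifted = [tuple(v + labels[i] * offset for v in boxes[i]) for i in cand]
    kept = nms(shifted, [scores[i] for i in cand], iou_thr)
    return [cand[k] for k in kept]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (0, 0, 10, 10), (50, 50, 60, 60), (51, 51, 61, 61)]
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [0, 0, 1, 1, 1]
print(per_class_nms(boxes, scores, labels, 0.5))
print(batched_nms_offset(boxes, scores, labels, 0.5))
```

On this toy input both functions keep the same indices as long as the top-k cut does not drop a box that would otherwise survive. The single-pass version issues one NMS call regardless of how many classes are present, which is the property the patch relies on.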

Verified mAP-neutral: same image, same seeds, mAP_50_95 differs by 0.0004 between patched and unpatched (numerical reordering only). batched_nms with the per-class idxs argument is mathematically identical to the per-class loop, and top-k=1000 matches super_gradients' COCO eval default.

A LibreYOLO upstream issue is open to track moving the per-class loop in libreyolo/utils/general.py::postprocess_detections to batched_nms across the board (helps YOLOX / YOLOv9 / RT-DETR / D-FINE too, just less catastrophically). When that lands, these submissions can be backfilled with the merged libreyolo_commit.

Metadata changes

  • support-matrix.json: append the 3 yolonas-{s,m,l} model ids. No SHA bump (libreyolo_commit is "unknown" so the matrix doesn't gate this).
  • website/src/data/metadata/families.json: add yolonas family (Deci, 2023; Deci was later acquired by NVIDIA).
  • website/src/data/metadata/models.json: add 3 YOLO-NAS variant entries.

Measured COCO val2017 mAP@50-95 (paper reference in parens):
yolonas-s 0.4645 (~0.475)
yolonas-m 0.5053 (~0.516)
yolonas-l 0.5119 (~0.522)

Local: scripts/validate_submission.py and scripts/build_verified_results.py both exit 0. generated/verified-results.v1.json regenerated and committed.

EHxuban11 merged commit f878a77 into main on Apr 25, 2026
1 check passed
EHxuban11 added a commit that referenced this pull request Apr 25, 2026
The first YOLO-NAS submission set (merged in #5) used LibreYOLO's existing
shared letterbox preprocessing, which differs from super_gradients' YOLO-NAS
COCO validation pipeline in two ways the file's own docstring already
flagged ("A later parity pass can tighten this toward the exact SG
processing pipeline"):

- LibreYOLO resizes the longest side to 640; super_gradients resizes it to 636
- LibreYOLO pads at the top-left; super_gradients center-pads to 640x640

Closing those two gaps (plus moving the harness's NMS IoU from 0.6 to 0.7
to match super_gradients' default) recovers ~60% of the 1-point gap to
Deci's published numbers.
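The geometric difference between the two pipelines can be sketched as follows. This is an illustration only, not LibreYOLO's or super_gradients' code: the helper names `letterbox_geometry` and `unletterbox_box` are hypothetical, and the rounding rule is an assumption.

```python
from typing import Tuple


def letterbox_geometry(
    h: int, w: int, resize_to: int, canvas: int = 640, center: bool = True
) -> Tuple[float, Tuple[int, int], Tuple[int, int]]:
    """Return (scale, (new_h, new_w), (pad_top, pad_left)) for one image.

    resize_to is the target length of the longest side; canvas is the
    square output size. Rounding to the nearest pixel is an assumption.
    """
    longest = max(h, w)
    scale = resize_to / longest
    new_h = round(h * resize_to / longest)
    new_w = round(w * resize_to / longest)
    if center:
        pad_top, pad_left = (canvas - new_h) // 2, (canvas - new_w) // 2
    else:
        pad_top, pad_left = 0, 0  # top-left placement: padding goes bottom/right
    return scale, (new_h, new_w), (pad_top, pad_left)


def unletterbox_box(box, scale: float, pad: Tuple[int, int]):
    """Map an (x1, y1, x2, y2) box from canvas pixels back to the original image."""
    pad_top, pad_left = pad
    x1, y1, x2, y2 = box
    return (
        (x1 - pad_left) / scale,
        (y1 - pad_top) / scale,
        (x2 - pad_left) / scale,
        (y2 - pad_top) / scale,
    )


# A 480x640 (h x w) image under both conventions described above.
old = letterbox_geometry(480, 640, resize_to=640, center=False)
new = letterbox_geometry(480, 640, resize_to=636, center=True)
print("640 + top-left:", old)
print("636 + center:  ", new)
```

Whichever convention is used for preprocessing, predicted boxes have to be mapped back with the same scale and padding offsets, otherwise the evaluation shifts every box by a few pixels.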

LibreYOLO upstream issue + working branch (113-yolo-nas-validation-
preprocessing-diverges-from-super_gradients-map-loss) covers the actual
code change. libreyolo_commit on these submissions stays "unknown" until
the fix is merged into LibreYOLO main and a pinned commit is added to
support-matrix.json.

Numbers (COCO val2017, A100 PyTorch FP32):
  yolonas-s  0.4711  (prev 0.4645,  paper 0.475)
  yolonas-m  0.5111  (prev 0.5053,  paper 0.516)
  yolonas-l  0.5184  (prev 0.5119,  paper 0.522)

Remaining gap of ~0.004-0.005 per variant is consistent across s/m/l and likely
reflects FP16 vs FP32 plus minor cv2 vs PIL interpolation differences.
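
The recovery figure can be checked directly from the numbers listed above. The short script below is only an arithmetic check; the helper name `gap_report` is hypothetical, and the paper values are the approximate references quoted in this thread.

```python
# (previous mAP, new mAP, paper reference) as listed in the commit message.
results = {
    "yolonas-s": (0.4645, 0.4711, 0.475),
    "yolonas-m": (0.5053, 0.5111, 0.516),
    "yolonas-l": (0.5119, 0.5184, 0.522),
}


def gap_report(prev: float, new: float, paper: float):
    """Return (gap before, gap after, fraction of the gap recovered)."""
    gap_before = paper - prev
    gap_after = paper - new
    return gap_before, gap_after, (new - prev) / gap_before


fractions = []
for name, (prev, new, paper) in results.items():
    before, after, frac = gap_report(prev, new, paper)
    fractions.append(frac)
    print(f"{name}: gap {before:.4f} -> {after:.4f}, recovered {frac:.0%}")

print(f"mean recovered: {sum(fractions) / len(fractions):.0%}")
```

Per variant this gives roughly 63%, 54%, and 64% of the gap recovered, which averages to about 60%.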

Local: scripts/validate_submission.py and scripts/build_verified_results.py
both exit 0. generated/verified-results.v1.json regenerated.