[ML] Reapply: Run allowlist validation in PyTorch edge pipeline by edsavage · Pull Request #3007 · elastic/ml-cpp

edsavage · 2026-03-22T20:18:46Z

Summary

Reverts "[ML] Revert: Run allowlist validation in PyTorch edge pipeline" (#3005), reapplying the original "[ML] Run allowlist validation in PyTorch edge pipeline" (#2989).

Made with Cursor

prodsecmachine · 2026-03-22T20:18:59Z

✅ Snyk checks have passed. No issues have been found so far.

Status	Scan Engine	Critical	High	Medium	Low	Total (0)
✅	Open Source Security	0	0	0	0	0 issues
✅	Licenses	0	0	0	0	0 issues


valeriy42

LGTM

edsavage · 2026-03-25T01:25:37Z

buildkite run_pytorch_tests

edsavage · 2026-03-25T02:41:23Z

buildkite run_pytorch_tests

edsavage · 2026-03-25T03:20:54Z

buildkite run_pytorch_tests

edsavage · 2026-03-25T03:33:09Z

buildkite run_pytorch_tests

edsavage · 2026-03-25T03:42:18Z

buildkite run_pytorch_tests

edsavage · 2026-03-25T22:01:30Z

buildkite run_pytorch_tests

Reapply "[ML] Run allowlist validation in PyTorch edge pipeline (elastic#2989)" (elastic#3005)

This reverts commit 9cc49ff.

The Linux build/test Docker images don't include Python 3 (it's only used during image builds to compile PyTorch, then dropped in the multi-stage final image). Move the validation to a dedicated pipeline step using a python:3 agent image, triggered only for run_pytorch_tests builds.

Made-with: Cursor

The python:3 tag now resolves to Python 3.14, which doesn't have torch==2.7.1 wheels. Pin to python:3.12 to match the PyTorch version we build and ship against.

Made-with: Cursor

The step was being killed (exit -1) with no output, most likely because of OOM or disk exhaustion from installing torch (800MB+) and tracing 27+ models. Add memory (16G), ephemeral storage (20G), and a 60-minute timeout. Remove -q from pip install so progress is visible in logs.

Made-with: Cursor

The validation step should fail the build if it detects allowlist errors; that is the whole point of running it. The upload step retains soft_fail in case of pipeline upload issues.

Made-with: Cursor

Without a notify/github_commit_status block, the step doesn't appear as a check on the GitHub PR. Add it so the validation result is visible alongside the other build/test checks.

Made-with: Cursor
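The commits above describe a dedicated validation step with a pinned image, explicit resources, a timeout, hard-failure semantics, and a GitHub commit status. The following is a minimal sketch of how such a step could be generated as JSON from Python. It is not the actual pipeline code from this PR: the function name, label, key, status context, script name, and the `agents` field names (`image`, `memory`, `ephemeralStorage`) are assumptions for illustration, and the real install command likely pulls in more than torch. `timeout_in_minutes`, `soft_fail`, and `notify`/`github_commit_status` are standard Buildkite step attributes.

```python
import json


def validation_steps(run_pytorch_tests: bool,
                     memory: str = "16G",
                     ephemeral_storage: str = "20G") -> list:
    """Return the allowlist validation step, or no steps for ordinary builds."""
    if not run_pytorch_tests:
        # Only builds triggered for run_pytorch_tests get the extra step.
        return []
    return [{
        "label": "Validate PyTorch allowlist",      # hypothetical label
        "key": "validate_pytorch_allowlist",        # hypothetical key
        "command": [
            "pip install torch==2.7.1",             # no -q, so progress shows in logs
            "python3 validate_allowlist.py",        # hypothetical script name
        ],
        "agents": {                                  # field names are assumptions
            "image": "python:3.12",                  # pinned: python:3 lacks torch 2.7.1 wheels
            "memory": memory,
            "ephemeralStorage": ephemeral_storage,
        },
        "timeout_in_minutes": 60,
        "soft_fail": False,                          # allowlist errors fail the build
        "notify": [
            {"github_commit_status": {"context": "PyTorch allowlist validation"}}
        ],
    }]


if __name__ == "__main__":
    print(json.dumps({"steps": validation_steps(True)}, indent=2))
```

Keeping memory and storage as parameters makes it easy to raise the limits later without touching the rest of the step definition.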

Private Elastic models on Hugging Face (elastic/elser-v2, etc.) can't be downloaded without an HF_TOKEN, causing the validation step to fail in CI even though the ops are correct. Change validate_model() to return "pass"/"fail"/"skip": load/trace failures are reported as skips (warnings), while op validation failures remain hard failures. Also pass auto_class and config_overrides through to support BART and QA models.

Made-with: Cursor
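The three-way result described above can be sketched as follows. This is an illustration under assumptions, not the script from this PR: the `load_and_trace` callback, the `exit_code` helper, the placeholder allowlist, and the fake model names are all hypothetical, and the `auto_class` and `config_overrides` parameters are omitted for brevity. The point it shows is that a load or trace error becomes a "skip" warning, while an op outside the allowlist is a "fail" that drives a non-zero exit code.

```python
from typing import Callable, Iterable, Set

# Placeholder allowlist; the real list lives in the ml-cpp sources.
ALLOWED_OPS: Set[str] = {"aten::linear", "aten::softmax"}


def validate_model(name: str,
                   load_and_trace: Callable[[str], Iterable[str]],
                   allowed: Set[str] = ALLOWED_OPS) -> str:
    """Return "pass", "fail", or "skip" for one model."""
    try:
        ops = set(load_and_trace(name))
    except Exception as exc:
        # Download, load, or trace problems are warnings, not failures.
        print(f"SKIP {name}: could not load or trace ({exc})")
        return "skip"
    forbidden = sorted(ops - allowed)
    if forbidden:
        print(f"FAIL {name}: ops not in allowlist: {forbidden}")
        return "fail"
    print(f"PASS {name}")
    return "pass"


def exit_code(results: Iterable[str]) -> int:
    """Only op validation failures make the step fail; skips do not."""
    return 1 if "fail" in list(results) else 0


def fake_trace(name: str) -> Iterable[str]:
    """Stand-in for downloading and tracing a model (hypothetical names)."""
    if name == "private/model":
        raise PermissionError("HF_TOKEN required")
    if name == "bad/model":
        return ["aten::linear", "aten::bogus_op"]  # made-up disallowed op
    return ["aten::linear", "aten::softmax"]


if __name__ == "__main__":
    names = ["public/model", "private/model", "bad/model"]
    results = [validate_model(n, fake_trace) for n in names]
    print("exit code:", exit_code(results))
```

Passing the loader in as a callback keeps the pass/fail/skip decision separate from the download and tracing logic, so the decision can be exercised without network access.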

edsavage · 2026-03-25T22:51:07Z

buildkite run_pytorch_tests

facebook/bart-large-mnli is 1.63GB; loading it into memory for tracing after torch and 30 other models exhausted the 16GB limit. Bump to 32GB memory and 30GB ephemeral storage.

Made-with: Cursor

When validating 30+ models sequentially, the HF model weights accumulate in memory. Explicitly delete the original model, tokenizer, and inputs after tracing, and call gc.collect() after each validation to release memory promptly. This should allow the validation step to complete within 32GB for the full model set including facebook/bart-large-mnli (1.63GB).

Made-with: Cursor
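A minimal sketch of the cleanup pattern described above, assuming a sequential loop over model names. `FakeModel`, `trace_ops`, and `validate_all` are hypothetical stand-ins (the real code would load Hugging Face weights and trace them with torch); the weak references exist only to show that each model is actually released before the next one is loaded.

```python
import gc
import weakref


class FakeModel:
    """Stand-in for a large model; real code would hold the loaded weights."""

    def __init__(self, name):
        self.name = name
        self.weights = bytearray(1024 * 1024)  # 1 MiB placeholder


def trace_ops(model, inputs):
    """Stand-in for tracing; returns op names without keeping the model alive."""
    return {"aten::linear"}


def validate_all(names, refs=None):
    """Validate models one at a time, releasing each before loading the next."""
    results = {}
    for name in names:
        model = FakeModel(name)
        tokenizer = object()      # placeholder for a tokenizer
        inputs = [0, 1, 2]        # placeholder for example inputs
        if refs is not None:
            refs.append(weakref.ref(model))

        traced_ops = trace_ops(model, inputs)

        # Drop the heavy objects as soon as tracing is done.
        del model, tokenizer, inputs

        results[name] = "pass" if traced_ops else "skip"

        del traced_ops
        gc.collect()              # also reclaims anything caught in reference cycles
    return results


if __name__ == "__main__":
    tracked = []
    print(validate_all(["model-a", "model-b"], tracked))
    print("all released:", all(r() is None for r in tracked))
```

In CPython the `del` statements already free objects whose reference count drops to zero; the explicit `gc.collect()` additionally covers objects kept alive only by reference cycles.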

edsavage · 2026-03-25T23:51:58Z

buildkite run_pytorch_tests

valeriy42 approved these changes Mar 23, 2026


edsavage force-pushed the revert/pr-3005 branch 3 times, most recently from 592e78c to e0b3d61 on March 25, 2026 00:15

edsavage added >build >non-issue :ml v9.4.0 labels Mar 25, 2026

edsavage added 7 commits March 26, 2026 11:49

Reapply "[ML] Run allowlist validation in PyTorch edge pipeline (elastic#2989)" (elastic#3005)

4a83394

[ML] Pin validation step to python:3.12 for torch 2.7.1 compatibility

821afc7


[ML] Make PyTorch allowlist validation a hard failure

26715e1


[ML] Add GitHub commit status for PyTorch validation step

9053414


edsavage force-pushed the revert/pr-3005 branch from ef27199 to e1ff7da on March 25, 2026 22:50

edsavage added 2 commits March 26, 2026 12:47

[ML] Increase validation step resources for large models

cbc8201


edsavage merged commit 48a1e66 into elastic:main Mar 26, 2026
13 checks passed

edsavage merged 9 commits into elastic:main from edsavage:revert/pr-3005


3 participants
