[ML] Reapply: Run allowlist validation in PyTorch edge pipeline #3007

Merged
edsavage merged 9 commits into elastic:main from edsavage:revert/pr-3005 on Mar 26, 2026

@edsavage
Contributor

@prodsecmachine

prodsecmachine commented Mar 22, 2026

Snyk checks have passed. No issues have been found so far.

| Status | Scan Engine | Critical | High | Medium | Low | Total (0) |
|---|---|---|---|---|---|---|
|  | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |
|  | Licenses | 0 | 0 | 0 | 0 | 0 issues |


Contributor

@valeriy42 left a comment

LGTM

@edsavage force-pushed the revert/pr-3005 branch 3 times, most recently from 592e78c to e0b3d61 on March 25, 2026 00:15
@edsavage
Contributor Author

buildkite run_pytorch_tests

@edsavage
Contributor Author

buildkite run_pytorch_tests

4 similar comments

The Linux build/test Docker images don't include Python 3 (it's only
used during image builds to compile PyTorch, then dropped in the
multi-stage final image). Move the validation to a dedicated pipeline
step using a python:3 agent image, triggered only for
run_pytorch_tests builds.

Made-with: Cursor
The python:3 tag now resolves to Python 3.14, which doesn't have
torch==2.7.1 wheels. Pin to python:3.12 to match the PyTorch
version we build and ship against.

Made-with: Cursor
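
The two commits above (a dedicated step on a Python image, then pinning that image) can be sketched as a small pipeline generator. This is only an illustration under stated assumptions: the `RUN_PYTORCH_TESTS` environment variable, the step label and key, and the `validate_allowlist.py` script name are hypothetical and are not taken from the PR. Only the `python:3.12` image and `torch==2.7.1` come from the commit messages.

```python
import json
import os

# Pinned on purpose: the floating python:3 tag resolves to 3.14, which has no
# torch==2.7.1 wheels.
PYTHON_IMAGE = "python:3.12"


def allowlist_validation_steps(env):
    """Return the validation step only for run_pytorch_tests builds.

    RUN_PYTORCH_TESTS is a hypothetical variable name used for illustration.
    """
    if env.get("RUN_PYTORCH_TESTS", "").lower() != "true":
        return []
    return [
        {
            "label": "Validate PyTorch allowlist",  # hypothetical label
            "key": "validate_pytorch_allowlist",    # hypothetical key
            "command": [
                "pip install torch==2.7.1",
                "python3 validate_allowlist.py",    # hypothetical script name
            ],
            "agents": {"image": PYTHON_IMAGE},
        }
    ]


if __name__ == "__main__":
    # Emit the steps as JSON, ready to be piped to a pipeline upload.
    print(json.dumps({"steps": allowlist_validation_steps(os.environ)}, indent=2))
```

Keeping the image tag in one constant makes the pin easy to revisit when the shipped PyTorch version changes.
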
The step was being killed (exit -1) with no output, most likely due to
OOM or disk exhaustion from installing torch (800MB+) and tracing 27+
models.

Add memory (16G), ephemeral storage (20G), and a 60-minute timeout.
Remove -q from pip install so progress is visible in logs.

Made-with: Cursor
The validation step should fail the build if it detects allowlist
errors; that is the whole point of running it. The upload step
retains soft_fail in case of pipeline upload issues.

Made-with: Cursor
Without a notify/github_commit_status block, the step doesn't
appear as a check on the GitHub PR. Add it so the validation
result is visible alongside the other build/test checks.

Made-with: Cursor
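
The three commits above (resource limits and timeout, hard failure, and a commit status) all change attributes of the same step. A minimal sketch of the combined effect follows. It assumes the `memory` and `ephemeralStorage` agent keys of a Kubernetes-backed agent stack, and the status context string is hypothetical. `timeout_in_minutes`, `soft_fail`, and `notify` with `github_commit_status` are standard Buildkite step attributes.

```python
import json


def harden_validation_step(step, memory="16G", ephemeral_storage="20G"):
    """Return a copy of `step` with limits, a timeout, and a PR check."""
    hardened = dict(step)
    # Assumed agent keys; the exact names depend on the agent stack in use.
    hardened["agents"] = {
        **step.get("agents", {}),
        "memory": memory,
        "ephemeralStorage": ephemeral_storage,
    }
    hardened["timeout_in_minutes"] = 60
    # No soft_fail on this step: an allowlist error must fail the build.
    hardened.pop("soft_fail", None)
    # Publish the result as a commit status so it shows up on the PR.
    hardened["notify"] = [
        {"github_commit_status": {"context": "validate-pytorch-allowlist"}}
    ]
    return hardened


base_step = {
    "label": "Validate PyTorch allowlist",
    # pip runs without -q so install progress is visible in the logs.
    "command": ["pip install torch==2.7.1", "python3 validate_allowlist.py"],
    "agents": {"image": "python:3.12"},
    "soft_fail": True,
}
print(json.dumps(harden_validation_step(base_step), indent=2))
```

Because the limits are parameters, raising them later is a change of arguments, for example `harden_validation_step(base_step, memory="32G", ephemeral_storage="30G")`.
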
Private Elastic models on HuggingFace (elastic/elser-v2, etc.) can't
be downloaded without a HF_TOKEN, causing the validation step to fail
in CI even though the ops are correct.

Change validate_model() to return "pass"/"fail"/"skip": load/trace
failures are reported as skips (warnings), while op validation
failures remain hard failures. Also pass auto_class and
config_overrides through to support BART and QA models.

Made-with: Cursor
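
The three-way result described in this commit can be sketched as follows. Only the name `validate_model` and the "pass"/"fail"/"skip" values come from the commit message. The signature, the injected `trace_ops` callable, and `exit_code` are assumptions made to keep the example runnable without torch, and the `auto_class` and `config_overrides` pass-through is not shown.

```python
from typing import Callable, Iterable, List, Set


def validate_model(
    model_id: str,
    trace_ops: Callable[[str], Iterable[str]],
    allowed_ops: Set[str],
) -> str:
    """Return "pass", "fail", or "skip" for one model."""
    try:
        # Loading or tracing can fail for reasons unrelated to the allowlist,
        # for example a private model that needs an HF_TOKEN.
        ops = set(trace_ops(model_id))
    except Exception as exc:
        print(f"WARNING: skipping {model_id}: {exc}")
        return "skip"
    forbidden = sorted(ops - allowed_ops)
    if forbidden:
        print(f"ERROR: {model_id} uses ops outside the allowlist: {forbidden}")
        return "fail"
    return "pass"


def exit_code(results: List[str]) -> int:
    """Only hard failures fail the build; skips are warnings."""
    return 1 if "fail" in results else 0
```

With this split, a download failure for a private model produces a warning and exit code 0, while a single disallowed op still fails the step.
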
@edsavage
Contributor Author

buildkite run_pytorch_tests

facebook/bart-large-mnli is 1.63GB. Loading it into memory for
tracing, after torch and 30 other models, exhausted the 16GB limit.
Bump to 32GB memory and 30GB ephemeral storage.

Made-with: Cursor
When validating 30+ models sequentially, the HF model weights
accumulate in memory. Explicitly delete the original model,
tokenizer, and inputs after tracing, and gc.collect() after each
validation to release memory promptly. This should allow the
validation step to complete within 32GB for the full model set
including facebook/bart-large-mnli (1.63GB).

Made-with: Cursor
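
The cleanup pattern from this commit (delete the large objects after tracing, then call `gc.collect()` after each validation) can be illustrated without torch. In this sketch `FakeModel` is a hypothetical stand-in for a loaded model, and the weak references exist only to show that each object is actually released. The real script may differ.

```python
import gc
import weakref

ALLOWED_OPS = {"aten::add", "aten::mul"}  # illustrative subset


class FakeModel:
    """Hypothetical stand-in for a large model held in memory."""

    def __init__(self, name):
        self.name = name
        self.weights = bytearray(1024 * 1024)
        self.parent = self  # reference cycle: refcounting alone cannot free it


def validate_one(name, tracker):
    model = FakeModel(name)
    tracker.append(weakref.ref(model))  # observe the object without keeping it alive
    traced_ops = {"aten::add"}          # stand-in for the ops collected by tracing
    del model                           # drop the strong reference once tracing is done
    return "pass" if traced_ops <= ALLOWED_OPS else "fail"


def validate_all(names):
    tracker, results = [], []
    for name in names:
        results.append(validate_one(name, tracker))
        gc.collect()  # reclaim cyclic garbage before loading the next model
    freed = [ref() is None for ref in tracker]
    return results, freed
```

Collecting after every model keeps peak memory close to the size of the largest single model rather than the sum of all models traced so far.
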
@edsavage
Contributor Author

buildkite run_pytorch_tests

@edsavage merged commit 48a1e66 into elastic:main on Mar 26, 2026
13 checks passed
3 participants