[ML] Add ML_SKIP_MODEL_VALIDATION bypass for graph validation#3013

Open
edsavage wants to merge 3 commits into elastic:main from edsavage:feature/model-validation-kill-switch

Conversation

@edsavage
Contributor

@edsavage edsavage commented Mar 26, 2026

Summary

  • Adds an environment variable escape hatch to bypass TorchScript model graph validation
  • When ML_SKIP_MODEL_VALIDATION=true is set in the process environment before pytorch_inference starts, the allowlist check is skipped and a warning is logged
  • Provides a zero-rebuild way to disable validation in an emergency — an operator can set the env var in the deployment configuration (systemd, Docker, Kubernetes pod spec) without needing a new ml-cpp build or Elasticsearch release
  • Default behaviour (validation enabled) is unchanged
  • Only the exact value "true" activates the bypass; any other value or unset means validation runs normally
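Since only the exact string "true" activates the bypass, the gate reduces to a strict equality check. A minimal Python sketch of that logic follows — the real check lives in C++ in bin/pytorch_inference/Main.cc, and this is only an illustration of the described behaviour:

```python
import os

def skip_model_validation() -> bool:
    """Return True only when ML_SKIP_MODEL_VALIDATION is exactly "true".

    Mirrors the behaviour described in the PR summary: unset, empty,
    "TRUE", "1", etc. all leave validation enabled.
    """
    return os.environ.get("ML_SKIP_MODEL_VALIDATION") == "true"
```

Anything other than a byte-for-byte match keeps the validator on, which makes accidental activation (e.g. via a truthy-looking value) unlikely.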

Test plan

  • Built and ran CModelGraphValidatorTest suite locally — all tests pass
  • Integration test: ML_SKIP_MODEL_VALIDATION=true bypasses validation for a malicious model (PASS)
  • Integration test: ML_SKIP_MODEL_VALIDATION=false still validates normally (PASS)
  • Integration test: benign model passes validation as before (PASS)
  • CI passes

Provides an emergency escape hatch to bypass TorchScript model graph
validation without requiring a code change or rebuild. When
ML_SKIP_MODEL_VALIDATION is set to the exact value "true", the
pytorch_inference process skips the graph validator and logs a warning.

Elasticsearch can set this environment variable for the native
process via its ML settings, allowing operators to unblock model
deployments immediately if the validator incorrectly rejects a
legitimate model.

Made-with: Cursor
@prodsecmachine

prodsecmachine commented Mar 26, 2026

Snyk checks have passed. No issues have been found so far.

| Scan Engine | Critical | High | Medium | Low | Total |
|---|---|---|---|---|---|
| Open Source Security | 0 | 0 | 0 | 0 | 0 issues |
| Licenses | 0 | 0 | 0 | 0 | 0 issues |


Extends the evil model integration test to verify that:
- ML_SKIP_MODEL_VALIDATION=true bypasses graph validation (with
  warning logged)
- ML_SKIP_MODEL_VALIDATION=false still validates (only exact "true"
  activates the bypass)

Made-with: Cursor
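A test harness along these lines can drive the binary with and without the bypass. The CLI flags and helper names below are illustrative, not the actual invocation used by test_pytorch_inference_evil_models.py:

```python
import os
import subprocess

def bypass_env(skip_validation: bool) -> dict:
    """Build the child-process environment for pytorch_inference."""
    env = os.environ.copy()
    if skip_validation:
        env["ML_SKIP_MODEL_VALIDATION"] = "true"
    else:
        # Remove the variable entirely; only the exact value "true"
        # would activate the bypass anyway.
        env.pop("ML_SKIP_MODEL_VALIDATION", None)
    return env

def run_pytorch_inference(binary: str, model_path: str,
                          skip_validation: bool = False):
    """Launch the binary against a model (flags here are hypothetical)."""
    return subprocess.run([binary, "--restore", model_path],
                          env=bypass_env(skip_validation),
                          capture_output=True, text=True)
```

The test then asserts that a malicious model is rejected with the default environment and accepted (with a warning in stderr) when the bypass is set.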
@edsavage
Contributor Author

edsavage commented Mar 26, 2026

Bypass Deployment Guide

The ML_SKIP_MODEL_VALIDATION=true environment variable is an operator-level emergency lever — it doesn't require a code change or release. It must be set in the process environment before pytorch_inference starts.

Who sets it and how

| Deployment | Operator | How to set |
|---|---|---|
| Self-managed (bare metal/VM) | Cluster admin | `export ML_SKIP_MODEL_VALIDATION=true` before starting ES, or add to `/etc/default/elasticsearch` / systemd unit override |
| Self-managed (Docker) | Cluster admin | `docker run -e ML_SKIP_MODEL_VALIDATION=true ...` |
| Self-managed (Kubernetes) | Cluster admin | Add to pod spec `env:` field or ConfigMap |
| Elastic Cloud managed | Elastic Cloud ops team | Deployment configuration |
| Serverless | Elastic platform/SRE team | Kubernetes pod spec on ML nodes |
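For the self-managed cases, setting and then verifying the variable can look like the following sketch (unit and image names are illustrative):

```shell
# systemd (bare metal/VM): add an override, then restart Elasticsearch
#   systemctl edit elasticsearch
#   [Service]
#   Environment="ML_SKIP_MODEL_VALIDATION=true"

# Docker: pass the variable at container start
#   docker run -e ML_SKIP_MODEL_VALIDATION=true <elasticsearch-image>

# Verify that a child process (as pytorch_inference would be) sees the value:
ML_SKIP_MODEL_VALIDATION=true sh -c 'echo "$ML_SKIP_MODEL_VALIDATION"'
```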

Important notes

  • This is not a user-facing setting — it requires infrastructure access
  • Only the exact value "true" activates the bypass; any other value or unset means validation runs normally
  • When active, a WARN log is emitted: "Model graph validation SKIPPED" — this is visible in ES node logs
  • The env var is inherited by all pytorch_inference child processes on the node, so it disables validation for all models on that node

@edsavage edsavage requested review from Copilot and valeriy42 and removed request for Copilot March 26, 2026 03:47
@edsavage edsavage changed the title [ML] Add ML_SKIP_MODEL_VALIDATION kill switch for graph validation [ML] Add ML_SKIP_MODEL_VALIDATION bypass for graph validation Mar 26, 2026
@edsavage edsavage requested a review from Copilot March 26, 2026 21:17

Copilot AI left a comment


Pull request overview

Adds an environment-variable “kill switch” to bypass TorchScript model graph validation in pytorch_inference, plus a Python integration script intended to exercise validator behavior (including the bypass).

Changes:

  • Add ML_SKIP_MODEL_VALIDATION=true env-var check to skip verifySafeModel() and emit a warning.
  • Add a standalone Python script that generates known-malicious TorchScript models and runs pytorch_inference to confirm rejection/bypass behavior.

Reviewed changes

Copilot reviewed 1 out of 2 changed files in this pull request and generated 3 comments.

File Description
bin/pytorch_inference/Main.cc Adds the ML_SKIP_MODEL_VALIDATION env-var bypass around verifySafeModel() with warning logging.
test/test_pytorch_inference_evil_models.py Adds a standalone integration script to generate “evil” models and validate expected pytorch_inference behavior (including bypass).


```python
        generate_model(spec["class"], model_path)
        print(f"  Model generated: {model_path.name} ({model_path.stat().st_size} bytes)")
    except Exception as e:
        print(f"  SKIP: could not generate model: {e}")
```

Copilot AI Mar 26, 2026


If TorchScript scripting fails for a model (e.g., due to Torch version differences), this test currently prints SKIP and continues, which can result in an overall PASS without having exercised the validator at all. For a security regression test, it would be safer to treat model-generation failures as a test failure (or at least fail when the expected-rejected models can’t be generated).

Suggested change

```diff
-        print(f"  SKIP: could not generate model: {e}")
+        print(f"  FAIL: could not generate model: {e}")
+        all_passed = False
```
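Applied across the whole generation loop, the fail-fast approach could look like this sketch — `generate_model` and the spec shape are placeholders standing in for the script's actual helpers:

```python
from pathlib import Path

def generate_models_or_fail(specs, out_dir, generate_model):
    """Generate every expected-malicious model up front; any generation
    failure fails the run instead of being silently skipped."""
    failures = []
    for spec in specs:
        model_path = Path(out_dir) / f"{spec['name']}.pt"
        try:
            generate_model(spec["class"], model_path)
        except Exception as exc:
            failures.append(f"{spec['name']}: {exc}")
    if failures:
        raise RuntimeError("model generation failed: " + "; ".join(failures))
```

This way a Torch version mismatch surfaces as a hard failure instead of an overall PASS that never exercised the validator.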

Comment on lines +216 to +219
```python
    raise FileNotFoundError(
        "Could not find pytorch_inference binary. "
        "Build from the feature/harden_pytorch_inference branch, or pass --binary."
    )
```

Copilot AI Mar 26, 2026


This script’s requirements/error message still references building from the "feature/harden_pytorch_inference" branch. That’s likely to become stale/confusing once this change is on main; consider updating the wording to refer to a built pytorch_inference binary (or a minimum version) rather than a specific branch name.

Comment on lines +24 to +25
Requires: torch, a built pytorch_inference binary with graph validation
(feature/harden_pytorch_inference branch or later).

Copilot AI Mar 26, 2026


The docstring says this requires a binary built from the "feature/harden_pytorch_inference" branch. Since this file is being added to the mainline repo, consider updating this to a stable requirement (e.g., “a pytorch_inference binary built from this repo at/after ”) to avoid confusion for future readers.

Suggested change

```diff
-Requires: torch, a built pytorch_inference binary with graph validation
-(feature/harden_pytorch_inference branch or later).
+Requires: torch, and a built pytorch_inference binary from this repository
+with graph validation enabled (i.e., including the
+CModelGraphValidator checks).
```

- Update stale branch references to generic requirements
- Treat model generation failures as test failures, not skips —
  for security regression tests, silently skipping is unsafe

Made-with: Cursor
Contributor

@valeriy42 valeriy42 left a comment


I see the reason for wanting an escape hatch, but setting an environment variable is not a practical solution. You need a cluster setting and a --skipValidation flag on the pytorch_inference process.
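The reviewer's suggested CLI alternative could be sketched as follows. The flag name is taken from the comment; everything else is hypothetical, and the real binary parses its arguments in C++:

```python
import argparse

# Illustrative argument parser for a hypothetical --skipValidation flag.
parser = argparse.ArgumentParser(prog="pytorch_inference")
parser.add_argument("--skipValidation", action="store_true",
                    help="Skip TorchScript model graph validation (emergency use only)")

args = parser.parse_args(["--skipValidation"])
# With the flag present args.skipValidation is True; it defaults to False.
```

A flag passed by Elasticsearch when it spawns the native process would make the bypass auditable per deployment rather than node-wide via the environment.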



4 participants