Add vision genai inference path for multi-file VLM evaluation#2488
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
Adds a genai-based vision inference path to OnnxEvaluator so that olive run can evaluate multi-file ONNX vision-language models (e.g., Qwen3-VL) that ship with a genai_config.json. The dispatcher in _evaluate_onnx_accuracy now auto-detects whether the model is a genai VLM (by inspecting genai_config.json for a vision field) and routes to a new _inference_vision_genai method that drives generation through onnxruntime_genai's multimodal processor, generator, and tokenizer.
Changes:
- Extend the vision-metric branch of
_evaluate_onnx_accuracyto auto-detect genai VLMs viagenai_config.jsonand route accordingly. - Implement
_inference_vision_genai, which builds anog.Model, formats chat-style multimodal prompts, runs autoregressive generation per sample, and returns decoded predictions plus targets. - Preserve existing behavior for single-file VQA ONNX models by falling back to
_inference_vision.
01d07cb to
c032caf
Compare
Adds _inference_vision_genai method to OnnxEvaluator that uses onnxruntime-genai for vision-language models (e.g., Qwen3-VL) with multi-file ONNX architectures (vision.onnx, text.onnx, embedding.onnx). The method is auto-detected when genai_config.json exists and contains a 'vision' field in the model config. This mirrors the existing auto-detection pattern used for speech models (whisper, nemotron_speech). For single-file ONNX VQA models, the existing _inference_vision path (classification-style single forward pass) is still used. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously, when a component (e.g., pre_process_data) specified a task type, only that same component's override was applied from the task map. This meant the vision-vqa dataloader override (vision_vqa_dataloader with custom collate_fn for PIL images) was never applied since it was a different component than the one specifying the task. Now, when any component specifies a task type, ALL component overrides from the task_type_components_map are applied. This ensures the custom dataloader with PIL-safe collation is used for vision tasks. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Simplify dispatch logic: use single boolean flag instead of duplicated fallback branches - Honor execution_providers parameter: map user-specified EPs to og.Config providers instead of only checking device - Use TemporaryDirectory instead of per-file NamedTemporaryFile to avoid I/O overhead and file leak risk - Add comment clarifying pred/target alignment when image is None Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
onnxruntime-genai uses CPU by default when no provider is appended. CPUExecutionProvider is not a recognized genai provider name, so skip it rather than trying to map it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
onnxruntime-genai uses short provider names (e.g., 'cuda') not ORT-style
names ('CUDAExecutionProvider'). Match the pattern used by the existing
speech genai methods: only check device field for provider selection.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The genai_config.json may specify max_length equal to the full context window (e.g., 262144) which causes near-infinite generation for VQA tasks where answers are typically 1-10 tokens. Cap at 128 tokens. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
max_length in genai is total sequence length (input + output). Vision inputs include image tokens which can be 200+ tokens, so 128 was too small. Use 4096 which accommodates input tokens plus short VQA answers while still preventing runaway generation from 262K context windows. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove unused params (metric, execution_providers) from _inference_vision_genai signature - Remove unused genai_config variable (was loaded but not used) - Document that device drives GPU/CPU selection in genai - Rename local var to genai_cfg to avoid shadowing - Run ruff format Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Allow passing a system_prompt parameter in pre_process config to guide model responses (e.g., 'reply with only the option number'). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add options_col param to format multiple-choice options into the question - Extract leading number from model responses (e.g. '1. D' -> '1') - Add debug logging to vision_eval_debug.jsonl in model dir Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Extract _load_genai_config helper to deduplicate config detection - Remove hardcoded number extraction (task-specific, not generic) - Remove debug logging (was dev instrumentation) - Use 'from e' instead of 'from None' in ImportError - Add missing docstring params for options_col and system_prompt Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When options_col is specified in pre_process config, set extract_number=True in the input dict. The evaluator uses this flag to extract the leading number from model responses (e.g. '1. D' -> '1'), which is needed for correct exact_match scoring on multiple-choice benchmarks like AI2D. This is not applied for OCR/ChartQA tasks where numeric predictions are valid. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
20c01d2 to
d9db22c
Compare
… tests - Use 'vision' in dict check instead of bool() to handle empty vision objects - Add TestOnnxEvaluatorGenaiVisionDetection test class with 8 tests covering: - _load_genai_config helper (present/missing) - Vision detection logic (with vision, empty vision, no vision, no config) - Dispatch routing (genai vs standard vision path)
d9db22c to
fa7a239
Compare
Comment on lines
+586
to
+594
| def _load_genai_config(model: ONNXModelHandler) -> Optional[dict]: | ||
| """Load genai_config.json from the model directory, or return None if not found.""" | ||
| genai_config_path = Path(model.model_path).parent / "genai_config.json" | ||
| if not genai_config_path.exists(): | ||
| return None | ||
| import json | ||
|
|
||
| with genai_config_path.open() as f: | ||
| return json.load(f) |
Comment on lines
+836
to
+842
| import json | ||
| import re | ||
| import tempfile | ||
|
|
||
| from PIL import Image | ||
|
|
||
| model_dir = str(Path(model.model_path).parent) |
Comment on lines
+887
to
+890
| # Ensure PIL Image | ||
| if not isinstance(pil_image, Image.Image): | ||
| pil_image = Image.open(pil_image).convert("RGB") | ||
|
|
…andle leak - Wrap genai_config.json parsing in try/except JSONDecodeError with filepath in message - Guard PIL import with ImportError and helpful install message - Use context manager for Image.open() to close file handle promptly
xiaoyu-work
approved these changes
Jun 2, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Describe your changes
Adds
_inference_vision_genaimethod toOnnxEvaluatorthat enablesolive runto evaluate multi-file ONNX vision-language models (e.g., Qwen3-VL) usingonnxruntime-genai.Problem
Vision-language models exported via
onnxruntime-genaiproduce multiple ONNX files (vision.onnx,text.onnx,embedding.onnx) with agenai_config.json. The existing_inference_visionmethod only supports single-file ONNX models with classification-style forward pass. This prevented usingolive run --configfor evaluation of autoregressive VLMs.Solution
genai_config.jsoncontains avisionfield_inference_vision_genaimethod which usesog.Model,multimodal_processor, andog.Generatorfor autoregressive text generation_inference_text_genaifor Whisper,_inference_text_genai_streamingfor Nemotron)_inference_visionfor single-file ONNX VQA modelsUsage
{ "input_model": { "type": "OnnxModel", "model_path": "path/to/models", "onnx_file_name": "text.onnx" } }The evaluator will auto-detect
genai_config.jsonin the model directory and use the genai path.Related