Add vision evaluation for Qwen3-VL-2B-Instruct on AI2D #434
Open
jiafatom wants to merge 1 commit into
Contributor
Pull request overview
Adds an AI2D (science diagram VQA) evaluation workflow for the Qwen3-VL-2B-Instruct ONNX recipe, including a standalone runner and an olive run-compatible config that reports exact_match using Olive’s vision metrics.
Changes:
- Add a standalone evaluation script (evaluate.py) for running AI2D exact-match evaluation (with optional PyTorch comparison).
- Add an Olive evaluation config (qwen3vl_eval_ai2d.json) plus an Olive post-processing user script (eval_user_script.py).
- Add evaluation docs and dependency list under builtin/eval/.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| Qwen-Qwen3-VL-2B-Instruct/builtin/eval/requirements.txt | Adds Python dependencies needed to run the standalone and Olive-based evaluation. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/eval/README.md | Documents how to run the evaluation standalone and via Olive. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/eval/qwen3vl_eval_ai2d.json | Olive config wiring dataset loading, preprocessing, metric selection, and post-processing. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/eval/evaluate.py | Standalone evaluation loop for AI2D with ORT GenAI (and optional PyTorch baseline). |
| Qwen-Qwen3-VL-2B-Instruct/builtin/eval/eval_user_script.py | Implements Olive post-processing to normalize model outputs for exact-match scoring. |
Comment on lines +17 to +28

    import argparse
    import io
    import json
    import re
    import sys
    import time
    from pathlib import Path

    import onnxruntime_genai as og
    from datasets import load_dataset
    from PIL import Image

Comment on lines +41 to +44

    # Try to find a single digit answer (multiple choice)
    m = re.search(r"\b([1-4])\b", text)
    if m:
        return m.group(1)

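To make the behavior of the pattern in this snippet concrete, here is a minimal, self-contained sketch. The regular expression is copied from the diff; the wrapper name `extract_choice` and the sample strings are hypothetical and are not taken from evaluate.py.

```python
import re


def extract_choice(text: str):
    """Return the first standalone digit 1-4 in `text`, or None if there is none.

    Hypothetical helper: only the regex is taken from the diff above.
    """
    m = re.search(r"\b([1-4])\b", text)
    return m.group(1) if m else None


# A digit surrounded by non-word characters is matched.
print(extract_choice("The answer is 3."))              # 3
print(extract_choice("(2)"))                           # 2

# \b requires a word boundary on both sides, so digits that are part of
# a longer token are skipped.
print(extract_choice("There are 12 parts"))            # None
print(extract_choice("3rd option"))                    # None

# The first standalone digit wins, even when it is not the final answer.
print(extract_choice("In figure 1, the answer is 3"))  # 1
```

The last case shows why taking the first match may be worth a second look: any earlier standalone digit in the generated text is returned in place of the intended choice.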
Comment on lines +11 to +14

    """Post-process Qwen3-VL model output to extract answer text.

    For multiple-choice VQA (AI2D), extracts the first digit (1-4) from output.
    For open-ended VQA, returns the decoded text directly.
Add evaluation scripts for Qwen3-VL-2B-Instruct using Olive's new vision metrics (exact_match) on the AI2D science diagram QA benchmark.

New files:
- eval/evaluate.py: Standalone evaluation script with ONNX + optional PyTorch
- eval/eval_user_script.py: Olive-compatible post-processing function
- eval/qwen3vl_eval_ai2d.json: Olive config for running evaluation via olive run
- eval/requirements.txt: Dependencies
- eval/README.md: Usage instructions

Related Olive PR: microsoft/Olive#2474

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
6924c18 to bd41311
Summary
Add evaluation scripts for Qwen3-VL-2B-Instruct, computing exact_match accuracy on the AI2D science diagram QA benchmark using Olive's new vision metrics.
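As a rough illustration of what an exact-match accuracy score measures, the sketch below compares predicted answers with reference answers one by one and reports the fraction that agree. The function name `exact_match_accuracy` and the whitespace and case normalization are assumptions made for this example; Olive's exact_match metric may normalize differently.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference after light normalization.

    Assumption: answers are compared after stripping whitespace and
    lowercasing. This is illustrative, not Olive's implementation.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


# Hypothetical AI2D-style answers: one choice index per question.
preds = ["1", "3", "2", "4"]
refs = ["1", "2", "2", "4"]
print(exact_match_accuracy(preds, refs))  # 0.75
```

Because the score is all-or-nothing per sample, the post-processing step that reduces free-form model output to a single choice has a direct effect on the reported number.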
New Files
- eval/evaluate.py: Standalone evaluation script with ONNX inference via onnxruntime-genai (+ optional PyTorch comparison)
- eval/eval_user_script.py: Olive-compatible post-processing function for use with olive run
- eval/qwen3vl_eval_ai2d.json: Olive config for running evaluation
- eval/requirements.txt: Dependencies
- eval/README.md: Usage instructions

Usage
Metrics
exact_match

Related