Add vision evaluation for Qwen3-VL-2B-Instruct on AI2D by jiafatom · Pull Request #434 · microsoft/olive-recipes

jiafatom · 2026-05-27T18:39:18Z

Summary

Add evaluation scripts for Qwen3-VL-2B-Instruct, computing exact_match accuracy on the AI2D science diagram QA benchmark using Olive's new vision metrics.

New Files

eval/evaluate.py — Standalone evaluation script with ONNX inference via onnxruntime-genai (+ optional PyTorch comparison)
eval/eval_user_script.py — Olive-compatible post-processing function for use with olive run
eval/qwen3vl_eval_ai2d.json — Olive config for running evaluation
eval/requirements.txt — Dependencies
eval/README.md — Usage instructions

Usage

# Standalone (100 samples, default)
python eval/evaluate.py --model_path cpu_and_mobile/models

# Via Olive
olive run --config eval/qwen3vl_eval_ai2d.json

Metrics

Metric	Description	Task
`exact_match`	Case-insensitive string equality	VQA (AI2D)
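
As a rough illustration of the metric described in the table, exact_match can be sketched as a case-insensitive comparison averaged over all samples. This is a minimal sketch, not Olive's implementation: the function names are hypothetical, and stripping surrounding whitespace before comparing is an assumption that the table does not state.

```python
from typing import Sequence


def exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive string equality.

    Assumption: surrounding whitespace is stripped before comparing.
    """
    return prediction.strip().casefold() == reference.strip().casefold()


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(predictions)
```

With this definition, a prediction of "Photosynthesis" matches a reference of "photosynthesis", while "2" against "4" does not.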

Pull request overview

Adds an AI2D (science diagram VQA) evaluation workflow for the Qwen3-VL-2B-Instruct ONNX recipe, including a standalone runner and a config compatible with olive run that reports exact_match using Olive's vision metrics.

Changes:

Add a standalone evaluation script (evaluate.py) for running AI2D exact-match evaluation (with optional PyTorch comparison).
Add an Olive evaluation config (qwen3vl_eval_ai2d.json) plus an Olive post-processing user script (eval_user_script.py).
Add evaluation docs and dependency list under builtin/eval/.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.


File	Description
Qwen-Qwen3-VL-2B-Instruct/builtin/eval/requirements.txt	Adds Python dependencies needed to run the standalone and Olive-based evaluation.
Qwen-Qwen3-VL-2B-Instruct/builtin/eval/README.md	Documents how to run the evaluation standalone and via Olive.
Qwen-Qwen3-VL-2B-Instruct/builtin/eval/qwen3vl_eval_ai2d.json	Olive config wiring dataset loading, preprocessing, metric selection, and post-processing.
Qwen-Qwen3-VL-2B-Instruct/builtin/eval/evaluate.py	Standalone evaluation loop for AI2D with ORT GenAI (and optional PyTorch baseline).
Qwen-Qwen3-VL-2B-Instruct/builtin/eval/eval_user_script.py	Implements Olive post-processing to normalize model outputs for exact-match scoring.


+import argparse
+import io
+import json
+import re
+import sys
+import time
+from pathlib import Path
+
+import onnxruntime_genai as og
+from datasets import load_dataset
+from PIL import Image
+


+    # Try to find a single digit answer (multiple choice)
+    m = re.search(r"\b([1-4])\b", text)
+    if m:
+        return m.group(1)


+    """Post-process Qwen3-VL model output to extract answer text.
+
+    For multiple-choice VQA (AI2D), extracts the first digit (1-4) from output.
+    For open-ended VQA, returns the decoded text directly.
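
The two behaviours described in the docstring can be sketched together as follows. This is a minimal sketch under stated assumptions, not the code from this PR: the function name `extract_answer` and the `multiple_choice` flag are hypothetical, and falling back to the stripped text when no digit is found is an assumption. Only the regular expression is taken from the diff excerpt above.

```python
import re

# Pattern from the diff excerpt: a standalone digit 1-4 bounded by word boundaries.
_CHOICE_RE = re.compile(r"\b([1-4])\b")


def extract_answer(text: str, multiple_choice: bool = True) -> str:
    """Normalize decoded model output before exact-match scoring.

    `multiple_choice` is a hypothetical flag used only for illustration.
    """
    text = text.strip()
    if multiple_choice:
        m = _CHOICE_RE.search(text)
        if m:
            return m.group(1)
    # Assumption: with no standalone digit (or for open-ended VQA),
    # return the stripped text unchanged.
    return text


print(extract_answer("The answer is 3."))                 # 3
print(extract_answer("12"))                               # 12 (no standalone digit)
print(extract_answer("Figure 1 shows a leaf, so 3"))      # 1 (first match wins)
print(extract_answer(" a leaf ", multiple_choice=False))  # a leaf
```

The third call shows a property of first-match extraction worth keeping in mind: if the model restates part of the question before answering, an earlier digit is returned instead of the final answer.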


Add evaluation scripts for Qwen3-VL-2B-Instruct using Olive's new vision metrics (exact_match) on the AI2D science diagram QA benchmark.

New files:
- eval/evaluate.py: Standalone evaluation script with ONNX + optional PyTorch
- eval/eval_user_script.py: Olive-compatible post-processing function
- eval/qwen3vl_eval_ai2d.json: Olive config for running evaluation via olive run
- eval/requirements.txt: Dependencies
- eval/README.md: Usage instructions

Related Olive PR: microsoft/Olive#2474

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 27, 2026 18:39

Copilot started reviewing on behalf of jiafatom May 27, 2026 18:39

jiafatom mentioned this pull request May 27, 2026

Add vision evaluation metrics (exact_match, relaxed_accuracy, word_sort_ratio) microsoft/Olive#2474

Closed

Copilot AI reviewed May 27, 2026


jiafatom force-pushed the jiafa/add-vision-eval-qwen3vl branch from 6924c18 to bd41311 Compare May 27, 2026 18:53

jiafatom wants to merge 1 commit into microsoft:main from jiafatom:jiafa/add-vision-eval-qwen3vl
