Add vision evaluation for Qwen3-VL-2B-Instruct on AI2D #434

Open
jiafatom wants to merge 1 commit into microsoft:main from jiafatom:jiafa/add-vision-eval-qwen3vl

@jiafatom
Contributor

Summary

Add evaluation scripts for Qwen3-VL-2B-Instruct, computing exact_match accuracy on the AI2D science diagram QA benchmark using Olive's new vision metrics.

New Files

  • eval/evaluate.py — Standalone evaluation script with ONNX inference via onnxruntime-genai (+ optional PyTorch comparison)
  • eval/eval_user_script.py — Olive-compatible post-processing function for use with olive run
  • eval/qwen3vl_eval_ai2d.json — Olive config for running evaluation
  • eval/requirements.txt — Dependencies
  • eval/README.md — Usage instructions

Usage

# Standalone (100 samples, default)
python eval/evaluate.py --model_path cpu_and_mobile/models

# Via Olive
olive run --config eval/qwen3vl_eval_ai2d.json

Metrics

| Metric | Description | Task |
| --- | --- | --- |
| exact_match | Case-insensitive string equality | VQA (AI2D) |
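
As a rough illustration of what "case-insensitive string equality" means as an accuracy score, here is a minimal sketch. It is not Olive's implementation: the function name `exact_match_accuracy` is hypothetical, and stripping surrounding whitespace before comparing is an assumption of this sketch rather than something the PR states.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the fraction of predictions equal to their reference, ignoring case."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    # Assumption: surrounding whitespace is ignored; the PR only says "case-insensitive".
    hits = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return hits / len(predictions)


# Two of the three pairs match once case and whitespace are normalized.
score = exact_match_accuracy(["2", "B ", "cat"], ["2", "b", "dog"])
print(f"exact_match = {score:.3f}")
```

Because the comparison is strict equality, any extra words in the model output count as a miss, which is why the post-processing step that reduces the output to a bare answer matters for this metric.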

Related

Related Olive PR: microsoft/Olive#2474

Contributor

Copilot AI left a comment

Pull request overview

Adds an AI2D (science diagram VQA) evaluation workflow for the Qwen3-VL-2B-Instruct ONNX recipe, including a standalone runner and an olive run-compatible config that reports exact_match using Olive’s vision metrics.

Changes:

  • Add a standalone evaluation script (evaluate.py) for running AI2D exact-match evaluation (with optional PyTorch comparison).
  • Add an Olive evaluation config (qwen3vl_eval_ai2d.json) plus an Olive post-processing user script (eval_user_script.py).
  • Add evaluation docs and dependency list under builtin/eval/.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| Qwen-Qwen3-VL-2B-Instruct/builtin/eval/requirements.txt | Adds Python dependencies needed to run the standalone and Olive-based evaluation. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/eval/README.md | Documents how to run the evaluation standalone and via Olive. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/eval/qwen3vl_eval_ai2d.json | Olive config wiring dataset loading, preprocessing, metric selection, and post-processing. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/eval/evaluate.py | Standalone evaluation loop for AI2D with ORT GenAI (and optional PyTorch baseline). |
| Qwen-Qwen3-VL-2B-Instruct/builtin/eval/eval_user_script.py | Implements Olive post-processing to normalize model outputs for exact-match scoring. |

Comment on lines +17 to +28
import argparse
import io
import json
import re
import sys
import time
from pathlib import Path

import onnxruntime_genai as og
from datasets import load_dataset
from PIL import Image

Comment on lines +41 to +44
# Try to find a single digit answer (multiple choice)
m = re.search(r"\b([1-4])\b", text)
if m:
    return m.group(1)
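
To make the behavior of the excerpt above easier to reason about, the following sketch wraps the same regular expression in a small function and runs it on a few made-up model outputs. The function name `extract_choice` and the sample strings are hypothetical, and the fallback of returning the stripped text is an assumption based on the docstring's description of the open-ended case, not a copy of the PR's code.

```python
import re


def extract_choice(text: str) -> str:
    """Return the first standalone digit 1-4 found in text, else the stripped text."""
    # Same pattern as the excerpt: a single digit 1-4 bounded by word boundaries.
    m = re.search(r"\b([1-4])\b", text)
    if m:
        return m.group(1)
    # Assumption: fall back to the raw (stripped) text for open-ended answers.
    return text.strip()


samples = [
    "The answer is 3.",           # plain multiple-choice reply
    "12",                         # multi-digit number: no standalone 1-4
    "Step 1 rules out option 4",  # first standalone digit wins, not the last
]
for s in samples:
    print(repr(s), "->", repr(extract_choice(s)))
```

The third sample shows one property of a first-match rule: when the model explains its reasoning before answering, the extracted digit may come from the explanation rather than from the final choice.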
Comment on lines +11 to +14
"""Post-process Qwen3-VL model output to extract answer text.

For multiple-choice VQA (AI2D), extracts the first digit (1-4) from output.
For open-ended VQA, returns the decoded text directly.

Add evaluation scripts for Qwen3-VL-2B-Instruct using Olive's new vision
metrics (exact_match) on the AI2D science diagram QA benchmark.

New files:
- eval/evaluate.py: Standalone evaluation script with ONNX + optional PyTorch
- eval/eval_user_script.py: Olive-compatible post-processing function
- eval/qwen3vl_eval_ai2d.json: Olive config for running evaluation via olive run
- eval/requirements.txt: Dependencies
- eval/README.md: Usage instructions

Related Olive PR: microsoft/Olive#2474

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jiafatom jiafatom force-pushed the jiafa/add-vision-eval-qwen3vl branch from 6924c18 to bd41311 Compare May 27, 2026 18:53