Add vision evaluation metrics (exact_match, relaxed_accuracy, word_sort_ratio) #2474
Closed
jiafatom wants to merge 4 commits into
Conversation
…rt_ratio)

Add vision evaluation metrics to the Olive evaluator framework, enabling VQA, ChartQA, and OCR model evaluation.

- exact_match: case-insensitive string equality for VQA tasks
- relaxed_accuracy: ±5% numeric tolerance for ChartQA
- word_sort_ratio: word-level overlap ratio for OCR

Changes:
- olive/evaluator/metric.py: Add EXACT_MATCH, RELAXED_ACCURACY, WORD_SORT_RATIO to AccuracySubType
- olive/evaluator/accuracy.py: Add ExactMatch, RelaxedAccuracy, WordSortRatio classes
- olive/evaluator/olive_evaluator.py: Add _inference_vision() path and task-metric validation
- olive/data/component/pre_process_data.py: Add vision_vqa_pre_process data component
- olive/data/container/huggingface_container.py: Add vision-vqa, vision-chart-qa, vision-ocr tasks
- olive/olive_config.json: Add vision extra dependencies (Pillow)
- test/evaluator/test_accuracy.py: Add 20 unit tests for vision metrics

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
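For readers who want a concrete picture of the three metric definitions in the commit message above, here is a minimal per-sample sketch. It is not the code in `olive/evaluator/accuracy.py`: the function names, the whitespace stripping, the comma and percent cleanup, the fallback to string comparison for non-numeric answers, and the Dice-style word overlap formula are all assumptions made for illustration.

```python
from collections import Counter


def exact_match(prediction: str, target: str) -> bool:
    """Case-insensitive string equality. Stripping whitespace is an assumption."""
    return prediction.strip().lower() == target.strip().lower()


def _to_float(text: str):
    """Parse a number, tolerating commas and a trailing percent sign (assumed cleanup)."""
    try:
        return float(text.strip().replace(",", "").rstrip("%"))
    except ValueError:
        return None


def relaxed_accuracy(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Numeric answers match within a relative tolerance (5% by default)."""
    pred_num, target_num = _to_float(prediction), _to_float(target)
    if pred_num is None or target_num is None:
        # Assumption: non-numeric answers fall back to exact string matching.
        return exact_match(prediction, target)
    if target_num == 0:
        # Relative tolerance is undefined at zero, so require an exact zero.
        return pred_num == 0
    return abs(pred_num - target_num) <= tolerance * abs(target_num)


def word_sort_ratio(prediction: str, target: str) -> float:
    """Order-insensitive word overlap in [0, 1] (assumed Dice-style formula)."""
    pred_words = Counter(prediction.lower().split())
    target_words = Counter(target.lower().split())
    total = sum(pred_words.values()) + sum(target_words.values())
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    overlap = sum((pred_words & target_words).values())
    return 2.0 * overlap / total
```

A dataset-level score would then be the mean of these per-sample values over all predictions.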
Contributor
Pull request overview
This PR extends Olive’s evaluator framework with three vision-oriented “accuracy” sub-metrics intended for VQA/ChartQA/OCR evaluation, following the existing pattern used for speech metrics.
Changes:
- Adds new `AccuracySubType` enum values for `exact_match`, `relaxed_accuracy`, and `word_sort_ratio`.
- Implements `ExactMatch`, `RelaxedAccuracy`, and `WordSortRatio` metric classes and adds unit tests for their core behavior.
- Introduces a vision inference path in the evaluator and adds HuggingFace container task mappings plus a new `vision_vqa_pre_process` component and a `vision` extra dependency.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `olive/evaluator/metric.py` | Adds new accuracy sub-type enum values for vision metrics. |
| `olive/evaluator/accuracy.py` | Implements the three new vision metric computations. |
| `olive/evaluator/olive_evaluator.py` | Adds vision inference path and task↔metric validation logic for vision tasks. |
| `olive/data/component/pre_process_data.py` | Adds a new `vision_vqa_pre_process` pre-processing component with sampling support. |
| `olive/data/container/huggingface_container.py` | Registers new HuggingFace task types mapping to the vision pre-process component. |
| `olive/olive_config.json` | Adds a `vision` extra dependency entry. |
| `test/evaluator/test_accuracy.py` | Adds unit tests covering the new vision metrics. |
Comment on lines +117 to +121

    if metric.data_config and hasattr(metric.data_config, "task_type"):
        task_type = metric.data_config.task_type
    elif metric.data_config and hasattr(metric.data_config, "params_config"):
        task_type = getattr(metric.data_config.params_config, "task_type", None)
Comment on lines +446 to +450

    image = item[self.image_column]
    question = item[self.question_column]
    answer = item[self.answer_column]
    # Handle list answers (some datasets have multiple valid answers)
    if isinstance(answer, list):
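The snippet above stops at the `isinstance` check, so how the PR resolves list-valued answers is not visible here. One common convention, assumed for this sketch, is to keep every reference answer and count a prediction as correct if it matches any of them. `normalize_answers` and `matches_any` are hypothetical helper names, not functions from the PR.

```python
def normalize_answers(answer) -> list:
    """Return reference answers as a list of strings, whether the dataset stores one or many."""
    if isinstance(answer, list):
        return [str(a) for a in answer]
    return [str(answer)]


def matches_any(prediction: str, answer) -> bool:
    """Case-insensitive match against any reference answer (assumed convention)."""
    pred = prediction.strip().lower()
    return any(pred == ref.strip().lower() for ref in normalize_answers(answer))
```

Taking only the first list element instead would be simpler, but it would penalize predictions that match one of the other valid answers.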
    # Text-based accuracy sub-types that work with string predictions/targets
    _TEXT_BASED_ACCURACY_SUBTYPES = {AccuracySubType.WER, AccuracySubType.RTFX}
    _VISION_ACCURACY_SUBTYPES = {AccuracySubType.EXACT_MATCH, AccuracySubType.RELAXED_ACCURACY, AccuracySubType.WORD_SORT_RATIO}
      "torch-tensorrt": [ "torch-tensorrt" ],
    - "tune-session-params": [ "psutil" ]
    + "tune-session-params": [ "psutil" ],
    + "vision": [ "Pillow" ]
Document which datasets each vision metric is suitable for:
- exact_match: AI2D, ScienceQA, TextVQA, MathVista, MMMU, InterGPS
- relaxed_accuracy: ChartQA
- word_sort_ratio: OCR

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
Author
Supported Vision Benchmarks

These metrics map to standard public vision benchmarks:

Recipe PR

Evaluation recipe using these metrics with Qwen3-VL-2B-Instruct on AI2D: microsoft/olive-recipes#434
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jiafatom added a commit to jiafatom/olive-recipes that referenced this pull request May 27, 2026

Add evaluation scripts for Qwen3-VL-2B-Instruct using Olive's new vision metrics (exact_match) on the AI2D science diagram QA benchmark.

New files:
- eval/evaluate.py: Standalone evaluation script with ONNX + optional PyTorch
- eval/eval_user_script.py: Olive-compatible post-processing function
- eval/qwen3vl_eval_ai2d.json: Olive config for running evaluation via olive run
- eval/requirements.txt: Dependencies
- eval/README.md: Usage instructions

Related Olive PR: microsoft/Olive#2474

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix _validate_vision_task_metric to extract task from pre_process_data_config.params['task'] instead of non-existent DataConfig attributes
- Wrap _VISION_ACCURACY_SUBTYPES across multiple lines for lint compliance
- Use lowercase 'pillow' in olive_config.json for consistency
- Add docstring note about ONNX vs PyTorch path for vision_vqa_pre_process

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
could create a branch directly on the repo and open the PR from there? CI cannot run on forked PRs because of credentials.
Contributor
Author
Closing in favor of a new PR from the upstream repo branch (CI requires non-fork PRs).
Contributor
Author
See #2476
Summary
Add vision evaluation metrics to the Olive evaluator framework, enabling VQA, ChartQA, and OCR model evaluation following the same pattern as #2444 (speech metrics).
Changes
- `olive/evaluator/metric.py`: Add `EXACT_MATCH`, `RELAXED_ACCURACY`, `WORD_SORT_RATIO` to the `AccuracySubType` enum
- `olive/evaluator/accuracy.py`: Add `ExactMatch`, `RelaxedAccuracy`, `WordSortRatio` classes
- `olive/evaluator/olive_evaluator.py`: Add `_inference_vision()` path in OnnxEvaluator and PyTorchEvaluator; add task-metric validation that throws if the metric is incompatible with the task type
- `olive/data/component/pre_process_data.py`: Add `vision_vqa_pre_process` data component with Olive-style sampling (`--limit`/`--seed`)
- `olive/data/container/huggingface_container.py`: Add `vision-vqa`, `vision-chart-qa`, `vision-ocr` task types
- `olive/olive_config.json`: Add `vision` extra dependencies (Pillow)

Task-Metric Validation
Each vision task restricts which metrics are valid:
| Task type | Valid metric |
|---|---|
| `vision-vqa` | `exact_match` |
| `vision-chart-qa` | `relaxed_accuracy` |
| `vision-ocr` | `word_sort_ratio` |

If a user selects an incompatible metric, a clear exception is raised.
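The task-to-metric restriction described above can be sketched as a lookup table plus a check. This is an illustration only, not the PR's `_validate_vision_task_metric`: the name `check_vision_task_metrics`, the choice of `ValueError`, the message wording, and the decision to skip non-vision tasks are assumptions.

```python
# Mapping taken from the task/metric pairs listed in this PR description.
VALID_METRICS_BY_TASK = {
    "vision-vqa": {"exact_match"},
    "vision-chart-qa": {"relaxed_accuracy"},
    "vision-ocr": {"word_sort_ratio"},
}


def check_vision_task_metrics(task: str, sub_type_names: list) -> None:
    """Raise if any requested metric is not valid for the given vision task."""
    allowed = VALID_METRICS_BY_TASK.get(task)
    if allowed is None:
        # Assumption: tasks outside the vision set are not checked here.
        return
    invalid = sorted(set(sub_type_names) - allowed)
    if invalid:
        raise ValueError(
            f"Metric(s) {invalid} are not valid for task '{task}'; "
            f"expected one of {sorted(allowed)}."
        )
```

Running the check before inference means a mismatched config fails fast, rather than after the model has already processed the dataset.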
Testing
20 unit tests added covering all three metrics with edge cases (all passing).
Usage
    {
      "metrics": [{
        "name": "vision_eval",
        "type": "accuracy",
        "sub_types": [
          {"name": "exact_match", "higher_is_better": true}
        ],
        "data_config": {
          "type": "HuggingfaceContainer",
          "task_type": "vision-vqa",
          "params": {
            "data_name": "HuggingFaceH4/ScienceQA",
            "split": "test"
          }
        }
      }]
    }