Add vision evaluation metrics (exact_match, relaxed_accuracy, word_sort_ratio) by jiafatom · Pull Request #2476 · microsoft/Olive

jiafatom · 2026-05-27T20:25:34Z

Summary

Extends Olive's evaluator framework with three vision-oriented accuracy sub-metrics for VQA, ChartQA, and OCR evaluation, following the existing pattern used for speech metrics (PR #2444).

New Metrics

| Metric | Task Type | Suitable Benchmarks |
| --- | --- | --- |
| `exact_match` | `vision-vqa` | AI2D, ScienceQA, TextVQA, MathVista, MMMU, InterGPS |
| `relaxed_accuracy` | `vision-chart-qa` | ChartQA (±5% tolerance on numeric answers) |
| `word_sort_ratio` | `vision-ocr` | OCR benchmarks (word-level overlap) |
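
As a rough per-sample illustration of what the three metrics in the table compute, here is a minimal sketch. It is not the PR's `ExactMatch`, `RelaxedAccuracy`, or `WordSortRatio` classes. The function names, the whitespace trimming, the fallback to exact match for non-numeric answers, and the use of `difflib.SequenceMatcher` on sorted words are all assumptions; only the case-insensitive comparison, the ±5% tolerance, and the zero-reference branch come from the PR text and diff.

```python
from difflib import SequenceMatcher


def exact_match(pred: str, ref: str) -> bool:
    """Case-insensitive equality (whitespace trimming is an assumption)."""
    return pred.strip().lower() == ref.strip().lower()


def relaxed_match(pred: str, ref: str, tolerance: float = 0.05) -> bool:
    """Numeric answers match within a relative tolerance.

    Falling back to exact_match for non-numeric answers is an assumption.
    """
    try:
        pred_val, ref_val = float(pred), float(ref)
    except ValueError:
        return exact_match(pred, ref)
    # Mirrors the branch shown in the PR diff: a zero reference only
    # matches a zero prediction, which avoids dividing by zero.
    if ref_val == 0:
        return pred_val == 0
    return abs(pred_val - ref_val) / abs(ref_val) <= tolerance


def word_sort_ratio(pred: str, ref: str) -> float:
    """Similarity in [0, 1] after sorting each string's words (assumed definition)."""
    pred_sorted = " ".join(sorted(pred.lower().split()))
    ref_sorted = " ".join(sorted(ref.lower().split()))
    return SequenceMatcher(None, pred_sorted, ref_sorted).ratio()
```

With these definitions, `relaxed_match("104", "100")` is true and `relaxed_match("106", "100")` is false, while `word_sort_ratio("world hello", "hello world")` is `1.0` because word order is ignored.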

Changes

olive/evaluator/metric.py: Adds EXACT_MATCH, RELAXED_ACCURACY, WORD_SORT_RATIO to AccuracySubType enum
olive/evaluator/accuracy.py: Implements the three metric classes
olive/evaluator/olive_evaluator.py: Adds vision inference path and task↔metric validation
olive/data/component/pre_process_data.py: Adds vision_vqa_pre_process component
olive/data/container/huggingface_container.py: Registers vision-vqa, vision-chart-qa, vision-ocr task types
olive/olive_config.json: Adds vision extras (pillow)
test/evaluator/test_accuracy.py: 20 unit tests covering all new metrics

Design

Vision metrics are text-based and task-dependent: each compares the predicted answer string to the ground-truth string
Task-metric validation ensures incompatible combinations raise ValueError
PyTorch path: model processor handles images natively
ONNX path: requires user-provided pre-processing to produce numeric tensors
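
The task-metric validation described above could look roughly like the following sketch. The one-to-one pairing is read off the "New Metrics" table and may be stricter than what the PR enforces; `VISION_METRIC_TASKS` and `validate_vision_task_metric` are hypothetical names, not the PR's `_validate_vision_task_metric`.

```python
from typing import Optional

# Pairing taken from the "New Metrics" table (assumed to be one-to-one).
VISION_METRIC_TASKS = {
    "exact_match": "vision-vqa",
    "relaxed_accuracy": "vision-chart-qa",
    "word_sort_ratio": "vision-ocr",
}


def validate_vision_task_metric(sub_type: str, task: Optional[str]) -> None:
    """Raise ValueError when a vision metric is paired with the wrong task."""
    expected = VISION_METRIC_TASKS.get(sub_type)
    if expected is None:
        return  # not a vision metric, nothing to check
    if task != expected:
        raise ValueError(
            f"Metric '{sub_type}' expects task '{expected}', got '{task}'."
        )
```

Failing early in this way surfaces a misconfigured workflow before any inference is run.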

…rt_ratio)

Add vision evaluation metrics to the Olive evaluator framework, enabling VQA, ChartQA, and OCR model evaluation.

- exact_match: case-insensitive string equality for VQA tasks
- relaxed_accuracy: ±5% numeric tolerance for ChartQA
- word_sort_ratio: word-level overlap ratio for OCR

Changes:

- olive/evaluator/metric.py: Add EXACT_MATCH, RELAXED_ACCURACY, WORD_SORT_RATIO to AccuracySubType
- olive/evaluator/accuracy.py: Add ExactMatch, RelaxedAccuracy, WordSortRatio classes
- olive/evaluator/olive_evaluator.py: Add _inference_vision() path and task-metric validation
- olive/data/component/pre_process_data.py: Add vision_vqa_pre_process data component
- olive/data/container/huggingface_container.py: Add vision-vqa, vision-chart-qa, vision-ocr tasks
- olive/olive_config.json: Add vision extra dependencies (Pillow)
- test/evaluator/test_accuracy.py: Add 20 unit tests for vision metrics

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Document which datasets each vision metric is suitable for:

- exact_match: AI2D, ScienceQA, TextVQA, MathVista, MMMU, InterGPS
- relaxed_accuracy: ChartQA
- word_sort_ratio: OCR

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix _validate_vision_task_metric to extract task from pre_process_data_config.params['task'] instead of non-existent DataConfig attributes
- Wrap _VISION_ACCURACY_SUBTYPES across multiple lines for lint compliance
- Use lowercase 'pillow' in olive_config.json for consistency
- Add docstring note about ONNX vs PyTorch path for vision_vqa_pre_process

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR extends Olive’s evaluator framework with three vision-oriented, text-based accuracy sub-metrics intended for VQA/ChartQA/OCR-style evaluation, adding corresponding task registrations, data pre-processing, and unit tests.

Changes:

Add three new AccuracySubType values (exact_match, relaxed_accuracy, word_sort_ratio) and implement their metric logic.
Introduce a vision string-inference path in both ONNX and PyTorch evaluators, including task↔metric compatibility validation.
Register new HuggingFace task types and add a vision VQA pre-process component plus unit tests for the new metrics.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `olive/evaluator/metric.py` | Adds new vision accuracy sub-types to the enum. |
| `olive/evaluator/accuracy.py` | Implements `ExactMatch`, `RelaxedAccuracy`, and `WordSortRatio`. |
| `olive/evaluator/olive_evaluator.py` | Adds vision inference paths and task/metric validation for vision metrics. |
| `olive/data/component/pre_process_data.py` | Adds `vision_vqa_pre_process` that emits (image, question) inputs and string answers. |
| `olive/data/container/huggingface_container.py` | Registers new vision task types mapping to the vision pre-process component. |
| `olive/olive_config.json` | Adds a `vision` extra dependency (`pillow`). |
| `test/evaluator/test_accuracy.py` | Adds unit tests for all new metric classes. |

+            DataComponentType.PRE_PROCESS_DATA.value: "vision_vqa_pre_process",
+        },
+        "vision-chart-qa": {
+            DataComponentType.PRE_PROCESS_DATA.value: "vision_vqa_pre_process",
+        },
+        "vision-ocr": {
+            DataComponentType.PRE_PROCESS_DATA.value: "vision_vqa_pre_process",


+
+    Note: This returns raw PIL images and question strings. For the PyTorch evaluator,
+    the model's own processor/tokenizer should be applied in the post_func or within
+    the model's forward method. For the ONNX evaluator, provide a custom pre-process
+    component that applies the appropriate processor/tokenizer to produce numeric
+    tensors matching the model's io_config.


jambayk · 2026-05-27T20:51:32Z

+        for batch in dataloader:
+            input_data, labels = OliveEvaluator.unpack_batch_for_accuracy(batch)
+            input_feed = format_data(input_data, io_config)
+            result = model.run_session(session, input_feed, **run_kwargs)


Curious whether there are models that can generate the answer with just a single session run. Are these conditional generation models that don't need a generation loop?
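
To make the question concrete: for an autoregressive model, one forward pass normally yields only the next token, so producing a full answer string takes a loop around the session run unless the exported graph embeds that loop itself. The sketch below shows such a greedy loop. `step_fn` is a hypothetical stand-in for one session run followed by an argmax, and the toy model is invented purely for illustration; whether the models targeted by this PR need such a loop is not stated in the thread.

```python
from typing import Callable, List, Sequence


def greedy_generate(
    step_fn: Callable[[Sequence[int]], int],
    prompt_ids: Sequence[int],
    eos_id: int,
    max_new_tokens: int = 32,
) -> List[int]:
    """Call step_fn repeatedly, feeding back each predicted token, until EOS."""
    tokens = list(prompt_ids)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        next_id = step_fn(tokens)  # stands in for one forward pass / session run
        if next_id == eos_id:
            break
        generated.append(next_id)
        tokens.append(next_id)
    return generated


def toy_step(tokens: Sequence[int]) -> int:
    """Toy model: count upward from the last token, then emit EOS (id 0)."""
    return tokens[-1] + 1 if tokens[-1] < 3 else 0


answer_ids = greedy_generate(toy_step, prompt_ids=[1], eos_id=0)
```

Here a single call to `toy_step` would return only the first answer token, whereas the loop collects `[2, 3]` before hitting EOS.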

shaahji · 2026-05-27T20:31:06Z

+            question = item[self.question_column]
+            answer = item[self.answer_column]
+            # Handle list answers (some datasets have multiple valid answers)
+            if isinstance(answer, list):


list or tuple
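
A small sketch of the check this comment asks for. What the PR does inside the branch is not visible in the hunk, so the normalization to a list of strings below is an assumption, and `normalize_answers` is a hypothetical helper name.

```python
from typing import List, Sequence, Union


def normalize_answers(answer: Union[str, Sequence[str]]) -> List[str]:
    """Return the reference answers as a list, accepting str, list, or tuple."""
    # isinstance accepts a tuple of types, so one check covers both containers.
    if isinstance(answer, (list, tuple)):
        return [str(a) for a in answer]
    return [str(answer)]
```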

shaahji · 2026-05-27T20:33:53Z

+            """Collate VQA batches. Use with batch_size=1 for variable-size images."""
+            if len(batch) == 1:
+                input_dict, answer = batch[0]
+                return (input_dict, [answer])


How are "multiple valid answers" handled here? Wouldn't this create a list of lists thus making it a bad input downstream?
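
The concern can be reproduced with a short sketch. `collate_single` copies the single-item branch from the hunk above; `matches_any` is a hypothetical scorer showing one way a downstream metric could tolerate either a single string or several valid answers. It is not code from the PR.

```python
from typing import Sequence, Union


def collate_single(batch):
    """Single-item branch from the hunk: wraps the answer in a list."""
    input_dict, answer = batch[0]
    return (input_dict, [answer])


def matches_any(pred: str, reference: Union[str, Sequence[str]]) -> bool:
    """True if pred equals any acceptable reference, ignoring case."""
    refs = reference if isinstance(reference, (list, tuple)) else [reference]
    return any(pred.strip().lower() == r.strip().lower() for r in refs)


# A sample whose answer field already holds several valid answers.
_, labels = collate_single([({"question": "What animal?"}, ["cat", "kitten"])])
```

After collation `labels` is `[["cat", "kitten"]]`, a list of lists, so a metric that expects one string per sample would need something like `matches_any` to score it.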

shaahji · 2026-05-27T20:47:06Z

+                if ref_val == 0:
+                    if pred_val == 0:
+                        correct += 1
+                elif abs(pred_val - ref_val) / abs(ref_val) <= tolerance:
+                    correct += 1


nit: Simplify this snippet

if (ref_val == pred_val) or (abs(pred_val - ref_val) / abs(ref_val) <= tolerance): correct += 1
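
One edge case seems worth checking before adopting the one-liner: when `ref_val` is 0 and `pred_val` is not, the equality test is false and the second operand divides by zero. The sketch below transcribes both forms and adds a division-free variant; the function names are hypothetical, and the 0.05 default is taken from the ±5% tolerance in the PR description.

```python
def within_tolerance_original(pred_val, ref_val, tolerance=0.05):
    """Branching form from the diff: a zero reference only matches a zero prediction."""
    if ref_val == 0:
        return pred_val == 0
    return abs(pred_val - ref_val) / abs(ref_val) <= tolerance


def within_tolerance_oneliner(pred_val, ref_val, tolerance=0.05):
    """The suggested one-liner, transcribed as written."""
    return (ref_val == pred_val) or (abs(pred_val - ref_val) / abs(ref_val) <= tolerance)


def within_tolerance_no_division(pred_val, ref_val, tolerance=0.05):
    """Multiply instead of divide, so a zero reference needs no special case."""
    return abs(pred_val - ref_val) <= tolerance * abs(ref_val)


try:
    within_tolerance_oneliner(1.0, 0.0)
    oneliner_raised = False
except ZeroDivisionError:
    oneliner_raised = True
```

The multiplication form keeps the single-expression shape the nit asks for while matching the original branch for a zero reference.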

jiafatom and others added 4 commits May 27, 2026 18:13

Remove internal project references from comments

28d8110

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 27, 2026 20:25

Copilot started reviewing on behalf of jiafatom May 27, 2026 20:25

jiafatom mentioned this pull request May 27, 2026: Add vision evaluation metrics (exact_match, relaxed_accuracy, word_sort_ratio) #2474 (Closed)

Copilot AI reviewed May 27, 2026

jambayk reviewed May 27, 2026

shaahji requested changes May 27, 2026

jiafatom wants to merge 4 commits into main from jiafa/add-vision-eval-metrics

4 participants
