Add vision evaluation metrics (exact_match, relaxed_accuracy, word_sort_ratio) #2476

Open
jiafatom wants to merge 4 commits into main from jiafa/add-vision-eval-metrics

@jiafatom
Contributor

Summary

Extends Olive's evaluator framework with three vision-oriented accuracy sub-metrics for VQA, ChartQA, and OCR evaluation, following the existing pattern used for speech metrics (PR #2444).

New Metrics

| Metric | Task Type | Suitable Benchmarks |
|---|---|---|
| exact_match | vision-vqa | AI2D, ScienceQA, TextVQA, MathVista, MMMU, InterGPS |
| relaxed_accuracy | vision-chart-qa | ChartQA (±5% tolerance for numeric answers) |
| word_sort_ratio | vision-ocr | OCR benchmarks (word-level overlap) |
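
To make the three metrics concrete, here is a minimal standalone sketch of per-sample scoring. It is not the code from olive/evaluator/accuracy.py: the function names, the number parsing in `_to_float`, the fallback to exact match for non-numeric answers, and the use of `difflib.SequenceMatcher` for the word-level ratio are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def exact_match(pred: str, ref: str) -> bool:
    """Case-insensitive equality after trimming surrounding whitespace."""
    return pred.strip().lower() == ref.strip().lower()


def _to_float(text: str):
    """Parse a number, tolerating thousands separators and a trailing '%'.

    Returns None when the text is not numeric. (Hypothetical helper.)
    """
    cleaned = text.strip().replace(",", "").rstrip("%")
    try:
        return float(cleaned)
    except ValueError:
        return None


def relaxed_accuracy(pred: str, ref: str, tolerance: float = 0.05) -> bool:
    """Numeric answers match within a relative tolerance; others need exact match."""
    pred_val, ref_val = _to_float(pred), _to_float(ref)
    if pred_val is None or ref_val is None:
        # Assumption: non-numeric answers fall back to exact string match.
        return exact_match(pred, ref)
    if ref_val == 0:
        # Relative error is undefined for a zero reference.
        return pred_val == 0
    return abs(pred_val - ref_val) / abs(ref_val) <= tolerance


def word_sort_ratio(pred: str, ref: str) -> float:
    """Similarity in [0, 1] of the two strings after lowercasing and sorting words."""
    a = " ".join(sorted(pred.lower().split()))
    b = " ".join(sorted(ref.lower().split()))
    return SequenceMatcher(None, a, b).ratio()
```

Under these assumptions, `relaxed_accuracy("104", "100")` is accepted and `relaxed_accuracy("106", "100")` is rejected, while `word_sort_ratio` ignores word order and case.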

Changes

  • olive/evaluator/metric.py: Adds EXACT_MATCH, RELAXED_ACCURACY, WORD_SORT_RATIO to AccuracySubType enum
  • olive/evaluator/accuracy.py: Implements the three metric classes
  • olive/evaluator/olive_evaluator.py: Adds vision inference path and task↔metric validation
  • olive/data/component/pre_process_data.py: Adds vision_vqa_pre_process component
  • olive/data/container/huggingface_container.py: Registers vision-vqa, vision-chart-qa, vision-ocr task types
  • olive/olive_config.json: Adds vision extras (pillow)
  • test/evaluator/test_accuracy.py: 20 unit tests covering all new metrics

Design

  • Vision metrics are text-based: each compares the predicted answer string to the ground truth, and each is tied to a specific task type
  • Task-metric validation ensures incompatible combinations raise ValueError
  • PyTorch path: model processor handles images natively
  • ONNX path: requires user-provided pre-processing to produce numeric tensors
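
The task-metric validation described in the list above could look roughly like the following sketch. The one-to-one mapping is read off the "New Metrics" table; the dictionary, the function name, and its string-based signature are hypothetical, and the real check in olive_evaluator.py may accept more combinations or read the task from the data config instead.

```python
# Hypothetical mapping, taken from the "New Metrics" table in this PR description.
VISION_TASK_TO_METRIC = {
    "vision-vqa": "exact_match",
    "vision-chart-qa": "relaxed_accuracy",
    "vision-ocr": "word_sort_ratio",
}


def validate_vision_task_metric(task: str, sub_type: str) -> None:
    """Raise ValueError when a vision metric is paired with an incompatible task."""
    if sub_type not in VISION_TASK_TO_METRIC.values():
        return  # not a vision metric; nothing to validate in this sketch
    expected = VISION_TASK_TO_METRIC.get(task)
    if expected != sub_type:
        raise ValueError(
            f"Metric '{sub_type}' is not compatible with task '{task}'"
        )
```

Failing early with a `ValueError` keeps a misconfigured run from silently reporting a score computed with the wrong comparison rule.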

jiafatom and others added 4 commits May 27, 2026 18:13
…rt_ratio)

Add vision evaluation metrics to the Olive evaluator framework, enabling
VQA, ChartQA, and OCR model evaluation.

- exact_match: case-insensitive string equality for VQA tasks
- relaxed_accuracy: ±5% numeric tolerance for ChartQA
- word_sort_ratio: word-level overlap ratio for OCR

Changes:
- olive/evaluator/metric.py: Add EXACT_MATCH, RELAXED_ACCURACY, WORD_SORT_RATIO to AccuracySubType
- olive/evaluator/accuracy.py: Add ExactMatch, RelaxedAccuracy, WordSortRatio classes
- olive/evaluator/olive_evaluator.py: Add _inference_vision() path and task-metric validation
- olive/data/component/pre_process_data.py: Add vision_vqa_pre_process data component
- olive/data/container/huggingface_container.py: Add vision-vqa, vision-chart-qa, vision-ocr tasks
- olive/olive_config.json: Add vision extra dependencies (Pillow)
- test/evaluator/test_accuracy.py: Add 20 unit tests for vision metrics

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Document which datasets each vision metric is suitable for:
- exact_match: AI2D, ScienceQA, TextVQA, MathVista, MMMU, InterGPS
- relaxed_accuracy: ChartQA
- word_sort_ratio: OCR

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix _validate_vision_task_metric to extract task from
  pre_process_data_config.params['task'] instead of non-existent
  DataConfig attributes
- Wrap _VISION_ACCURACY_SUBTYPES across multiple lines for lint compliance
- Use lowercase 'pillow' in olive_config.json for consistency
- Add docstring note about ONNX vs PyTorch path for vision_vqa_pre_process

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR extends Olive’s evaluator framework with three vision-oriented, text-based accuracy sub-metrics intended for VQA/ChartQA/OCR-style evaluation, adding corresponding task registrations, data pre-processing, and unit tests.

Changes:

  • Add three new AccuracySubType values (exact_match, relaxed_accuracy, word_sort_ratio) and implement their metric logic.
  • Introduce a vision string-inference path in both ONNX and PyTorch evaluators, including task↔metric compatibility validation.
  • Register new HuggingFace task types and add a vision VQA pre-process component plus unit tests for the new metrics.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| olive/evaluator/metric.py | Adds new vision accuracy sub-types to the enum. |
| olive/evaluator/accuracy.py | Implements ExactMatch, RelaxedAccuracy, and WordSortRatio. |
| olive/evaluator/olive_evaluator.py | Adds vision inference paths and task/metric validation for vision metrics. |
| olive/data/component/pre_process_data.py | Adds vision_vqa_pre_process that emits (image, question) inputs and string answers. |
| olive/data/container/huggingface_container.py | Registers new vision task types mapping to the vision pre-process component. |
| olive/olive_config.json | Adds a vision extra dependency (pillow). |
| test/evaluator/test_accuracy.py | Adds unit tests for all new metric classes. |

Comment on lines +45 to +51
        DataComponentType.PRE_PROCESS_DATA.value: "vision_vqa_pre_process",
    },
    "vision-chart-qa": {
        DataComponentType.PRE_PROCESS_DATA.value: "vision_vqa_pre_process",
    },
    "vision-ocr": {
        DataComponentType.PRE_PROCESS_DATA.value: "vision_vqa_pre_process",
Comment on lines +399 to +404

Note: This returns raw PIL images and question strings. For the PyTorch evaluator,
the model's own processor/tokenizer should be applied in the post_func or within
the model's forward method. For the ONNX evaluator, provide a custom pre-process
component that applies the appropriate processor/tokenizer to produce numeric
tensors matching the model's io_config.

for batch in dataloader:
    input_data, labels = OliveEvaluator.unpack_batch_for_accuracy(batch)
    input_feed = format_data(input_data, io_config)
    result = model.run_session(session, input_feed, **run_kwargs)
Contributor

Curious whether there are models that can generate the answer with just a single session run. Are these conditional-generation models that don't need a generation loop?

question = item[self.question_column]
answer = item[self.answer_column]
# Handle list answers (some datasets have multiple valid answers)
if isinstance(answer, list):
Collaborator

list or tuple

"""Collate VQA batches. Use with batch_size=1 for variable-size images."""
if len(batch) == 1:
    input_dict, answer = batch[0]
    return (input_dict, [answer])
Collaborator

How are "multiple valid answers" handled here? Wouldn't this create a list of lists thus making it a bad input downstream?
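
The nesting concern can be reproduced in isolation. The sketch below mirrors the batch_size=1 branch quoted above and assumes the pre-process step leaves a multi-answer item as a list (the PR's actual handling after the `isinstance` check is not shown here). `collate_single` and `matches_any` are hypothetical names; `matches_any` is one possible way a metric could accept either a single answer or several.

```python
def collate_single(batch):
    """Mirror of the batch_size=1 branch quoted above (illustrative only)."""
    input_dict, answer = batch[0]
    return (input_dict, [answer])


# Assumption: the dataset item carries several valid answers as a list.
item = ({"image": None, "question": "What color is the car?"}, ["red", "dark red"])
_, labels = collate_single([item])
print(labels)  # [['red', 'dark red']]


def matches_any(pred, ref):
    """Count a prediction as correct if it equals any accepted answer."""
    candidates = ref if isinstance(ref, (list, tuple)) else [ref]
    return any(pred.strip().lower() == c.strip().lower() for c in candidates)
```

A metric that compares `pred` to `labels[0]` with plain string equality would fail on the nested list, whereas a helper such as `matches_any` treats both shapes uniformly.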

Comment on lines +324 to +328
if ref_val == 0:
    if pred_val == 0:
        correct += 1
elif abs(pred_val - ref_val) / abs(ref_val) <= tolerance:
    correct += 1
Collaborator

nit: Simplify this snippet

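# The ref_val != 0 guard avoids ZeroDivisionError when ref_val == 0 and pred_val != 0
if ref_val == pred_val or (ref_val != 0 and abs(pred_val - ref_val) / abs(ref_val) <= tolerance):
    correct += 1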
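
One thing to watch when collapsing the zero check into a single condition: any expression that divides by `abs(ref_val)` raises `ZeroDivisionError` in Python when `ref_val` is 0 and `pred_val` is not. As a sketch, the branching logic from the PR can be compared with a division-free form that multiplies the tolerance instead. Both function names are hypothetical, and the two forms may differ by floating-point rounding exactly at the tolerance boundary.

```python
def is_close_branching(pred_val, ref_val, tolerance=0.05):
    """Logic as written in the PR: handle a zero reference separately."""
    if ref_val == 0:
        return pred_val == 0
    return abs(pred_val - ref_val) / abs(ref_val) <= tolerance


def is_close_division_free(pred_val, ref_val, tolerance=0.05):
    """Same check without dividing: a zero reference only matches a zero prediction."""
    return abs(pred_val - ref_val) <= tolerance * abs(ref_val)


# Compare the two forms on a few representative (prediction, reference) pairs.
cases = [(104.0, 100.0), (106.0, 100.0), (0.0, 0.0), (0.1, 0.0), (-98.0, -100.0)]
for pred_val, ref_val in cases:
    assert is_close_branching(pred_val, ref_val) == is_close_division_free(pred_val, ref_val)
```

The division-free form needs no special case because `tolerance * abs(ref_val)` is 0 when the reference is 0.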
