Add vision evaluation metrics (exact_match, relaxed_accuracy, word_sort_ratio) #2476
jiafatom wants to merge 4 commits
…rt_ratio)

Add vision evaluation metrics to the Olive evaluator framework, enabling VQA, ChartQA, and OCR model evaluation.

- exact_match: case-insensitive string equality for VQA tasks
- relaxed_accuracy: ±5% numeric tolerance for ChartQA
- word_sort_ratio: word-level overlap ratio for OCR

Changes:
- olive/evaluator/metric.py: Add EXACT_MATCH, RELAXED_ACCURACY, WORD_SORT_RATIO to AccuracySubType
- olive/evaluator/accuracy.py: Add ExactMatch, RelaxedAccuracy, WordSortRatio classes
- olive/evaluator/olive_evaluator.py: Add _inference_vision() path and task-metric validation
- olive/data/component/pre_process_data.py: Add vision_vqa_pre_process data component
- olive/data/container/huggingface_container.py: Add vision-vqa, vision-chart-qa, vision-ocr tasks
- olive/olive_config.json: Add vision extra dependencies (Pillow)
- test/evaluator/test_accuracy.py: Add 20 unit tests for vision metrics

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Document which datasets each vision metric is suitable for:
- exact_match: AI2D, ScienceQA, TextVQA, MathVista, MMMU, InterGPS
- relaxed_accuracy: ChartQA
- word_sort_ratio: OCR

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix _validate_vision_task_metric to extract task from pre_process_data_config.params['task'] instead of non-existent DataConfig attributes
- Wrap _VISION_ACCURACY_SUBTYPES across multiple lines for lint compliance
- Use lowercase 'pillow' in olive_config.json for consistency
- Add docstring note about ONNX vs PyTorch path for vision_vqa_pre_process

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR extends Olive’s evaluator framework with three vision-oriented, text-based accuracy sub-metrics intended for VQA/ChartQA/OCR-style evaluation, adding corresponding task registrations, data pre-processing, and unit tests.
Changes:
- Add three new AccuracySubType values (exact_match, relaxed_accuracy, word_sort_ratio) and implement their metric logic.
- Introduce a vision string-inference path in both ONNX and PyTorch evaluators, including task↔metric compatibility validation.
- Register new HuggingFace task types and add a vision VQA pre-process component plus unit tests for the new metrics.
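As an illustration of the simplest of the three metrics, here is a minimal sketch of case-insensitive exact match over a batch of string predictions. The function name `exact_match_score` is hypothetical and this is not the PR's `ExactMatch` class; stripping surrounding whitespace and returning 0.0 for empty input are assumptions that the PR text does not state.

```python
from typing import Sequence


def exact_match_score(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the fraction of predictions equal to their reference, ignoring case."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        # Assumption: an empty evaluation set scores 0.0 rather than raising.
        return 0.0
    # Assumption: surrounding whitespace is ignored in addition to case.
    hits = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return hits / len(predictions)
```

For example, `exact_match_score(["Cat", "dog"], ["cat", "bird"])` counts one match out of two.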
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| olive/evaluator/metric.py | Adds new vision accuracy sub-types to the enum. |
| olive/evaluator/accuracy.py | Implements ExactMatch, RelaxedAccuracy, and WordSortRatio. |
| olive/evaluator/olive_evaluator.py | Adds vision inference paths and task/metric validation for vision metrics. |
| olive/data/component/pre_process_data.py | Adds vision_vqa_pre_process that emits (image, question) inputs and string answers. |
| olive/data/container/huggingface_container.py | Registers new vision task types mapping to the vision pre-process component. |
| olive/olive_config.json | Adds a vision extra dependency (pillow). |
| test/evaluator/test_accuracy.py | Adds unit tests for all new metric classes. |

        DataComponentType.PRE_PROCESS_DATA.value: "vision_vqa_pre_process",
    },
    "vision-chart-qa": {
        DataComponentType.PRE_PROCESS_DATA.value: "vision_vqa_pre_process",
    },
    "vision-ocr": {
        DataComponentType.PRE_PROCESS_DATA.value: "vision_vqa_pre_process",

    Note: This returns raw PIL images and question strings. For the PyTorch evaluator,
    the model's own processor/tokenizer should be applied in the post_func or within
    the model's forward method. For the ONNX evaluator, provide a custom pre-process
    component that applies the appropriate processor/tokenizer to produce numeric
    tensors matching the model's io_config.

    for batch in dataloader:
        input_data, labels = OliveEvaluator.unpack_batch_for_accuracy(batch)
        input_feed = format_data(input_data, io_config)
        result = model.run_session(session, input_feed, **run_kwargs)

Curious whether there are models that can generate the answer with just a single session run. Are these conditional generation models that don't need a generation loop?

    question = item[self.question_column]
    answer = item[self.answer_column]
    # Handle list answers (some datasets have multiple valid answers)
    if isinstance(answer, list):

    """Collate VQA batches. Use with batch_size=1 for variable-size images."""
    if len(batch) == 1:
        input_dict, answer = batch[0]
        return (input_dict, [answer])

How are "multiple valid answers" handled here? Wouldn't this create a list of lists thus making it a bad input downstream?
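To make the concern concrete, the sketch below reproduces the quoted `batch_size=1` collate branch and shows what the labels look like if a list-valued answer reaches it unchanged. How the PR's pre-process step actually treats list answers is not visible in the quoted hunk, so this only illustrates the "if the list survives" case. `collate_single` and `matches_any` are hypothetical names, and accepting any one of several valid answers is one possible policy, not the PR's confirmed behavior.

```python
from typing import Sequence, Union

Answer = Union[str, Sequence[str]]


def collate_single(batch):
    """Mimic the quoted batch_size=1 branch: wrap the single answer in a list."""
    input_dict, answer = batch[0]
    return input_dict, [answer]


def matches_any(prediction: str, reference: Answer) -> bool:
    """Treat a reference as one string or several acceptable strings (assumed policy)."""
    candidates = [reference] if isinstance(reference, str) else list(reference)
    pred = prediction.strip().lower()
    return any(pred == str(candidate).strip().lower() for candidate in candidates)


# A dataset item whose answer column holds two valid answers.
_, labels = collate_single([({"question": "What color is the car?"}, ["red", "crimson"])])
# labels is now nested: one batch entry that is itself a list of answers.
```

A metric that compares each prediction to `labels[i]` with plain string equality would fail on this nested shape, whereas a comparison such as `matches_any` tolerates both a single string and a list per sample.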

    if ref_val == 0:
        if pred_val == 0:
            correct += 1
    elif abs(pred_val - ref_val) / abs(ref_val) <= tolerance:
        correct += 1

nit: Simplify this snippet:

    if (ref_val == pred_val) or (abs(pred_val - ref_val) / abs(ref_val) <= tolerance):
        correct += 1

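One detail worth checking when collapsing the branch: with plain Python floats, the one-line form still evaluates the division when `ref_val` is 0 and `pred_val` is nonzero, which raises `ZeroDivisionError`. The sketch below is a hypothetical helper, not the PR's `RelaxedAccuracy` code, that keeps the zero guard while staying compact. The default `tolerance=0.05` follows the ±5% stated in the commit message.

```python
def is_relaxed_match(pred_val: float, ref_val: float, tolerance: float = 0.05) -> bool:
    """Return True when pred_val is within a relative tolerance of ref_val."""
    if ref_val == 0:
        # Relative error is undefined for a zero reference, so require an exact zero.
        return pred_val == 0
    return abs(pred_val - ref_val) / abs(ref_val) <= tolerance


# Counting correct predictions over paired numeric values.
pairs = [(104.0, 100.0), (106.0, 100.0), (0.0, 0.0), (1.0, 0.0)]
correct = sum(is_relaxed_match(pred, ref) for pred, ref in pairs)
```

An equivalent single condition would be `pred_val == ref_val or (ref_val != 0 and abs(pred_val - ref_val) / abs(ref_val) <= tolerance)`, which relies on short-circuit evaluation to skip the division for a zero reference.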
Summary
Extends Olive's evaluator framework with three vision-oriented accuracy sub-metrics for VQA, ChartQA, and OCR evaluation, following the existing pattern used for speech metrics (PR #2444).
New Metrics
| Metric | Task |
|---|---|
| exact_match | vision-vqa |
| relaxed_accuracy | vision-chart-qa |
| word_sort_ratio | vision-ocr |

Changes
- olive/evaluator/metric.py: Adds EXACT_MATCH, RELAXED_ACCURACY, WORD_SORT_RATIO to AccuracySubType enum
- olive/evaluator/accuracy.py: Implements the three metric classes
- olive/evaluator/olive_evaluator.py: Adds vision inference path and task↔metric validation
- olive/data/component/pre_process_data.py: Adds vision_vqa_pre_process component
- olive/data/container/huggingface_container.py: Registers vision-vqa, vision-chart-qa, vision-ocr task types
- olive/olive_config.json: Adds vision extras (pillow)
- test/evaluator/test_accuracy.py: 20 unit tests covering all new metrics

Design
ValueError
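
The task↔metric validation listed under Changes can be pictured roughly as follows. This is a sketch under assumptions, not the PR's `_validate_vision_task_metric`: the function name, the strict one-to-one mapping (taken from the New Metrics pairing), the choice to skip non-vision metrics, and the use of `ValueError` for a mismatch are all inferred rather than confirmed by the visible text.

```python
from typing import Optional

# Assumed pairing, copied from the New Metrics section of the description.
_METRIC_TO_TASK = {
    "exact_match": "vision-vqa",
    "relaxed_accuracy": "vision-chart-qa",
    "word_sort_ratio": "vision-ocr",
}


def validate_vision_task_metric(task: Optional[str], sub_type: str) -> None:
    """Raise ValueError when a vision metric is paired with a different task."""
    expected_task = _METRIC_TO_TASK.get(sub_type)
    if expected_task is None:
        # Not a vision metric, so there is nothing to validate in this sketch.
        return
    if task != expected_task:
        raise ValueError(
            f"Metric '{sub_type}' expects task '{expected_task}', got '{task}'."
        )
```

Failing early with a clear message keeps a mismatched configuration, such as `relaxed_accuracy` on an OCR task, from producing a misleading score.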