Feature/refcoco/+/g benchmark support by zhongzhouTan-coder · Pull Request #201 · AISBench/benchmark

zhongzhouTan-coder · 2026-03-18T09:40:25Z

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help you get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

PR Type

Related Issue
Fixes #(issue ID) / Relates to #(issue ID)

🔍 Motivation

This pull request introduces comprehensive support for the RefCOCO, RefCOCOg, and RefCOCOplus visual grounding datasets in the benchmarking framework. It adds dataset loaders, configuration files, and a bounding box IoU evaluator for multimodal tasks involving object localization in images. The main changes include new dataset classes, prompt templates, evaluation logic, and integration into the registry system.

📝 Modification

This pull request adds comprehensive support for the RefCOCO, RefCOCOg, and RefCOCOplus referring expression comprehension datasets to the benchmarking suite. It introduces modular and configurable dataset loaders, evaluation configurations, and prompt templates for both file-based and base64-encoded image formats. The changes also include dataset registration, utility functions for image handling, and integration with the evaluation pipeline.

Key changes include:

New Dataset Loaders and Utilities:

Added RefCOCODataset and its variants (RefCOCOgDataset, RefCOCOPlusDataset) with support for both file path and base64 image encoding, including modular image resolver strategies and methods for loading, normalizing, and expanding dataset rows.
Registered dataset classes and utility constants in the package init files for easy import and discovery.
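
The two image modes described above (file path and base64) can be sketched as interchangeable resolver strategies. This is a minimal illustration only: the class names, the `resolve` signature, and the `.jpg` file naming are assumptions, not the code in `refcoco.py`.

```python
import base64
from pathlib import Path
from typing import Protocol


class ImageResolver(Protocol):
    """Hypothetical interface: turn raw image bytes into a prompt-ready string."""

    def resolve(self, image_bytes: bytes, image_id: str) -> str: ...


class PathImageResolver:
    """Writes the image into a cache directory and returns the file path."""

    def __init__(self, cache_dir: str) -> None:
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def resolve(self, image_bytes: bytes, image_id: str) -> str:
        target = self.cache_dir / f"{image_id}.jpg"  # assumed naming scheme
        if not target.exists():  # reuse a previously cached copy
            target.write_bytes(image_bytes)
        return str(target)


class Base64ImageResolver:
    """Returns the image inline as a base64-encoded ASCII string."""

    def resolve(self, image_bytes: bytes, image_id: str) -> str:
        return base64.b64encode(image_bytes).decode("ascii")
```

A loader written against the `ImageResolver` interface can switch between modes through configuration, without branching on the mode in every row-expansion step.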

Benchmark Configuration Additions:

Added configuration files for each dataset and image encoding variant, specifying prompt templates, retriever/inferencer setup, and evaluation logic for RefCOCO, RefCOCOg, and RefCOCOplus.

Evaluation and Postprocessing Integration:

Implemented and registered a bounding box IoU evaluator (BBoxIoUEvaluator) and a postprocessing function (refcoco_bbox_postprocess) for extracting and normalizing predicted bounding boxes from model outputs.
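
As a rough illustration of what a bounding-box postprocessor has to do, the sketch below pulls the first group of four comma-separated numbers out of free-form model output. The function name `extract_bbox`, the regular expression, and the corner-ordering step are assumptions; the actual `refcoco_bbox_postprocess` may accept other formats.

```python
import re
from typing import List, Optional

_NUMBER = r"-?\d+(?:\.\d+)?"
_BBOX_PATTERN = re.compile(
    rf"({_NUMBER})\s*,\s*({_NUMBER})\s*,\s*({_NUMBER})\s*,\s*({_NUMBER})"
)


def extract_bbox(text: str) -> Optional[List[float]]:
    """Return the first x1, y1, x2, y2 group found in text, or None."""
    match = _BBOX_PATTERN.search(text)
    if match is None:
        return None  # the model produced no parsable box
    x1, y1, x2, y2 = (float(group) for group in match.groups())
    # Order the corners so that x1 <= x2 and y1 <= y2.
    return [min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)]
```

Returning `None` for unparsable output lets the evaluator count the sample as a miss instead of raising.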

These changes provide a robust and extensible foundation for benchmarking multimodal models on referring expression comprehension tasks using the RefCOCO family of datasets.

📐 Associated Test Results

[dataset usage]

[ut test]

gemini-code-assist · 2026-03-18T09:40:49Z

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the benchmarking framework's capabilities by integrating the RefCOCO family of visual grounding datasets. It provides a complete pipeline from data loading and prompt generation to evaluation, enabling comprehensive assessment of models on multimodal object localization tasks. The changes introduce new dataset classes, configuration files, and a dedicated bounding box Intersection over Union (IoU) evaluator, along with necessary post-processing logic.

Highlights

New Dataset Support: Added comprehensive support for RefCOCO, RefCOCOg, and RefCOCOplus visual grounding datasets, including dataset loaders and configuration files for various splits.
Bounding Box IoU Evaluation: Introduced a new BBoxIoUEvaluator to accurately score predicted bounding boxes against ground truth using Intersection over Union (IoU), with configurable thresholds and coordinate scaling.
Bounding Box Post-processing: Implemented a refcoco_bbox_postprocess function to extract and parse bounding box coordinates from model outputs, ensuring compatibility with the new evaluator.
Framework Integration: Integrated all new datasets and the IoU evaluator into the benchmarking framework's registry, making them discoverable and usable within the existing system.
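
For readers unfamiliar with the metric, IoU-thresholded accuracy can be sketched as follows. The function names and the default threshold of 0.5 are assumptions for illustration; `BBoxIoUEvaluator` itself may differ in interface and defaults.

```python
from typing import Sequence

Box = Sequence[float]  # (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2


def bbox_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(
    preds: Sequence[Box], refs: Sequence[Box], threshold: float = 0.5
) -> float:
    """Percentage of predictions whose IoU with the reference meets the threshold."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    if not preds:
        return 0.0
    hits = sum(bbox_iou(p, r) >= threshold for p, r in zip(preds, refs))
    return 100.0 * hits / len(preds)
```

Guarding against a zero union keeps degenerate boxes (zero width or height on both sides) from causing a division error.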

gemini-code-assist

Code Review

This pull request introduces comprehensive support for the RefCOCO, RefCOCOg, and RefCOCOplus datasets, which is a great addition to the benchmark framework. The implementation, including the new dataset loaders, the BBoxIoUEvaluator, and the associated unit tests, is well-structured and robust.

My review has identified a couple of issues. There's a critical configuration mismatch in the refcoco_plus and refcocog dataset configurations where the input_columns do not align with the variables required by the prompt template, which would cause runtime errors. I've also pointed out a minor stylistic issue concerning the lack of a final newline character in several of the newly added files.
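
A mismatch of this kind can be caught before inference with a small check. The sketch below assumes the prompt template uses `str.format`-style `{name}` placeholders; the function name and the column names in the usage note are hypothetical, not taken from the configs.

```python
import string
from typing import Iterable, Set


def missing_template_fields(template: str, input_columns: Iterable[str]) -> Set[str]:
    """Return placeholder names used in template that input_columns does not provide."""
    fields = {
        name
        for _, name, _, _ in string.Formatter().parse(template)
        if name  # skip literal text (None) and positional "{}" (empty string)
    }
    return fields - set(input_columns)
```

For example, `missing_template_fields("Locate {ref_text} in {image}", ["image", "question"])` reports `ref_text` as missing, which is the situation that would otherwise surface as a runtime error during prompt construction.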

After addressing these points, the PR should be in excellent shape.

ais_bench/benchmark/configs/datasets/refcoco_plus/refcoco_plus_gen.py

ais_bench/benchmark/configs/datasets/refcocog/refcocog_gen.py

ais_bench/benchmark/openicl/icl_evaluator/bbox_iou_evaluator.py

Copilot

Pull request overview

Adds RefCOCO-family dataset support and a bounding-box IoU evaluator to enable referring-expression grounding benchmarks (RefCOCO / RefCOCO+ / RefCOCOg) within the OpenICL evaluation pipeline.

Changes:

Introduces BBoxIoUEvaluator (registry-registered) for IoU-based accuracy scoring with optional coordinate scaling/clipping.
Adds RefCOCODataset loader plus RefCOCOPlusDataset / RefCOCOgDataset variants, along with bbox prediction postprocessing.
Adds dataset config entries and unit tests covering dataset loading, registry wiring, and evaluator behavior.
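
The optional coordinate scaling and clipping mentioned above can be sketched like this. It assumes predictions arrive on a normalized grid (here 0 to 1000, a hypothetical default) and must be mapped to pixel coordinates before IoU is computed; the function name and parameters are illustrative, not the evaluator's actual options.

```python
from typing import List, Sequence


def scale_and_clip(
    box: Sequence[float], width: int, height: int, coord_scale: float = 1000.0
) -> List[float]:
    """Map a box from a [0, coord_scale] grid to pixels, then clip it to the image."""
    x1, y1, x2, y2 = box
    scaled = [
        x1 * width / coord_scale,
        y1 * height / coord_scale,
        x2 * width / coord_scale,
        y2 * height / coord_scale,
    ]
    limits = [width, height, width, height]
    # Clamp each coordinate into [0, limit] so out-of-range predictions stay inside the image.
    return [min(max(value, 0.0), float(limit)) for value, limit in zip(scaled, limits)]
```

Clipping before scoring means a prediction that overshoots the image border is judged on the part that lies inside the image.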

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.

File	Description
`ais_bench/benchmark/openicl/icl_evaluator/bbox_iou_evaluator.py`	New IoU evaluator for bbox predictions (with scaling/clipping + detailed per-sample results).
`ais_bench/benchmark/openicl/icl_evaluator/__init__.py`	Exposes the new evaluator for import-side registration.
`ais_bench/benchmark/datasets/refcoco/refcoco.py`	New RefCOCO dataset loader + bbox postprocessor and prompt template.
`ais_bench/benchmark/datasets/refcoco/refcoco_plus.py`	RefCOCO+ dataset variant registered via inheritance.
`ais_bench/benchmark/datasets/refcoco/refcoco_g.py`	RefCOCOg dataset variant registered via inheritance.
`ais_bench/benchmark/datasets/refcoco/__init__.py`	RefCOCO package exports.
`ais_bench/benchmark/datasets/__init__.py`	Imports RefCOCO package to ensure registration.
`ais_bench/benchmark/configs/datasets/refcoco/refcoco_gen.py`	New RefCOCO generation config using `BBoxIoUEvaluator` + bbox postprocessor.
`ais_bench/benchmark/configs/datasets/refcocog/refcocog_gen.py`	New RefCOCOg generation config.
`ais_bench/benchmark/configs/datasets/refcoco_plus/refcoco_plus_gen.py`	New RefCOCO+ generation config.
`tests/UT/openicl/icl_evaluator/test_bbox_iou_evaluator.py`	Unit tests for IoU evaluator scoring, clipping, error paths, and registry registration.
`tests/UT/datasets/refcoco/test_refcoco.py`	Unit tests for RefCOCO loader behavior and bbox postprocessor registration.
`tests/UT/datasets/refcoco/test_refcoco_plus.py`	Unit tests for RefCOCO+ delegation + registry registration.
`tests/UT/datasets/refcoco/test_refcocog.py`	Unit tests for RefCOCOg delegation + registry registration.

ais_bench/benchmark/datasets/refcoco/refcoco.py

tests/UT/datasets/refcoco/test_refcocog.py

tests/UT/datasets/refcoco/test_refcoco_plus.py

tests/UT/datasets/refcoco/test_refcoco.py

ais_bench/benchmark/datasets/refcoco/refcoco.py

ais_bench/benchmark/configs/datasets/refcocog/refcocog_gen.py

ais_bench/benchmark/configs/datasets/refcoco_plus/refcoco_plus_gen.py

ais_bench/benchmark/configs/datasets/refcoco/refcoco_gen.py

ais_bench/benchmark/configs/datasets/refcoco_plus/refcoco_plus_gen.py

ais_bench/benchmark/configs/datasets/refcocog/refcocog_gen.py

ais_bench/benchmark/datasets/refcoco/refcoco.py

Copilot

Pull request overview

Adds RefCOCO/RefCOCO+/RefCOCOg visual grounding dataset support to the ais_bench benchmarking framework, including dataset loaders, evaluation via bbox IoU, prompt/postprocessing integration, and accompanying unit tests.

Changes:

Introduce RefCOCO-family dataset loaders with image path/base64 modes and bbox-answer normalization.
Add BBoxIoUEvaluator for IoU-thresholded accuracy plus framework registry integration.
Add dataset config presets (path + base64 variants) and unit tests for loaders/evaluator/registry wiring.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

File	Description
tests/UT/openicl/icl_evaluator/test_bbox_iou_evaluator.py	Unit tests for IoU evaluator scoring, scaling/clipping, invalid cases, and registry registration.
tests/UT/datasets/refcoco/test_refcocog.py	Unit tests for RefCOCOg loader delegation and registry registration.
tests/UT/datasets/refcoco/test_refcoco.py	Unit tests for RefCOCO loader row expansion, image handling (path/base64), and bbox postprocessor registration.
tests/UT/datasets/refcoco/test_refcoco_plus.py	Unit tests for RefCOCO+ loader delegation and registry registration.
ais_bench/benchmark/openicl/icl_evaluator/bbox_iou_evaluator.py	New IoU-based evaluator for bbox predictions with scaling/clipping and per-sample details.
ais_bench/benchmark/openicl/icl_evaluator/__init__.py	Expose `BBoxIoUEvaluator` from evaluator package.
ais_bench/benchmark/datasets/refcoco/refcoco.py	Core RefCOCO loader, image resolver strategies (path/base64), bbox postprocessor, and prompt generation.
ais_bench/benchmark/datasets/refcoco/refcoco_plus.py	RefCOCO+ dataset class reusing RefCOCO loader.
ais_bench/benchmark/datasets/refcoco/refcoco_g.py	RefCOCOg dataset class reusing RefCOCO loader.
ais_bench/benchmark/datasets/refcoco/__init__.py	Package exports for RefCOCO datasets and helpers.
ais_bench/benchmark/datasets/__init__.py	Register RefCOCO datasets into datasets package star-imports.
ais_bench/benchmark/configs/datasets/refcocog/refcocog_gen.py	RefCOCOg generation config (file-path images).
ais_bench/benchmark/configs/datasets/refcocog/refcocog_gen_base64.py	RefCOCOg generation config (base64 images).
ais_bench/benchmark/configs/datasets/refcoco/refcoco_gen.py	RefCOCO generation config (file-path images).
ais_bench/benchmark/configs/datasets/refcoco/refcoco_gen_base64.py	RefCOCO generation config (base64 images).
ais_bench/benchmark/configs/datasets/refcoco_plus/refcoco_plus_gen.py	RefCOCO+ generation config (file-path images).
ais_bench/benchmark/configs/datasets/refcoco_plus/refcoco_plus_gen_base64.py	RefCOCO+ generation config (base64 images).

ais_bench/benchmark/openicl/icl_evaluator/bbox_iou_evaluator.py

ais_bench/benchmark/datasets/refcoco/refcoco.py

ais_bench/benchmark/openicl/icl_evaluator/bbox_iou_evaluator.py

tests/UT/datasets/refcoco/test_refcoco.py

Copilot

Pull request overview

This PR adds RefCOCO-family visual grounding benchmark support to the AISBench benchmarking framework, including dataset loaders, configs, and an IoU-based bounding-box evaluator for localization-style multimodal tasks.

Changes:

Added RefCOCODataset loader with image caching/base64 options plus a bbox extraction postprocessor; introduced RefCOCOgDataset / RefCOCOPlusDataset variants.
Added BBoxIoUEvaluator to score predicted boxes against ground-truth boxes using IoU.
Added dataset config files (path + base64 variants) and registered new datasets/evaluator via package __init__ imports.
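
Registration through package `__init__` imports works because a registry decorator runs when the defining module is imported. The sketch below shows the general pattern only; `Registry`, `EVALUATORS`, and `DemoIoUEvaluator` are hypothetical names and not the AISBench registry API.

```python
from typing import Dict


class Registry:
    """Minimal name-to-class registry (illustrative only)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._modules: Dict[str, type] = {}

    def register(self, cls: type) -> type:
        """Class decorator: record cls under its class name and return it unchanged."""
        if cls.__name__ in self._modules:
            raise KeyError(f"{cls.__name__} is already registered in {self.name}")
        self._modules[cls.__name__] = cls
        return cls

    def get(self, key: str) -> type:
        return self._modules[key]


EVALUATORS = Registry("evaluators")


@EVALUATORS.register  # executes at import time, which is why __init__ must import the module
class DemoIoUEvaluator:
    """Placeholder standing in for a real evaluator class."""
```

If the package `__init__` never imports the module that defines the class, the decorator never runs and a config that refers to the class by name cannot resolve it.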

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

File	Description
ais_bench/benchmark/openicl/icl_evaluator/bbox_iou_evaluator.py	New IoU evaluator for bbox localization tasks.
ais_bench/benchmark/openicl/icl_evaluator/__init__.py	Exposes the new evaluator for discovery/import.
ais_bench/benchmark/datasets/refcoco/refcoco.py	Core RefCOCO loader + bbox postprocessor + image resolution logic.
ais_bench/benchmark/datasets/refcoco/refcoco_g.py	RefCOCOg dataset variant (reuses RefCOCO loader).
ais_bench/benchmark/datasets/refcoco/refcoco_plus.py	RefCOCOPlus dataset variant (reuses RefCOCO loader).
ais_bench/benchmark/datasets/refcoco/__init__.py	Re-exports new datasets/constants/postprocessor.
ais_bench/benchmark/datasets/__init__.py	Imports RefCOCO module to register/expose datasets.
ais_bench/benchmark/configs/datasets/refcoco/refcoco_gen.py	RefCOCO path-based generation config wired to bbox evaluator.
ais_bench/benchmark/configs/datasets/refcoco/refcoco_gen_base64.py	RefCOCO base64 generation config wired to bbox evaluator.
ais_bench/benchmark/configs/datasets/refcocog/refcocog_gen.py	RefCOCOg path-based generation config.
ais_bench/benchmark/configs/datasets/refcocog/refcocog_gen_base64.py	RefCOCOg base64 generation config.
ais_bench/benchmark/configs/datasets/refcoco_plus/refcoco_plus_gen.py	RefCOCOPlus path-based generation config.
ais_bench/benchmark/configs/datasets/refcoco_plus/refcoco_plus_gen_base64.py	RefCOCOPlus base64 generation config.

ais_bench/benchmark/datasets/refcoco/refcoco.py

ais_bench/benchmark/openicl/icl_evaluator/bbox_iou_evaluator.py

zhongzhouTan-coder added 5 commits March 18, 2026 17:30

[feature] add refcoco support

59facd2

[feat] save image to local disk instead of storing in share memory

f003fbd

[feature] add refcoco plus support

224c731

[feature] add refcocog support

8d180d5

[refactor] use the more general dir name for the saving images

8aaa9fa

Copilot AI review requested due to automatic review settings March 18, 2026 09:40

zhongzhouTan-coder temporarily deployed to smoke-test-approval March 18, 2026 09:40 — with GitHub Actions Inactive

github-actions bot added the feature label Mar 18, 2026

zhongzhouTan-coder requested a review from GaoHuaZhang March 18, 2026 09:40

zhongzhouTan-coder requested a review from SJTUyh March 18, 2026 09:40

Copilot started reviewing on behalf of zhongzhouTan-coder March 18, 2026 09:41

gemini-code-assist bot reviewed Mar 18, 2026

ais_bench/benchmark/configs/datasets/refcoco_plus/refcoco_plus_gen.py (outdated, resolved)

ais_bench/benchmark/configs/datasets/refcocog/refcocog_gen.py (outdated, resolved)

ais_bench/benchmark/openicl/icl_evaluator/bbox_iou_evaluator.py (resolved)

Copilot AI reviewed Mar 18, 2026

zhongzhouTan-coder requested a deployment to smoke-test-approval March 19, 2026 01:36 — with GitHub Actions In progress

SJTUyh reviewed Mar 19, 2026

ais_bench/benchmark/configs/datasets/refcoco/refcoco_gen.py (resolved)

ais_bench/benchmark/configs/datasets/refcoco_plus/refcoco_plus_gen.py (resolved)

SJTUyh reviewed Mar 19, 2026

ais_bench/benchmark/configs/datasets/refcocog/refcocog_gen.py (resolved)

ais_bench/benchmark/datasets/refcoco/refcoco.py (outdated, resolved)

[feature] add refcoco/+/g base64 support

b2f1bac

Copilot AI review requested due to automatic review settings March 19, 2026 11:28

zhongzhouTan-coder force-pushed the feature/refcoco branch from 41f9c02 to e2ce2b5 Compare March 19, 2026 11:28

zhongzhouTan-coder temporarily deployed to smoke-test-approval March 19, 2026 11:28 — with GitHub Actions Inactive

Copilot AI reviewed Mar 19, 2026

zhongzhouTan-coder force-pushed the feature/refcoco branch from e2ce2b5 to b2f1bac Compare March 20, 2026 01:57

zhongzhouTan-coder temporarily deployed to smoke-test-approval March 20, 2026 01:57 — with GitHub Actions Inactive

[refactor] avoid index error and type error to raise to user

7d6802e

Copilot AI review requested due to automatic review settings March 20, 2026 02:17

zhongzhouTan-coder temporarily deployed to smoke-test-approval March 20, 2026 02:17 — with GitHub Actions Inactive

Copilot started reviewing on behalf of zhongzhouTan-coder March 20, 2026 02:17

Copilot AI reviewed Mar 20, 2026

ais_bench/benchmark/datasets/refcoco/refcoco.py (resolved)

ais_bench/benchmark/openicl/icl_evaluator/bbox_iou_evaluator.py (resolved)

[refactor] remove unused image and question, also move the prompt to datasets config

771f195

zhongzhouTan-coder temporarily deployed to smoke-test-approval March 20, 2026 07:16 — with GitHub Actions Inactive

3 participants
