Longteng Guo*, Xuanxu Lin*, Dongze Hao*, Tongtian Yue, Pengkang Huo, Jiatong Ma, Yuchen Liu, Jing Liu (*Equal Contribution)
Scientific reasoning is a key aspect of human intelligence, requiring the integration of multimodal inputs, domain expertise, and multi-step inference across various subjects. Existing benchmarks for multimodal large language models (MLLMs) often fail to capture the complexity and traceability of reasoning processes necessary for rigorous evaluation. To fill this gap, we introduce SciVQR, a multimodal benchmark covering 54 subfields in mathematics, physics, chemistry, geography, astronomy, and biology. SciVQR includes domain-specific visuals, such as equations, charts, and diagrams, and challenges models to combine visual comprehension with reasoning. The tasks range from basic factual recall to complex, multi-step inferences, with 46% including expert-authored solutions. SciVQR not only evaluates final answers but also examines the reasoning process, providing insights into how models reach their conclusions. Our evaluation of leading MLLMs, including both proprietary and open-source models, reveals significant limitations in handling complex multimodal reasoning tasks, underscoring the need for improved multi-step reasoning and better integration of interdisciplinary knowledge in advancing MLLMs toward true scientific intelligence.
SciVQR covers 6 core scientific domains: Mathematics, Physics, Chemistry, Geography, Astronomy, and Biology. It challenges models to integrate fine-grained visual understanding, deep subject knowledge, and sophisticated reasoning.
Note: This repository contains only the evaluation code. The full dataset is hosted on Hugging Face.
The dataset contains 3,254 multimodal questions, with 46% accompanied by detailed, expert-authored solution traces. You can easily load the dataset using the datasets library:
```python
from datasets import load_dataset

dataset = load_dataset("l205/SciVQR", split="train")
```
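Each example can then be inspected directly. The field names in the snippet below (`question`, `answer`) are assumptions for illustration; consult the dataset card on Hugging Face for the actual schema.

```python
# Inspect a single example. The field names used here ("question", "answer")
# are illustrative assumptions; print sample.keys() to see the actual schema.
sample = dataset[0]
print(sample.keys())           # list the fields actually present in the dataset
print(sample.get("question"))  # the question text, if such a field exists
print(sample.get("answer"))    # the reference answer, if such a field exists
```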
We provide three core scripts to comprehensively evaluate MLLMs on the SciVQR benchmark:

- `evaluate_multichoice.py`: rule-based and symbolic-equivalence evaluation of multiple-choice questions (see the sketch below).
- `evaluate_open.py`: evaluation of open-ended, free-form questions using an LLM-as-a-judge approach.
- `evaluate_reasoning.py`: fine-grained evaluation of chain-of-thought (CoT) reasoning quality across 5 dimensions (Faithfulness, Informativeness, Redundancy, Hallucination, Missing Steps).
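To illustrate the kind of check `evaluate_multichoice.py` performs, here is a minimal sketch of combining a rule-based exact match with a SymPy symbolic-equivalence fallback. The function name, normalization rules, and fallback logic are assumptions for illustration, not the script's actual implementation.

```python
# Minimal sketch of a rule-based + symbolic equivalence check.
# The normalization and fallback logic here are illustrative assumptions,
# not the exact behavior of evaluate_multichoice.py.
from sympy import simplify, sympify


def answers_match(prediction: str, reference: str) -> bool:
    """Return True if the predicted answer matches the reference.

    First tries exact string matching after basic normalization, then falls
    back to checking whether the two expressions are symbolically equivalent.
    """
    norm_pred = prediction.strip().lower()
    norm_ref = reference.strip().lower()
    if norm_pred == norm_ref:  # rule-based match, e.g. option letter "B"
        return True
    try:
        # Symbolic check: expressions are equivalent if their difference simplifies to 0
        return simplify(sympify(prediction) - sympify(reference)) == 0
    except Exception:
        return False


print(answers_match("2*x + 2", "2*(x + 1)"))  # True: symbolically equivalent
print(answers_match("B", "b"))                # True: exact match after normalization
```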