VQA-Probe: Diagnostic Probes for Visual QA

A diagnostic toolkit for understanding what your VQA model actually fails at.

Overview

Aggregate VQA accuracy is a famously misleading metric. A model that scores 65% on VQAv2 might be near-perfect at colour identification and yet entirely fail at counting beyond four, or silently invert its answer whenever the question is negated. VQA-Probe is a diagnostic toolkit that decomposes "VQA performance" into a small number of targeted probes, each measuring one specific capability under controlled conditions.

Each probe in VQA-Probe owns its own synthetic data generator, its own prompt template, and its own scoring rule. Probes that need it also emit paired examples (negation, counterfactual) so that consistency — not just accuracy — can be measured. Out of the box the toolkit ships synthetic probes that run on CPU in seconds; for paper-scale evaluation it can also load TextVQA, VQAv2 and GQA via the datasets library.

The library is deliberately small: ~1.5k lines of Python with a clean base class. Adding a new probe is a 40-line exercise (see docs/adding_a_probe.md). The CLI, runner and analysis layer are model-agnostic — we ship wrappers for HuggingFace vision-language models and the major API providers.
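
To make the "synthetic data generator" idea concrete, here is a minimal sketch of what a seeded dot-grid generator for a counting probe could look like. It is not the toolkit's actual generator: the function name `make_dot_grid`, the 8x8 grid, and the returned dictionary layout are all assumptions, and rendering the cells to an image is left out so the sketch needs only the standard library.

```python
import random
from typing import Dict, List, Tuple


def make_dot_grid(count: int, grid: int = 8, seed: int = 0) -> Dict[str, object]:
    """Place `count` dots on distinct cells of a grid x grid board (hypothetical helper)."""
    if not 0 <= count <= grid * grid:
        raise ValueError("count must fit on the grid")
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    cells: List[Tuple[int, int]] = [(r, c) for r in range(grid) for c in range(grid)]
    dots = sorted(rng.sample(cells, count))  # sampling without replacement: no overlaps
    return {
        "dots": dots,
        "question": "How many dots are in the image?",
        "answer": str(count),
    }


example = make_dot_grid(count=5, seed=0)
```

Because the generator takes an explicit seed, the same call always yields the same layout, which is what makes controlled, repeatable probing possible.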

Probe Suite

Probe	Capability tested	Dataset	Metric	# examples
`counting`	object counting	synthetic dot grids	exact-match + soft	500
`color`	colour attribution	synthetic shapes	token match	300
`spatial`	spatial relations	synthetic 2-object scenes	relation match	400
`negation`	negation sensitivity	yes/no pairs	accuracy + consistency	300 (x2)
`ocr`	reading text in images	rendered strings / TextVQA	string CER	500
`counterfactual`	causal visual grounding	image-perturbed pairs	accuracy + consistency	250 (x2)

The default configs/all_probes.yaml runs all six against a single model in under five minutes on a single A100.
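
The `ocr` row lists character error rate (CER) as its metric. As a rough illustration, CER is commonly computed as the Levenshtein edit distance between prediction and reference, divided by the reference length. The sketch below follows that common definition; whether VQA-Probe normalizes strings first or handles empty references the same way is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def cer(prediction: str, reference: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        # Assumed convention: an empty reference is only matched by an empty prediction.
        return 0.0 if not prediction else 1.0
    return levenshtein(prediction, reference) / len(reference)
```

Lower is better, and the value can exceed 1.0 when the prediction is much longer than the reference.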

Architecture

        +---------------+        +---------+        +-------+
        |   dataset /   |  -->   |  probe  |  -->   | model |
        |   generator   |        +---------+        +-------+
        +---------------+              |                 |
                                       v                 v
                                +--------------+    +----------+
                                | perturbation |    | answer   |
                                | (cf / negate)|    +----------+
                                +------+-------+         |
                                       |                 |
                                       v                 v
                                  +--------------------------+
                                  |   consistency scorer     |
                                  |   error taxonomy + plots |
                                  +--------------------------+

The Runner walks every example, applies the probe's prompt template, queries the model, scores the prediction, and finally hands the full list of ProbeResult records to the consistency layer.
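
The loop described above can be sketched roughly as follows. This is a simplified stand-in, not the real Runner: `Example`, `Result`, and `run_probe` are hypothetical names, images are omitted, and scoring is reduced to a case-insensitive exact match.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Example:  # hypothetical stand-in for ProbeExample (image omitted)
    question: str
    answer: str


@dataclass
class Result:  # hypothetical stand-in for ProbeResult
    probe: str
    prediction: str
    correct: bool


def run_probe(name: str,
              examples: Iterable[Example],
              template: str,
              model: Callable[[str], str]) -> List[Result]:
    """For each example: build the prompt, query the model, score, keep the record."""
    results = []
    for ex in examples:
        prompt = template.format(question=ex.question)
        prediction = model(prompt)
        correct = prediction.strip().lower() == ex.answer.lower()
        results.append(Result(name, prediction, correct))
    return results


def always_yes(prompt: str) -> str:
    """Stub model for illustration: answers "Yes" to everything."""
    return "Yes"


examples = [Example("Is the square red?", "yes"),
            Example("Is the square blue?", "no")]
results = run_probe("color", examples, "Question: {question} Answer:", always_yes)
accuracy = sum(r.correct for r in results) / len(results)
```

Keeping every record, rather than only a running accuracy, is what lets a later stage compare the two halves of a pair.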

Installation

pip install vqaprobe
# or, from source:
git clone https://github.com/cortsdine/vqa-probe.git
cd vqa-probe
pip install -e ".[dev]"

For API model wrappers (OpenAI / Anthropic):

pip install -e ".[api]"

VQA-Probe targets Python 3.9+ and is tested on 3.9 / 3.10 / 3.11.

Quick Start

Run a single probe against a HuggingFace model:

from vqaprobe.models import HFVQAModel
from vqaprobe.runner import Runner, RunConfig

model = HFVQAModel("Salesforce/blip2-opt-2.7b", device="cuda", dtype="float16")
runner = Runner(model, RunConfig(probes=["counting"], n_examples=200))
out = runner.run()
print(out["summary"])

Run the full probe suite from a config file:

vqaprobe run -c configs/blip2.yaml -o outputs/blip2.json
vqaprobe analyse outputs/blip2.json

Define a custom probe inline:

from vqaprobe.probes.base import Probe, ProbeExample
from PIL import Image


class YesNoProbe(Probe):
    name = "yesno"
    capability = "binary"

    def load_examples(self):
        for i in range(self.n_examples):
            yield ProbeExample(image=Image.new("RGB", (32, 32), "white"),
                                question="Is the image blank?",
                                answer="yes",
                                meta={"id": f"yn-{i:04d}"})

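    def score(self, prediction, example):
        # Every example in this probe expects "yes", so a prefix check suffices.
        ok = self.normalize(prediction).startswith("yes")
        # Return (is_correct, numeric score).
        return ok, 1.0 if ok else 0.0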

Adding a New Probe

Adding a probe takes three steps:

1. Subclass Probe, implementing load_examples() and score().
2. Register it in PROBE_REGISTRY in vqaprobe/probes/__init__.py.
3. Document it (one paragraph in docs/probes.md) and add a smoke test in tests/test_probes.py.

A full walk-through with an annotated template lives in docs/adding_a_probe.md.
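
For readers unfamiliar with the registry pattern behind step 2, here is a minimal sketch. It assumes PROBE_REGISTRY is a plain dict mapping a probe's `name` to its class; the `register` decorator and `get_probe` helper are hypothetical and may not match how the toolkit actually populates the registry.

```python
from typing import Dict

# Assumed shape: probe name -> probe class.
PROBE_REGISTRY: Dict[str, type] = {}


def register(cls: type) -> type:
    """Class decorator (hypothetical): index a probe class by its `name` attribute."""
    name = getattr(cls, "name", None)
    if not name:
        raise ValueError(f"{cls.__name__} must define a non-empty `name`")
    if name in PROBE_REGISTRY:
        raise ValueError(f"duplicate probe name: {name!r}")
    PROBE_REGISTRY[name] = cls
    return cls


@register
class YesNoProbe:  # stand-in class; a real probe would subclass Probe
    name = "yesno"
    capability = "binary"


def get_probe(name: str) -> type:
    """Look up a probe class, listing the known names on failure."""
    try:
        return PROBE_REGISTRY[name]
    except KeyError:
        known = sorted(PROBE_REGISTRY)
        raise KeyError(f"unknown probe {name!r}; known: {known}") from None
```

Rejecting duplicate names at registration time catches copy-paste mistakes when a new probe is derived from an existing file.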

Reproducing Paper Results

The numbers in the paper were produced with the following sequence:

# 1. Build the real-data shards (TextVQA + VQAv2 counting slices)
python scripts/build_probe_data.py --out-dir data/cache --limit 5000

# 2. Run every config in configs/
bash scripts/run_full_eval.sh

# 3. Render the per-model heatmap + error taxonomy plots
jupyter nbconvert --execute notebooks/analysis_demo.ipynb

All runs use seed 0; rerunning should reproduce the published numbers to within ±0.4 percentage points (absolute) on every probe.
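
One way to check a rerun against that tolerance is sketched below. The helper `out_of_tolerance` is hypothetical and not part of the toolkit. The PUBLISHED values are copied from the BLIP-2 (OPT-2.7B) row of the Sample Results table purely as placeholders, and the RERUN values are made up for illustration.

```python
from typing import Dict, Optional, Tuple

# Placeholder reference numbers: BLIP-2 (OPT-2.7B) row of the Sample Results table.
PUBLISHED = {"counting": 41.3, "color": 92.1, "spatial": 71.4,
             "negation": 38.9, "ocr": 22.7, "counterfactual": 45.8}

# Made-up rerun numbers, for illustration only.
RERUN = {"counting": 41.6, "color": 92.1, "spatial": 72.0,
         "negation": 39.0, "ocr": 22.5, "counterfactual": 46.0}


def out_of_tolerance(reproduced: Dict[str, float],
                     published: Dict[str, float],
                     tol: float = 0.4) -> Dict[str, Tuple[Optional[float], float]]:
    """Return {probe: (reproduced, published)} for probes that drifted or are missing."""
    drift = {}
    for probe, expected in published.items():
        got = reproduced.get(probe)
        # Small epsilon so a difference of exactly `tol` is not flagged by float noise.
        if got is None or abs(got - expected) > tol + 1e-9:
            drift[probe] = (got, expected)
    return drift


drift = out_of_tolerance(RERUN, PUBLISHED)
```

An empty result means every probe landed inside the band; here only `spatial` is flagged, since it differs by 0.6 points.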

Sample Results

Accuracy (%) of four open VQA models on the synthetic probe suite, n=300 per probe, seed=0:

Model	counting	color	spatial	negation	ocr	counterfactual
BLIP-2 (OPT-2.7B)	41.3	92.1	71.4	38.9	22.7	45.8
BLIP-2 (FLAN-T5-XL)	46.5	93.4	75.2	44.1	24.5	49.3
InstructBLIP (Vicuna)	52.0	91.8	78.6	53.2	41.2	58.0
LLaVA-1.5 (7B)	48.7	94.5	81.9	45.6	67.4	57.1

Pair consistency (both halves correct) on the paired probes:

Model	negation	counterfactual
BLIP-2 (OPT-2.7B)	11.4	18.2
BLIP-2 (FLAN-T5-XL)	14.0	22.5
InstructBLIP (Vicuna)	21.7	29.4
LLaVA-1.5 (7B)	17.9	28.0
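
To show how pair consistency differs from plain accuracy, here is a minimal sketch that computes both from per-example records. The record layout, a `(pair_id, correct)` tuple per half, is an assumption rather than the toolkit's ProbeResult format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def accuracy_and_consistency(records: Iterable[Tuple[str, bool]]) -> Tuple[float, float]:
    """Return (accuracy over all halves, fraction of pairs with both halves correct)."""
    pairs: Dict[str, List[bool]] = defaultdict(list)
    for pair_id, correct in records:
        pairs[pair_id].append(correct)
    if not pairs:
        raise ValueError("no records")
    for pair_id, flags in pairs.items():
        if len(flags) != 2:
            raise ValueError(f"pair {pair_id!r} has {len(flags)} halves, expected 2")
    halves = [c for flags in pairs.values() for c in flags]
    accuracy = sum(halves) / len(halves)
    consistency = sum(all(flags) for flags in pairs.values()) / len(pairs)
    return accuracy, consistency


# A model that answers "yes" to everything on four negation pairs:
# right on the original question, wrong on the negated one.
always_yes = [(f"p{i}", correct) for i in range(4) for correct in (True, False)]
acc, cons = accuracy_and_consistency(always_yes)
```

In this toy case accuracy is 0.5 while consistency is 0.0, which is the gap the paired probes are designed to expose.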

Failure Mode Analysis

A few patterns stand out from the per-probe and consistency numbers:

All four models collapse on counting above 5. The per-count breakdown shows accuracy above 85% for counts in {0..3} and below 25% for counts >= 6, so the headline counting accuracy (41-52% across the four models) hides a sharp cliff.
BLIP-2 ignores negation. Pair consistency on the negation probe is 11.4% (OPT-2.7B) and 14.0% (FLAN-T5-XL), against headline accuracies of 38.9% and 44.1%, which suggests the model tends to give the same yes/no answer to both halves of a pair regardless of polarity.
LLaVA-1.5 dominates OCR but lags on causal sensitivity. Despite a ~3x lead over BLIP-2 on the OCR probe (67.4 vs 22.7), its counterfactual pair consistency (28.0) is no better than InstructBLIP's (29.4): it reads pixels well but does not reliably re-ground its answer when those pixels change.
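
The per-count breakdown mentioned in the first observation can be reproduced with a simple group-by. The sketch below assumes records of the form `(true_count, predicted_count)`; the helper name and the toy data are made up for illustration and are not results from any model.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_count(records: Iterable[Tuple[int, int]]) -> Dict[int, float]:
    """Exact-match accuracy grouped by the ground-truth count."""
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for true_count, predicted in records:
        totals[true_count] += 1
        hits[true_count] += int(predicted == true_count)
    return {c: hits[c] / totals[c] for c in sorted(totals)}


# Toy records (made up): small counts mostly right, large counts mostly wrong.
toy = [(2, 2), (2, 2), (3, 3), (3, 2), (7, 5), (7, 7), (7, 4), (7, 6)]
per_count = accuracy_by_count(toy)
overall = sum(p == t for t, p in toy) / len(toy)
```

On the toy data the overall accuracy is 0.5, while the per-count view splits it into 1.0 for count 2, 0.5 for count 3, and 0.25 for count 7.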

CLI Reference

Command	Purpose
`vqaprobe run -c CFG -o OUT`	Run a probe suite defined in YAML config
`vqaprobe run -c CFG --limit N`	Override `n_examples` for a smoke test
`vqaprobe list-probes`	Print registered probes + capabilities
`vqaprobe analyse RESULTS.json`	Pretty-print a saved results file

All commands respect VQAPROBE_LOG=DEBUG for verbose logging.

Project Structure

vqaprobe/
  probes/        # one file per probe + base class
  perturbations/ # image / text perturbations used by paired probes
  models/        # HuggingFace + API model wrappers
  analysis/      # consistency, error taxonomy, plots
  data/          # synthetic generators + balanced samplers
  utils/         # io, prompts, logging
  runner.py      # orchestrates probe x model
  cli.py         # click entry point
configs/         # YAML run configs (per model)
scripts/         # data builders + full eval driver
docs/            # probe catalogue + how-to-add-a-probe
tests/           # pytest smoke tests
notebooks/       # exploratory analysis notebooks

Citation

@article{chen2025vqaprobe,
  title={VQA-Probe: Diagnostic Probes for Visual Question Answering Models},
  author={Chen, Ruijie},
  journal={arXiv preprint arXiv:2025.14271},
  year={2025}
}

Acknowledgments

The HuggingFace datasets and transformers teams for making large-scale multimodal evaluation tractable.
PyTorch and PIL for the underlying numerical / image plumbing.
The Tsinghua NLP lab for the compute that produced the reported numbers.

License

BSD 3-Clause. See LICENSE for the full text.

The synthetic probes run on CPU in seconds, so no GPU is required for quick iteration.

Tests run on 3.9 / 3.10 / 3.11 via GitHub Actions.

Report reproducibility issues at https://github.com/cortsdine/vqa-probe/issues.

v0.4.1 — May 2026
