cortsdine/vqa-probe

VQA-Probe: Diagnostic Probes for Visual QA

A diagnostic toolkit for understanding what your VQA model actually fails at.

Overview

Aggregate VQA accuracy is a famously misleading metric. A model that scores 65% on VQAv2 might be near-perfect at colour identification and yet entirely fail at counting beyond four, or silently invert its answer whenever the question is negated. VQA-Probe is a diagnostic toolkit that decomposes "VQA performance" into a small number of targeted probes, each measuring one specific capability under controlled conditions.

Each probe in VQA-Probe owns its own synthetic data generator, its own prompt template, and its own scoring rule. Probes that need it also emit paired examples (negation, counterfactual) so that consistency — not just accuracy — can be measured. Out of the box the toolkit ships synthetic probes that run on CPU in seconds; for paper-scale evaluation it can also load TextVQA, VQAv2 and GQA via the datasets library.

The library is deliberately small: ~1.5k lines of Python with a clean base class. Adding a new probe is a 40-line exercise (see docs/adding_a_probe.md). The CLI, runner and analysis layer are model-agnostic — we ship wrappers for HuggingFace vision-language models and the major API providers.

Probe Suite

| Probe          | Capability tested       | Dataset                    | Metric                 | # examples |
|----------------|-------------------------|----------------------------|------------------------|------------|
| counting       | object counting         | synthetic dot grids        | exact-match + soft     | 500        |
| color          | colour attribution      | synthetic shapes           | token match            | 300        |
| spatial        | spatial relations       | synthetic 2-object scenes  | relation match         | 400        |
| negation       | negation sensitivity    | yes/no pairs               | accuracy + consistency | 300 (x2)   |
| ocr            | reading text in images  | rendered strings / TextVQA | string CER             | 500        |
| counterfactual | causal visual grounding | image-perturbed pairs      | accuracy + consistency | 250 (x2)   |

The default configs/all_probes.yaml runs all six against a single model in under five minutes on a single A100.
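
The "string CER" metric used by the ocr probe is not defined in this README. A common definition of character error rate is the Levenshtein edit distance between prediction and reference, divided by the reference length. The sketch below assumes that definition; the function names are hypothetical, and the toolkit's own scorer may normalise strings differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert / delete / substitute)."""
    # prev[j] holds the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(prediction: str, reference: str) -> float:
    """CER = edit distance / reference length (hypothetical helper, not the toolkit's API)."""
    if not reference:
        # Convention assumed here: an empty reference is only matched by an empty prediction.
        return 0.0 if not prediction else 1.0
    return edit_distance(prediction, reference) / len(reference)


print(char_error_rate("he1lo", "hello"))  # one substitution over five characters
```

Lower is better, and a CER above 1.0 is possible when the prediction is much longer than the reference.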

Architecture

        +---------------+        +---------+        +-------+
        |   dataset /   |  -->   |  probe  |  -->   | model |
        |   generator   |        +---------+        +-------+
        +---------------+              |                 |
                                       v                 v
                                +--------------+    +----------+
                                | perturbation |    | answer   |
                                | (cf / negate)|    +----------+
                                +------+-------+         |
                                       |                 |
                                       v                 v
                                  +--------------------------+
                                  |   consistency scorer     |
                                  |   error taxonomy + plots |
                                  +--------------------------+

The Runner walks every example, applies the probe's prompt template, queries the model, scores the prediction, and finally hands the full list of ProbeResult records to the consistency layer.
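
The loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the toolkit's Runner: the ProbeResult field names, the run_probe function, the dict-based examples and the toy model are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List, Tuple


@dataclass
class ProbeResult:
    """One scored prediction. Field names are assumptions made for this sketch."""
    probe: str
    question: str
    prediction: str
    correct: bool
    score: float
    meta: Dict[str, Any] = field(default_factory=dict)


def run_probe(
    probe_name: str,
    examples: Iterable[Dict[str, Any]],
    prompt_template: str,
    ask_model: Callable[[Any, str], str],
    score_fn: Callable[[str, Dict[str, Any]], Tuple[bool, float]],
) -> List[ProbeResult]:
    """Apply the template, query the model and score, once per example."""
    results = []
    for ex in examples:
        prompt = prompt_template.format(question=ex["question"])
        prediction = ask_model(ex["image"], prompt)
        correct, score = score_fn(prediction, ex)
        results.append(ProbeResult(probe_name, ex["question"], prediction,
                                   correct, score, ex.get("meta", {})))
    return results


# Toy stand-ins: a "model" that always answers yes, and a prefix-match scorer.
def always_yes(image: Any, prompt: str) -> str:
    return "yes"


def prefix_match(prediction: str, ex: Dict[str, Any]) -> Tuple[bool, float]:
    ok = prediction.strip().lower().startswith(ex["answer"])
    return ok, 1.0 if ok else 0.0


examples = [
    {"image": None, "question": "Is the image blank?", "answer": "yes"},
    {"image": None, "question": "Is the image not blank?", "answer": "no"},
]
results = run_probe("yesno", examples, "Question: {question} Answer:",
                    always_yes, prefix_match)
accuracy = sum(r.score for r in results) / len(results)
print(f"accuracy = {accuracy:.2f}")
```

Keeping the full list of records, rather than only a running accuracy, is what lets a later stage group results by pair or by metadata.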

Installation

pip install vqaprobe
# or, from source:
git clone https://github.com/cortsdine/vqa-probe.git
cd vqa-probe
pip install -e ".[dev]"

For API model wrappers (OpenAI / Anthropic):

pip install -e ".[api]"

VQA-Probe targets Python 3.9+ and is tested on 3.9 / 3.10 / 3.11.

Quick Start

Run a single probe against a HuggingFace model:

from vqaprobe.models import HFVQAModel
from vqaprobe.runner import Runner, RunConfig

model = HFVQAModel("Salesforce/blip2-opt-2.7b", device="cuda", dtype="float16")
runner = Runner(model, RunConfig(probes=["counting"], n_examples=200))
out = runner.run()
print(out["summary"])

Run the full probe suite from a config file:

vqaprobe run -c configs/blip2.yaml -o outputs/blip2.json
vqaprobe analyse outputs/blip2.json

Define a custom probe inline:

from vqaprobe.probes.base import Probe, ProbeExample
from PIL import Image


class YesNoProbe(Probe):
    name = "yesno"
    capability = "binary"

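    def load_examples(self):
        # Every example is the same blank white image, so the gold answer is always "yes".
        for i in range(self.n_examples):
            yield ProbeExample(
                image=Image.new("RGB", (32, 32), "white"),
                question="Is the image blank?",
                answer="yes",
                meta={"id": f"yn-{i:04d}"},
            )

    def score(self, prediction, example):
        # Returns (is_correct, score); correct if the normalised prediction starts with the gold answer.
        ok = self.normalize(prediction).startswith(example.answer)
        return ok, 1.0 if ok else 0.0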

Adding a New Probe

Adding a probe takes three steps:

  1. Subclass Probe, implementing load_examples() and score().
  2. Register it in vqaprobe/probes/__init__.py's PROBE_REGISTRY.
  3. Document it (one paragraph in docs/probes.md) and add a smoke test under tests/test_probes.py.

A full walk-through with an annotated template lives in docs/adding_a_probe.md.
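
Step 2 can be pictured with a minimal name-to-class registry. The sketch assumes PROBE_REGISTRY is a plain dict keyed by each probe's name attribute; the register_probe decorator and build_probe helper are hypothetical and are not taken from the toolkit.

```python
from typing import Dict

# Hypothetical stand-in for the toolkit's PROBE_REGISTRY: probe name -> probe class.
PROBE_REGISTRY: Dict[str, type] = {}


def register_probe(cls: type) -> type:
    """Class decorator that files a probe class under its `name` attribute."""
    name = getattr(cls, "name", None)
    if not name:
        raise ValueError(f"{cls.__name__} must define a non-empty `name`")
    if name in PROBE_REGISTRY:
        raise ValueError(f"probe {name!r} is already registered")
    PROBE_REGISTRY[name] = cls
    return cls


@register_probe
class YesNoProbe:
    # A real probe would subclass the toolkit's Probe base class.
    name = "yesno"
    capability = "binary"


def build_probe(name: str):
    """Look a probe up by name, failing with the list of known names."""
    try:
        return PROBE_REGISTRY[name]()
    except KeyError:
        known = ", ".join(sorted(PROBE_REGISTRY))
        raise KeyError(f"unknown probe {name!r}; registered: {known}") from None


probe = build_probe("yesno")
print(probe.name, probe.capability)
```

Rejecting duplicate names at registration time keeps a config entry such as probes=["counting"] unambiguous.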

Reproducing Paper Results

The numbers in the paper were produced with the following sequence:

# 1. Build the real-data shards (TextVQA + VQAv2 counting slices)
python scripts/build_probe_data.py --out-dir data/cache --limit 5000

# 2. Run every config in configs/
bash scripts/run_full_eval.sh

# 3. Render the per-model heatmap + error taxonomy plots
jupyter nbconvert --execute notebooks/analysis_demo.ipynb

All runs use seed 0; rerunning should reproduce the published numbers to within ±0.4 percentage points (absolute) on every probe.

Sample Results

Accuracy (%) of four open VQA models on the synthetic probe suite, n=300 per probe, seed=0:

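| Model                 | counting | color | spatial | negation | ocr  | counterfactual |
|-----------------------|----------|-------|---------|----------|------|----------------|
| BLIP-2 (OPT-2.7B)     | 41.3     | 92.1  | 71.4    | 38.9     | 22.7 | 45.8           |
| BLIP-2 (FLAN-T5-XL)   | 46.5     | 93.4  | 75.2    | 44.1     | 24.5 | 49.3           |
| InstructBLIP (Vicuna) | 52.0     | 91.8  | 78.6    | 53.2     | 41.2 | 58.0           |
| LLaVA-1.5 (7B)        | 48.7     | 94.5  | 81.9    | 45.6     | 67.4 | 57.1           |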

Pair consistency (both halves correct) on the paired probes:

| Model                 | negation | counterfactual |
|-----------------------|----------|----------------|
| BLIP-2 (OPT-2.7B)     | 11.4     | 18.2           |
| BLIP-2 (FLAN-T5-XL)   | 14.0     | 22.5           |
| InstructBLIP (Vicuna) | 21.7     | 29.4           |
| LLaVA-1.5 (7B)        | 17.9     | 28.0           |
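
Pair consistency, as defined above, counts a pair only when both halves are answered correctly, so it can never exceed per-example accuracy. The sketch below shows one way to compute it from (pair_id, correct) records; the record layout and function name are assumptions for illustration, and the toy records are invented rather than drawn from the tables.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pair_consistency(records: Iterable[Tuple[str, bool]]) -> float:
    """Fraction of complete pairs in which both halves were answered correctly."""
    by_pair: Dict[str, List[bool]] = defaultdict(list)
    for pair_id, correct in records:
        by_pair[pair_id].append(correct)
    # Ignore anything that is not exactly a two-example pair.
    complete = [flags for flags in by_pair.values() if len(flags) == 2]
    if not complete:
        return 0.0
    return sum(all(flags) for flags in complete) / len(complete)


# Invented toy records: three pairs, six examples.
records = [
    ("p0", True), ("p0", True),    # both halves right
    ("p1", True), ("p1", False),   # same answer given to both polarities
    ("p2", False), ("p2", False),  # both halves wrong
]
accuracy = sum(correct for _, correct in records) / len(records)
consistency = pair_consistency(records)
print(f"accuracy = {accuracy:.2f}, pair consistency = {consistency:.2f}")
```

In this toy case half of the individual answers are right, yet only one pair in three is answered consistently, which is the kind of gap the tables above report.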

Failure Mode Analysis

A few patterns stand out from the per-probe and consistency numbers:

  • All four models collapse on counting above 5. The per-count breakdown shows accuracy >85% for counts in {0..3} and <25% for counts >=6, so the headline counting accuracy (41-52% across the four models) hides a sharp cliff.
  • BLIP-2 ignores negation. Pair consistency on the negation probe sits around 12% (vs ~40% headline accuracy), meaning the model is reliably giving the same yes/no answer to both halves of each pair regardless of polarity.
  • LLaVA-1.5 dominates OCR but does not lead on causal sensitivity. Despite a ~3x lead over BLIP-2 (OPT-2.7B) on the OCR probe (67.4 vs 22.7), its counterfactual pair consistency (28.0) is no better than InstructBLIP's (29.4): it reads pixels well, but that advantage does not carry over to re-grounding its answer when those pixels change.
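
The per-count breakdown mentioned in the first bullet amounts to grouping scored examples by their ground-truth count before averaging. A minimal sketch follows; the bucket boundaries mirror the ranges quoted above, while the function names, the record layout and the toy data are hypothetical and are not results from the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def count_bucket(true_count: int) -> str:
    """Map a ground-truth count to a coarse bucket."""
    if true_count <= 3:
        return "0-3"
    if true_count <= 5:
        return "4-5"
    return "6+"


def accuracy_by_bucket(
    records: Iterable[Tuple[int, bool]],
    bucket_of: Callable[[int], str] = count_bucket,
) -> Dict[str, float]:
    """records are (true_count, correct); returns accuracy per bucket."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [n_correct, n_total]
    for true_count, correct in records:
        bucket = bucket_of(true_count)
        totals[bucket][0] += int(correct)
        totals[bucket][1] += 1
    return {bucket: c / n for bucket, (c, n) in totals.items()}


# Invented toy data: easy small counts, hard large counts.
records = [(1, True), (2, True), (3, True), (4, True),
           (5, False), (6, False), (7, False), (8, True)]
overall = sum(correct for _, correct in records) / len(records)
by_bucket = accuracy_by_bucket(records)
print(f"overall = {overall:.3f}")
for bucket in ("0-3", "4-5", "6+"):
    print(f"{bucket}: {by_bucket[bucket]:.3f}")
```

The overall figure sits between the bucket values, which is why a single headline number can mask a cliff at higher counts.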

CLI Reference

| Command                       | Purpose                                  |
|-------------------------------|------------------------------------------|
| vqaprobe run -c CFG -o OUT    | Run a probe suite defined in YAML config |
| vqaprobe run -c CFG --limit N | Override n_examples for a smoke test     |
| vqaprobe list-probes          | Print registered probes + capabilities   |
| vqaprobe analyse RESULTS.json | Pretty-print a saved results file        |

All commands respect VQAPROBE_LOG=DEBUG for verbose logging.

Project Structure

vqaprobe/
  probes/        # one file per probe + base class
  perturbations/ # image / text perturbations used by paired probes
  models/        # HuggingFace + API model wrappers
  analysis/      # consistency, error taxonomy, plots
  data/          # synthetic generators + balanced samplers
  utils/         # io, prompts, logging
  runner.py      # orchestrates probe x model
  cli.py         # click entry point
configs/         # YAML run configs (per model)
scripts/         # data builders + full eval driver
docs/            # probe catalogue + how-to-add-a-probe
tests/           # pytest smoke tests
notebooks/       # exploratory analysis notebooks

Citation

@article{chen2025vqaprobe,
  title={VQA-Probe: Diagnostic Probes for Visual Question Answering Models},
  author={Chen, Ruijie},
  journal={arXiv preprint arXiv:2025.14271},
  year={2025}
}

Acknowledgments

  • The HuggingFace datasets and transformers teams for making large-scale multimodal evaluation tractable.
  • PyTorch and PIL for the underlying numerical / image plumbing.
  • The Tsinghua NLP lab for the compute that produced the reported numbers.

License

BSD 3-Clause. See LICENSE for the full text.

The synthetic probe data runs on CPU in seconds, so no GPU is required for quick iteration.

Tests run on 3.9 / 3.10 / 3.11 via GitHub Actions.

Report reproducibility issues at https://github.com/cortsdine/vqa-probe/issues.


v0.4.1 — May 2026
