A diagnostic toolkit for understanding what your VQA model actually fails at.
Aggregate VQA accuracy is a famously misleading metric. A model that scores 65% on VQAv2 might be near-perfect at colour identification and yet entirely fail at counting beyond four, or silently invert its answer whenever the question is negated. VQA-Probe is a diagnostic toolkit that decomposes "VQA performance" into a small number of targeted probes, each measuring one specific capability under controlled conditions.
Each probe in VQA-Probe owns its own synthetic data generator, its own
prompt template, and its own scoring rule. Probes that need it also emit
paired examples (negation, counterfactual) so that consistency —
not just accuracy — can be measured. Out of the box the toolkit ships
synthetic probes that run on CPU in seconds; for paper-scale evaluation it
can also load TextVQA, VQAv2 and GQA via the datasets library.
The library is deliberately small: ~1.5k lines of Python with a clean base
class. Adding a new probe is a 40-line exercise (see
docs/adding_a_probe.md). The CLI, runner and
analysis layer are model-agnostic — we ship wrappers for HuggingFace
vision-language models and the major API providers.
| Probe | Capability tested | Dataset | Metric | # examples |
|---|---|---|---|---|
| `counting` | object counting | synthetic dot grids | exact-match + soft | 500 |
| `color` | colour attribution | synthetic shapes | token match | 300 |
| `spatial` | spatial relations | synthetic 2-object scenes | relation match | 400 |
| `negation` | negation sensitivity | yes/no pairs | accuracy + consistency | 300 (x2) |
| `ocr` | reading text in images | rendered strings / TextVQA | string CER | 500 |
| `counterfactual` | causal visual grounding | image-perturbed pairs | accuracy + consistency | 250 (x2) |
The default configs/all_probes.yaml runs all six against a single model in
under five minutes on a single A100.
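The `ocr` probe reports a string CER (character error rate). As a rough illustration of what that metric measures, here is a minimal sketch that divides the Levenshtein edit distance by the reference length. The function names `levenshtein` and `cer` are illustrative, and whether VQA-Probe normalises strings first or handles empty references this way is an assumption, not something the toolkit's docs confirm here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b; insert, delete and substitute cost 1 each."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def cer(prediction: str, reference: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        # Assumed convention: an empty reference is only matched by an empty prediction.
        return 0.0 if not prediction else 1.0
    return levenshtein(prediction, reference) / len(reference)
```

A lower CER is better, and the value can exceed 1.0 when the prediction is much longer than the reference.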

    +---------------+     +---------+     +-------+
    | dataset /     | --> |  probe  | --> | model |
    | generator     |     +---------+     +-------+
    +---------------+          |              |
                               v              v
                        +--------------+ +----------+
                        | perturbation | |  answer  |
                        | (cf / negate)| +----------+
                        +------+-------+      |
                               |              |
                               v              v
                        +--------------------------+
                        |    consistency scorer    |
                        |  error taxonomy + plots  |
                        +--------------------------+

The Runner walks every example, applies the probe's prompt template,
queries the model, scores the prediction, and finally hands the full list of
ProbeResult records to the consistency layer.
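The loop described above can be sketched in a few lines. This is a simplified stand-in and not the toolkit's `Runner`: the `Example` and `Result` dataclasses, the `run_probe` function and the prompt template string are all hypothetical, and images are omitted so the sketch runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Example:  # hypothetical stand-in for ProbeExample (no image)
    question: str
    answer: str


@dataclass
class Result:  # hypothetical stand-in for ProbeResult
    question: str
    prediction: str
    correct: bool
    score: float


def run_probe(
    examples: Iterable[Example],
    template: str,
    ask: Callable[[str], str],
    score: Callable[[str, Example], Tuple[bool, float]],
) -> List[Result]:
    """Walk every example: fill the template, query the model, score the answer."""
    results = []
    for ex in examples:
        prompt = template.format(question=ex.question)
        prediction = ask(prompt)
        ok, value = score(prediction, ex)
        results.append(Result(ex.question, prediction, ok, value))
    return results


def always_yes(prompt: str) -> str:
    """Toy model that ignores its input."""
    return "yes"


def exact_match(prediction: str, ex: Example) -> Tuple[bool, float]:
    ok = prediction.strip().lower() == ex.answer
    return ok, (1.0 if ok else 0.0)


results = run_probe(
    [Example("Is the image blank?", "yes"), Example("Is there a dog?", "no")],
    "Question: {question} Answer:",
    always_yes,
    exact_match,
)
```

The returned list is what a consistency layer would then consume, since paired probes need every result before pairs can be matched up.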

    pip install vqaprobe
    # or, from source:
    git clone https://github.com/cortsdine/vqa-probe.git
    cd vqa-probe
    pip install -e ".[dev]"

For API model wrappers (OpenAI / Anthropic):

    pip install -e ".[api]"

VQA-Probe targets Python 3.9+ and is tested on 3.9 / 3.10 / 3.11.

Run a single probe against a HuggingFace model:

    from vqaprobe.models import HFVQAModel
    from vqaprobe.runner import Runner, RunConfig

    # Load the model once, then run only the counting probe on 200 examples.
    model = HFVQAModel("Salesforce/blip2-opt-2.7b", device="cuda", dtype="float16")
    runner = Runner(model, RunConfig(probes=["counting"], n_examples=200))
    out = runner.run()
    print(out["summary"])

Run the full probe suite from a config file:

    vqaprobe run -c configs/blip2.yaml -o outputs/blip2.json
    vqaprobe analyse outputs/blip2.json

Define a custom probe inline:

    from PIL import Image

    from vqaprobe.probes.base import Probe, ProbeExample


    class YesNoProbe(Probe):
        name = "yesno"
        capability = "binary"

        def load_examples(self):
            # Every example is a blank white image whose correct answer is "yes".
            for i in range(self.n_examples):
                yield ProbeExample(
                    image=Image.new("RGB", (32, 32), "white"),
                    question="Is the image blank?",
                    answer="yes",
                    meta={"id": f"yn-{i:04d}"},
                )

        def score(self, prediction, example):
            # Returns (is_correct, numeric_score).
            ok = self.normalize(prediction).startswith("yes")
            return ok, (1.0 if ok else 0.0)

Adding a probe takes three steps:
- Subclass `Probe`, implementing `load_examples()` and `score()`.
- Register it in `vqaprobe/probes/__init__.py`'s `PROBE_REGISTRY`.
- Document it (one paragraph in `docs/probes.md`) and add a smoke test under `tests/test_probes.py`.
A full walk-through with an annotated template lives in
docs/adding_a_probe.md.
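To make the registration step concrete, here is a minimal sketch of a name-to-class registry. It assumes `PROBE_REGISTRY` is a plain dict keyed by probe name, which this README does not confirm. The `register` decorator and the stub `Probe` class are hypothetical and only keep the example self-contained.

```python
from typing import Dict, Type


class Probe:  # minimal stand-in for vqaprobe.probes.base.Probe
    name = "base"


# Assumed shape: probe name -> probe class.
PROBE_REGISTRY: Dict[str, Type[Probe]] = {}


def register(cls: Type[Probe]) -> Type[Probe]:
    """Hypothetical helper: add a probe class under its `name`, rejecting duplicates."""
    if cls.name in PROBE_REGISTRY:
        raise ValueError(f"probe {cls.name!r} is already registered")
    PROBE_REGISTRY[cls.name] = cls
    return cls


@register
class YesNoProbe(Probe):
    name = "yesno"
```

Rejecting duplicate names early keeps a config entry such as `probes: ["yesno"]` from silently resolving to the wrong class.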
The numbers in the paper were produced with the following sequence:

    # 1. Build the real-data shards (TextVQA + VQAv2 counting slices)
    python scripts/build_probe_data.py --out-dir data/cache --limit 5000
    # 2. Run every config in configs/
    bash scripts/run_full_eval.sh
    # 3. Render the per-model heatmap + error taxonomy plots
    jupyter nbconvert --execute notebooks/analysis_demo.ipynb

All runs use seed 0; rerunning should reproduce the published numbers to within ±0.4 percentage points (absolute) on every probe.

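One way to check a rerun against the stated tolerance is a small comparison helper. The sketch below is not part of the toolkit: `out_of_tolerance` is a hypothetical name, the published values are the BLIP-2 (OPT-2.7B) row of the accuracy table in this README, and the rerun values are made up for illustration.

```python
from typing import Dict, Tuple

# BLIP-2 (OPT-2.7B) row of the accuracy table in this README.
PUBLISHED: Dict[str, float] = {
    "counting": 41.3, "color": 92.1, "spatial": 71.4,
    "negation": 38.9, "ocr": 22.7, "counterfactual": 45.8,
}


def out_of_tolerance(
    reproduced: Dict[str, float],
    published: Dict[str, float],
    tol: float = 0.4,
) -> Dict[str, Tuple[float, float]]:
    """Return {probe: (reproduced, published)} for probes more than tol points apart."""
    eps = 1e-9  # absorbs float rounding when the gap is exactly tol
    return {
        probe: (reproduced[probe], ref)
        for probe, ref in published.items()
        if abs(reproduced[probe] - ref) > tol + eps
    }


# Illustrative rerun values, not real measurements.
rerun = dict(PUBLISHED, counting=41.6, ocr=23.5)
print(out_of_tolerance(rerun, PUBLISHED))
```

With these illustrative values only `ocr` is flagged, since 23.5 is 0.8 points from 22.7 while 41.6 is within 0.4 of 41.3.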
Accuracy (%) of four open VQA models on the synthetic probe suite,
n=300 per probe, seed=0:
| Model | counting | color | spatial | negation | ocr | counterfactual |
|---|---|---|---|---|---|---|
| BLIP-2 (OPT-2.7B) | 41.3 | 92.1 | 71.4 | 38.9 | 22.7 | 45.8 |
| BLIP-2 (FLAN-T5-XL) | 46.5 | 93.4 | 75.2 | 44.1 | 24.5 | 49.3 |
| InstructBLIP (Vicuna) | 52.0 | 91.8 | 78.6 | 53.2 | 41.2 | 58.0 |
| LLaVA-1.5 (7B) | 48.7 | 94.5 | 81.9 | 45.6 | 67.4 | 57.1 |
Pair consistency (both halves correct) on the paired probes:
| Model | negation | counterfactual |
|---|---|---|
| BLIP-2 (OPT-2.7B) | 11.4 | 18.2 |
| BLIP-2 (FLAN-T5-XL) | 14.0 | 22.5 |
| InstructBLIP (Vicuna) | 21.7 | 29.4 |
| LLaVA-1.5 (7B) | 17.9 | 28.0 |
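Pair consistency as defined above (both halves correct) can be computed alongside per-example accuracy. The following is a minimal sketch under assumed inputs: each record is a hypothetical `(pair_id, correct)` tuple, and `pair_metrics` is an illustrative name, not the toolkit's analysis API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pair_metrics(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute per-example accuracy and pair consistency from (pair_id, correct) records."""
    pairs: Dict[str, List[bool]] = defaultdict(list)
    for pair_id, correct in records:
        pairs[pair_id].append(correct)

    # Only pairs with exactly two halves are scored.
    complete = [halves for halves in pairs.values() if len(halves) == 2]
    if not complete:
        raise ValueError("no complete pairs to score")

    accuracy = sum(sum(halves) for halves in complete) / (2 * len(complete))
    consistency = sum(all(halves) for halves in complete) / len(complete)
    return {"accuracy": accuracy, "consistency": consistency}


# A model that answers "yes" to everything gets one half of each negation pair right.
always_yes = [("p0", True), ("p0", False), ("p1", True), ("p1", False)]
print(pair_metrics(always_yes))
```

The toy input shows why the two numbers diverge: a model that ignores polarity scores 50% accuracy on balanced yes/no pairs but 0% pair consistency.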
A few patterns stand out from the per-probe and consistency numbers:
- All four models collapse on counting above 5. Per-count breakdown shows accuracy >85% for counts in {0..3} and <25% for counts >=6 — the headline "48% counting accuracy" hides a sharp cliff.
- BLIP-2 ignores negation. Pair consistency on the negation probe sits around 12% (vs ~40% headline accuracy), meaning the model is reliably giving the same yes/no answer to both halves of each pair regardless of polarity.
- LLaVA-1.5 dominates OCR but lags on causal sensitivity. Despite a ~3x lead over BLIP-2 on the OCR probe (67.4% vs 22.7%), its counterfactual pair consistency (28.0%) is no higher than InstructBLIP's (29.4%): it reads pixels well but doesn't reliably re-ground its answer when those pixels change.
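The per-count breakdown mentioned in the first bullet is a simple group-by over the counting results. This sketch assumes each record is a hypothetical `(true_count, correct)` tuple; `accuracy_by_count` is an illustrative name rather than a function the toolkit is known to expose.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_count(records: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Group counting results by ground-truth count and return accuracy per group."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for true_count, correct in records:
        totals[true_count] += 1
        hits[true_count] += int(correct)
    return {count: hits[count] / totals[count] for count in sorted(totals)}


# Toy records: perfect on count 1, half right on count 7.
toy = [(1, True), (1, True), (7, False), (7, True)]
print(accuracy_by_count(toy))
```

Reporting this table next to the headline number makes a cliff at higher counts visible instead of averaging it away.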
| Command | Purpose |
|---|---|
| `vqaprobe run -c CFG -o OUT` | Run a probe suite defined in YAML config |
| `vqaprobe run -c CFG --limit N` | Override `n_examples` for a smoke test |
| `vqaprobe list-probes` | Print registered probes + capabilities |
| `vqaprobe analyse RESULTS.json` | Pretty-print a saved results file |
All commands respect VQAPROBE_LOG=DEBUG for verbose logging.
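For readers wiring the same switch into their own scripts, an environment variable like `VQAPROBE_LOG` can be mapped to a logging level with the standard library. This is a sketch of the general pattern and not the toolkit's code: the `configure_logging` helper, the `"vqaprobe"` logger name and the `INFO` default are assumptions.

```python
import logging
import os
from typing import Mapping, Optional


def configure_logging(env: Optional[Mapping[str, str]] = None) -> int:
    """Set the 'vqaprobe' logger level from VQAPROBE_LOG and return the numeric level."""
    env = os.environ if env is None else env
    name = env.get("VQAPROBE_LOG", "INFO").upper()  # assumed default: INFO
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        # Unknown level names fall back to INFO instead of raising.
        level = logging.INFO
    logging.getLogger("vqaprobe").setLevel(level)
    return level
```

Passing a mapping explicitly, as in `configure_logging({"VQAPROBE_LOG": "DEBUG"})`, makes the helper easy to test without touching the real environment.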

    vqaprobe/
      probes/          # one file per probe + base class
      perturbations/   # image / text perturbations used by paired probes
      models/          # HuggingFace + API model wrappers
      analysis/        # consistency, error taxonomy, plots
      data/            # synthetic generators + balanced samplers
      utils/           # io, prompts, logging
      runner.py        # orchestrates probe x model
      cli.py           # click entry point
    configs/           # YAML run configs (per model)
    scripts/           # data builders + full eval driver
    docs/              # probe catalogue + how-to-add-a-probe
    tests/             # pytest smoke tests
    notebooks/         # exploratory analysis notebooks


    @article{chen2025vqaprobe,
      title={VQA-Probe: Diagnostic Probes for Visual Question Answering Models},
      author={Chen, Ruijie},
      journal={arXiv preprint arXiv:2025.14271},
      year={2025}
    }

- The HuggingFace `datasets` and `transformers` teams for making large-scale multimodal evaluation tractable.
- PyTorch and PIL for the underlying numerical / image plumbing.
- The Tsinghua NLP lab for the compute that produced the reported numbers.
BSD 3-Clause. See LICENSE for the full text.
The synthetic probes run on CPU in seconds, so no GPU is required for quick iteration.
Tests run on 3.9 / 3.10 / 3.11 via GitHub Actions.
Report reproducibility issues at https://github.com/cortsdine/vqa-probe/issues.
v0.4.1 — May 2026