DataFlow-VQA

Chinese documentation

A pipeline for extracting, curating, and generating chain-of-thought (CoT) data from PDF textbooks and exam papers.

📄Full Paper with Appendices 🤗Dataset 🤗FlipVQA-Miner Demo

Overview

DataFlow-VQA processes PDF documents through three sequential stages:

Stage 1 (Section 3.1: VQA Extraction): Parses PDFs using MinerU for document layout analysis, then uses an LLM to extract structured question-answer pairs with images.
Stage 2 (Section 3.2.1 to Section 3.2.5: Data Curation): Filters and cleans the extracted QA pairs: splits sub-questions, classifies question types, extracts concise answers, and removes low-quality items.
Stage 3 (Section 3.2.6: CoT Generation): Generates chain-of-thought reasoning via rejection sampling: an LLM generates answers, each answer is verified against the ground truth, and incorrect ones are retried.

Installation

This project is built on top of DataFlow. Clone and install it first:

git clone https://github.com/OpenDCAI/DataFlow.git
cd DataFlow
pip install -e ".[pdf2vqa]"

Then clone this repository:

git clone <this-repo-url>
cd DataFlow-VQA

Configuration

API Keys

Two API keys are required:

DF_API_KEY: API key for the LLM service (OpenAI, Google Gemini, DeepSeek, etc.)
MINERU_API_KEY: API key for MinerU document layout parsing

export DF_API_KEY="sk-xxxxx"
export MINERU_API_KEY="sk2-xxxxx"

LLM Endpoint

Each pipeline accepts --api_url and --model arguments. Any OpenAI-compatible API endpoint is supported, including OpenAI, Google Gemini (via proxy), DeepSeek, and others.

Provide the base URL without /chat/completions (e.g. https://api.openai.com/v1).
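The reason for omitting /chat/completions is that OpenAI-compatible clients append that route to the base URL themselves. The sketch below illustrates the convention with a hypothetical helper, chat_completions_url, which is not part of DataFlow; it only shows how a base URL and the route combine.

```python
def chat_completions_url(base_url: str) -> str:
    """Join an OpenAI-compatible base URL with the chat-completions route.

    Hypothetical helper for illustration; DataFlow's own client code may differ.
    """
    suffix = "/chat/completions"
    base = base_url.rstrip("/")
    # Tolerate a value that already ends with the route, so it is not doubled.
    if base.endswith(suffix):
        base = base[: -len(suffix)]
    return base + suffix


print(chat_completions_url("https://api.openai.com/v1"))
# -> https://api.openai.com/v1/chat/completions
```

A trailing slash on the base URL, as in the Gemini endpoint used later in this document, is harmless under this convention.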

Stage 1: VQA Extraction

Input Format

Create a JSONL file where each line describes one PDF extraction task:

{"input_pdf_paths": "./examples/VQA/questionextract_test.pdf", "name": "math1"}
{"input_pdf_paths": ["./examples/VQA/math_question.pdf", "./examples/VQA/math_answer.pdf"], "name": "math2"}

input_pdf_paths: A single PDF (questions and answers interleaved) or a list of two or more PDFs (questions before answers).
name: A unique identifier for this task (used for directory naming and caching).
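For larger batches it can be easier to generate the task file than to write it by hand. The following is a minimal sketch, assuming only the two fields described above; tasks_to_jsonl is a hypothetical helper name, and the uniqueness check reflects the requirement that name identify each task.

```python
import json


def tasks_to_jsonl(tasks):
    """Serialize extraction tasks to JSONL text, one task per line.

    Hypothetical helper; it only checks that every task has a unique name.
    """
    names = [task["name"] for task in tasks]
    if len(names) != len(set(names)):
        raise ValueError("each task needs a unique name")
    return "".join(json.dumps(task, ensure_ascii=False) + "\n" for task in tasks)


tasks = [
    # One PDF with questions and answers interleaved.
    {"input_pdf_paths": "./examples/VQA/questionextract_test.pdf", "name": "math1"},
    # Separate PDFs: questions first, then answers.
    {
        "input_pdf_paths": [
            "./examples/VQA/math_question.pdf",
            "./examples/VQA/math_answer.pdf",
        ],
        "name": "math2",
    },
]
print(tasks_to_jsonl(tasks), end="")
```

The returned text can be written to any file path and passed to the pipeline through --input_file.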

Run

python -m pipelines.vqa_extract_optimized_pipeline \
    --input_file ./examples/VQA/vqa_extract_test.jsonl \
    --output_dir ./output \
    --api_url https://generativelanguage.googleapis.com/v1beta/openai/ \
    --model gemini-2.5-pro

Important: We recommend a strong model for this stage. Weaker models such as gpt-5-mini may perform poorly.

Output

{output_dir}/raw_vqa.jsonl: Extracted QA pairs with image references
{output_dir}/{name}/vqa_images/: Extracted images
cache/{name}/extracted_vqa.jsonl, merged_qa_pairs.jsonl, merged_qa_pairs.md: Per-task intermediate files

Each QA item contains:

{
  "question": "Compute $x$ such that $x^2 - 1 = 0$.",
  "answer": "$x = 1$ or $x = -1$",
  "solution": "Factor as $(x-1)(x+1)=0$.",
  "label": 1,
  "question_chapter_title": "Chapter 1: Quadratic Equations",
  "answer_chapter_title": "Chapter 1: Quadratic Equations",
  "image_basedir": "/path/to/your/images"
}

Note

A local MinerU deployment is also supported: replace FileOrURLToMarkdownConverterAPI with FileOrURLToMarkdownConverterLocal or FileOrURLToMarkdownConverterFlash in pipelines/vqa_extract_optimized_pipeline.py:

# Original opendatalab local version
self.mineru_executor = FileOrURLToMarkdownConverterLocal(
    intermediate_dir="intermediate",
    mineru_model_path="path/to/mineru/model",
)

# Accelerated version (Flash)
self.mineru_executor = FileOrURLToMarkdownConverterFlash(
    intermediate_dir="intermediate",
    mineru_model_path="path/to/mineru/model",
    batch_size=4,
    replicas=1,
    num_gpus_per_replica=1,
    engine_gpu_util_rate_to_ray_cap=0.9,
)

See DataFlow's MinerU operators for full parameter documentation.

Pipeline details

The extraction pipeline runs six steps:

PDF Merging (PDF_Merger): If multiple PDFs are provided, merges them into one.
Document Layout Parsing (FileOrURLToMarkdownConverterAPI): Calls the MinerU API to produce structured JSON layout tokens and page images.
Layout Preprocessing (MinerU2LLMInputOperator): Flattens list items and re-indexes IDs to prepare LLM-ready input.
LLM Extraction (ChunkedPromptedGenerator): Chunks the layout JSON (max 128k tokens per chunk) and calls the LLM with QAExtractPrompt to extract QA pairs as structured XML.
Output Parsing (LLMOutputParser): Parses the XML response into JSONL and copies images to vqa_images/.
QA Merging (QA_Merger): For separated question/answer PDFs, matches question and answer blocks by chapter title and question number. The operator takes a strict_title_match parameter: when True, chapter titles are matched by exact string comparison; otherwise, the operator tries to extract Chinese or English sequence numbers from the titles and matches on those.
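The difference between the two title-matching modes in the last step can be sketched as follows. This is an illustration only, not the QA_Merger implementation: the function names are hypothetical, the numeral conversion covers just 一 (1) through 九十九 (99), and the fallback to exact comparison when no number is found is an assumption.

```python
import re

DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}


def chinese_to_int(numeral):
    """Convert a Chinese numeral from 一 (1) to 九十九 (99); None otherwise."""
    if "十" in numeral:
        tens, _, ones = numeral.partition("十")
        tens_value = DIGITS.get(tens) if tens else 1
        ones_value = DIGITS.get(ones) if ones else 0
        if tens_value is None or ones_value is None:
            return None
        return tens_value * 10 + ones_value
    return DIGITS.get(numeral)


def sequence_number(title):
    """Return the first Arabic or Chinese sequence number in a title, or None."""
    arabic = re.search(r"\d+", title)
    if arabic:
        return int(arabic.group())
    chinese = re.search(r"[一二三四五六七八九十]+", title)
    return chinese_to_int(chinese.group()) if chinese else None


def titles_match(question_title, answer_title, strict_title_match=False):
    """Exact comparison in strict mode; otherwise compare sequence numbers."""
    if strict_title_match:
        return question_title == answer_title
    q_num = sequence_number(question_title)
    a_num = sequence_number(answer_title)
    if q_num is None or a_num is None:
        # Assumed fallback: with no number to compare, require identical titles.
        return question_title == answer_title
    return q_num == a_num


print(titles_match("Chapter 1: Quadratic Equations", "Answers to Chapter 1"))
print(titles_match("Chapter 1: Quadratic Equations", "Answers to Chapter 1",
                   strict_title_match=True))
```

Number-based matching is more forgiving when the answer PDF words its chapter headings differently from the question PDF, while strict matching avoids false pairings when several chapters share a number.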

Stage 2: Data Curation

python -m pipelines.curate_data \
    --input_file ./output/raw_vqa.jsonl \
    --api_url https://api.openai.com/v1 \
    --model gpt-5-mini

Output is saved as curated_vqa.jsonl in the same directory as --input_file.

Pipeline details

Four sequential steps:

1. Sub-question Splitting

Questions with multiple independent parts (e.g. (a), (b), (c)) are split into separate items. Each sub-question is paired with its corresponding sub-answer and sub-solution. An item is discarded if its question is empty, or if both its answer and its solution are empty.

Sub-questions that are context-sensitive (e.g. (b) uses the result of (a)) will not be split into separate items.

Adds field: split_qa
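The discard rule for this step can be written as a small predicate. This is a sketch under the assumption that items are dictionaries with the question, answer, and solution fields shown in the Stage 1 output; keep_item is a hypothetical name, not a function in this repository.

```python
def keep_item(item):
    """Keep an item that has a question and at least one of answer or solution."""
    question = (item.get("question") or "").strip()
    answer = (item.get("answer") or "").strip()
    solution = (item.get("solution") or "").strip()
    return bool(question) and bool(answer or solution)


items = [
    {"question": "Compute $1 + 1$.", "answer": "2", "solution": ""},
    {"question": "Compute $2 + 2$.", "answer": "", "solution": ""},  # nothing to check against
    {"question": "", "answer": "4", "solution": "Add the numbers."},  # no question
]
print([keep_item(item) for item in items])
# -> [True, False, False]
```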

2. Question Type Classification

Each question is classified as one of: Calculation, Proof, Explanation, Fill-in, Multiple-choice, Sketching, Other.

By default, only Calculation, Fill-in, and Multiple-choice are retained. To change this, edit the filter_rules list in DataCurationPipeline.__init__.

Adds fields: type, type_reason
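The default retention rule amounts to a membership test on the type field. The sketch below assumes items are dictionaries carrying that field; the actual format of filter_rules in DataCurationPipeline is not shown in this document, so RETAINED_TYPES and filter_by_type are hypothetical stand-ins.

```python
# Default set of retained types, as described above.
RETAINED_TYPES = frozenset({"Calculation", "Fill-in", "Multiple-choice"})


def filter_by_type(items, retained_types=RETAINED_TYPES):
    """Return only the items whose classified type is in retained_types."""
    return [item for item in items if item.get("type") in retained_types]


items = [
    {"question": "Compute $3 \\times 4$.", "type": "Calculation"},
    {"question": "Prove that $\\sqrt{2}$ is irrational.", "type": "Proof"},
    {"question": "Which option is correct?", "type": "Multiple-choice"},
]
print([item["type"] for item in filter_by_type(items)])
# -> ['Calculation', 'Multiple-choice']
```

These three types have short final answers that can be checked automatically, which is what the verification step in Stage 3 relies on.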

3. Answer Extraction

Extracts a concise final answer from the solution field and writes it to answer. Items that already have a non-empty answer are skipped (set overwrite=True in AnswerExtractionOperator to override).

4. QA Filtering

Removes items based on the following criteria:

The question must pose a clear, specific problem suitable for an exam. Examples, statements without questions, and open-ended discussions are rejected.
The answer must directly address the question.
The question and answer must be self-contained, without relying on external references or omitted context.

Adds fields: filter_result, filter_reason

Stage 3: Generate CoT

The answer model and judge model can use different API endpoints and API keys, which is useful when the answer model is a self-hosted open-source VLM (e.g. Qwen3-VL served via vLLM) and the judge model is a commercial API.

Use --answer_api_key_env / --judge_api_key_env to specify which environment variable holds the API key for each model (default: DF_API_KEY for both).

# Example: self-hosted Qwen3-VL for answers, OpenAI for judging
export VLLM_API_KEY="token-xxxx"   # or leave empty if your vLLM server needs no key
export DF_API_KEY="sk-xxxx"

python -m pipelines.generate_cot \
    --input_file ./output/curated_vqa.jsonl \
    --max_retries 5 \
    --answer_api_url https://your-vllm-server/v1 \
    --answer_model qwen3-vl-235b-thinking \
    --answer_api_key_env VLLM_API_KEY \
    --judge_api_url https://api.openai.com/v1 \
    --judge_model gpt-5-mini \
    --judge_api_key_env DF_API_KEY
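The *_api_key_env flags take the name of an environment variable rather than the key itself, which keeps secrets out of the command line. A minimal sketch of that lookup follows; resolve_api_key is a hypothetical helper, and returning an empty string for an unset variable is an assumption made to cover vLLM servers that need no key.

```python
import os


def resolve_api_key(env_name, default_env="DF_API_KEY"):
    """Read an API key from the named environment variable.

    Falls back to default_env when no name is given, and returns an empty
    string when the variable is unset.
    """
    return os.environ.get(env_name or default_env, "")


answer_key = resolve_api_key("VLLM_API_KEY")  # key for the answer model
judge_key = resolve_api_key(None)             # falls back to DF_API_KEY
print(bool(answer_key), bool(judge_key))
```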

Output is saved as curated_vqa_with_cot.jsonl in the same directory as --input_file.

Pipeline details

Uses rejection sampling over up to max_retries rounds:

1. Answer Generation (VQAReasoningAnswerGenerator)

The LLM generates a step-by-step answer, which is stored in generated_cot. Set skip_text_only=True in RejectSamplingPipeline to process only VQA items (questions containing images), or set it to False to process all items.

2. Thinking Cleanup

Strips <think>...</think> content from the generated answer to reduce verification cost. The cleaned answer is stored in llm_short_answer. Assumes the model outputs <think>THINK</think>ANSWER or THINK</think>ANSWER.
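Both assumed output shapes can be handled by keeping only the text after the last closing tag. The following is a minimal sketch of that idea; strip_thinking is a hypothetical name and the repository's own cleanup code may differ.

```python
def strip_thinking(generated):
    """Return the text after the last </think> tag, or the whole text if absent."""
    # rpartition yields ("", "", generated) when the marker does not occur.
    _, found, answer = generated.rpartition("</think>")
    return answer.strip() if found else generated.strip()


print(strip_thinking("<think>Try factoring.</think>x = 1 or x = -1"))
# -> x = 1 or x = -1
print(strip_thinking("Try factoring.</think>x = 1 or x = -1"))
# -> x = 1 or x = -1
```

Because everything before the closing tag is dropped, the opening tag does not need to be present, which covers models whose chat template emits it outside the returned text.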

3. Answer Verification (BenchDatasetEvaluatorQuestion)

Compares llm_short_answer against the ground truth answer using semantic LLM evaluation (with 5% numerical tolerance). Items that pass are marked answer_match_result = True and skipped in subsequent rounds.
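The comparison itself is performed by the judge LLM, and this document does not specify how the 5% tolerance is applied. One plausible reading, sketched here as an assumption, is a relative error measured against the ground-truth value; within_tolerance is a hypothetical helper and is not part of BenchDatasetEvaluatorQuestion.

```python
def within_tolerance(predicted, reference, rel_tol=0.05):
    """True if predicted differs from reference by at most rel_tol of reference."""
    return abs(predicted - reference) <= rel_tol * abs(reference)


print(within_tolerance(9.6, 9.81))   # about 2.1% off
# -> True
print(within_tolerance(9.0, 9.81))   # about 8.3% off
# -> False
```

Under this reading a reference value of zero only accepts an exact zero, since there is no scale to measure a relative error against.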

Set support_subquestions=True to evaluate each sub-question independently; answer_match_result is False if any sub-question is wrong.

Evaluation statistics (overall accuracy, sub-question accuracy) are saved to ./cot_cache/eval_results.jsonl:

{
  "total_samples": 23584,
  "matched_samples": 12281,
  "accuracy": 0.521,
  "total_subquestions": 26380,
  "correct_subquestions": 13807,
  "subquestion_accuracy": 0.523
}
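Taken together, the three steps form a retry loop in which only unmatched items are regenerated. The sketch below shows that control flow with stand-in callables: generate and judge are hypothetical placeholders for the answer model and the judge model, and the returned dictionary mirrors only the sample-level fields of the statistics above.

```python
def reject_sample(items, generate, judge, max_retries=5):
    """Regenerate answers for unmatched items for up to max_retries rounds."""
    for _ in range(max_retries):
        pending = [item for item in items if not item.get("answer_match_result")]
        if not pending:
            break  # every item already has a verified answer
        for item in pending:
            item["generated_cot"] = generate(item["question"])
            item["answer_match_result"] = judge(item["generated_cot"], item["answer"])
    matched = sum(1 for item in items if item.get("answer_match_result"))
    return {
        "total_samples": len(items),
        "matched_samples": matched,
        "accuracy": matched / len(items) if items else 0.0,
    }


# Stand-in models: the generator is wrong on its first attempt per question.
attempts = {}

def flaky_generate(question):
    attempts[question] = attempts.get(question, 0) + 1
    return "4" if attempts[question] >= 2 else "5"

def exact_judge(generated, reference):
    return generated.strip() == reference.strip()

items = [{"question": "2 + 2 = ?", "answer": "4"}]
print(reject_sample(items, flaky_generate, exact_judge))
# -> {'total_samples': 1, 'matched_samples': 1, 'accuracy': 1.0}
```

The accuracy field is matched_samples divided by total_samples; in the sample statistics above, 12281 / 23584 rounds to 0.521 and 13807 / 26380 rounds to 0.523.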

Examples

Sample PDFs and input JSONL are provided in examples/VQA/:

examples/VQA/
├── vqa_extract_test.jsonl    # Example input for Stage 1
├── questionextract_test.pdf  # Single PDF with interleaved Q&A
├── math_question.pdf         # Questions PDF (for separated Q&A demo)
└── math_answer.pdf           # Answers PDF (for separated Q&A demo)

To run the full pipeline on the examples:

# Stage 1: Extract
python -m pipelines.vqa_extract_optimized_pipeline \
    --input_file ./examples/VQA/vqa_extract_test.jsonl \
    --output_dir ./output \
    --api_url https://generativelanguage.googleapis.com/v1beta/openai/ \
    --model gemini-2.5-pro

# Stage 2: Curate
python -m pipelines.curate_data \
    --input_file ./output/raw_vqa.jsonl \
    --api_url https://api.openai.com/v1 \
    --model gpt-5-mini

# Stage 3: Generate CoT
# Example: self-hosted Qwen3-VL for answers, OpenAI for judging
export VLLM_API_KEY="token-xxxx"   # or leave empty if your vLLM server needs no key
export DF_API_KEY="sk-xxxx"

python -m pipelines.generate_cot \
    --input_file ./output/curated_vqa.jsonl \
    --max_retries 5 \
    --answer_api_url https://your-vllm-server/v1 \
    --answer_model qwen3-vl-235b-thinking \
    --answer_api_key_env VLLM_API_KEY \
    --judge_api_url https://api.openai.com/v1 \
    --judge_model gpt-5-mini \
    --judge_api_key_env DF_API_KEY

Note

The implementation in this repository is intended only for small-scale demos. To run the pipeline on a large number of books, you will probably need features such as checkpoint resume and batched inference.

License

This project is licensed under the Apache License 2.0.
