A pipeline for extracting, curating, and generating chain-of-thought (CoT) data from PDF textbooks and exam papers.
DataFlow-VQA processes PDF documents through three sequential stages:
- Stage 1 (Section 3.1: VQA Extraction): Parses PDFs using MinerU for document layout analysis, then uses an LLM to extract structured question-answer pairs with images.
- Stage 2 (Section 3.2.1 to Section 3.2.5: Data Curation): Filters and cleans the extracted QA pairs: splits sub-questions, classifies question types, extracts concise answers, and removes low-quality items.
- Stage 3 (Section 3.2.6: CoT Generation): Generates chain-of-thought reasoning via rejection sampling: an LLM generates answers, each answer is verified against the ground truth, and incorrect ones are retried.
This project is built on top of DataFlow. Clone and install it first:
git clone https://github.com/OpenDCAI/DataFlow.git
cd DataFlow
pip install -e ".[pdf2vqa]"
Then clone this repository:
git clone <this-repo-url>
cd DataFlow-VQA
Two API keys are required:
- DF_API_KEY: API key for the LLM service (OpenAI, Google Gemini, DeepSeek, etc.)
- MINERU_API_KEY: API key for MinerU document layout parsing
export DF_API_KEY="sk-xxxxx"
export MINERU_API_KEY="sk2-xxxxx"
Each pipeline accepts --api_url and --model arguments. Any OpenAI-compatible API endpoint is supported, including OpenAI, Google Gemini (via proxy), DeepSeek, and others.
Provide the base URL without /chat/completions (e.g. https://api.openai.com/v1).
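If a provider's documentation lists the full endpoint rather than the base URL, the suffix can be trimmed before passing the value to --api_url. The following is a minimal sketch; `normalize_base_url` is a hypothetical helper for illustration and is not part of DataFlow.

```python
def normalize_base_url(url: str) -> str:
    """Return the URL with a trailing /chat/completions removed, if present.

    Hypothetical helper: URLs that do not end in the suffix are returned
    unchanged (apart from surrounding whitespace).
    """
    suffix = "/chat/completions"
    trimmed = url.strip()
    if trimmed.rstrip("/").endswith(suffix):
        trimmed = trimmed.rstrip("/")[: -len(suffix)]
    return trimmed


if __name__ == "__main__":
    print(normalize_base_url("https://api.openai.com/v1/chat/completions"))
```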
Create a JSONL file where each line describes one PDF extraction task:
{"input_pdf_paths": "./examples/VQA/questionextract_test.pdf", "name": "math1"}
{"input_pdf_paths": ["./examples/VQA/math_question.pdf", "./examples/VQA/math_answer.pdf"], "name": "math2"}
- input_pdf_paths: A single PDF (questions and answers interleaved) or a list of two or more PDFs (questions before answers).
- name: A unique identifier for this task (used for directory naming and caching).
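For larger batches, the task file can be generated rather than written by hand. The sketch below writes the two example tasks above to a JSONL file and checks that every name is unique; `write_tasks` and the output filename `my_tasks.jsonl` are hypothetical and not part of this repository.

```python
import json
from pathlib import Path

# The two example tasks from this section.
tasks = [
    {"input_pdf_paths": "./examples/VQA/questionextract_test.pdf", "name": "math1"},
    {
        "input_pdf_paths": [
            "./examples/VQA/math_question.pdf",
            "./examples/VQA/math_answer.pdf",
        ],
        "name": "math2",
    },
]


def write_tasks(tasks, out_path):
    """Write one JSON object per line, refusing duplicate task names."""
    names = [task["name"] for task in tasks]
    if len(names) != len(set(names)):
        raise ValueError("task names must be unique")
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        for task in tasks:
            f.write(json.dumps(task, ensure_ascii=False) + "\n")
    return out


if __name__ == "__main__":
    write_tasks(tasks, "my_tasks.jsonl")
```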
python -m pipelines.vqa_extract_optimized_pipeline \
--input_file ./examples/VQA/vqa_extract_test.jsonl \
--output_dir ./output \
--api_url https://generativelanguage.googleapis.com/v1beta/openai/ \
--model gemini-2.5-pro
Important: We recommend a strong model for this stage. Weaker models such as gpt-5-mini may perform poorly.
- {output_dir}/raw_vqa.jsonl: Extracted QA pairs with image references
- {output_dir}/{name}/vqa_images/: Extracted images
- cache/{name}/extracted_vqa.jsonl, merged_qa_pairs.jsonl, merged_qa_pairs.md: Per-task intermediate files
Each QA item contains:
{
"question": "Compute $x$ such that $x^2 - 1 = 0$.",
"answer": "$x = 1$ or $x = -1$",
"solution": "Factor as $(x-1)(x+1)=0$.",
"label": 1,
"question_chapter_title": "Chapter 1: Quadratic Equations",
"answer_chapter_title": "Chapter 1: Quadratic Equations",
"image_basedir": "/path/to/your/images"
}
We also support using a local MinerU deployment. Replace FileOrURLToMarkdownConverterAPI with FileOrURLToMarkdownConverterLocal or FileOrURLToMarkdownConverterFlash in pipelines/vqa_extract_optimized_pipeline.py:
# Original opendatalab local version
self.mineru_executor = FileOrURLToMarkdownConverterLocal(
intermediate_dir="intermediate",
mineru_model_path="path/to/mineru/model",
)
# Accelerated version (Flash)
self.mineru_executor = FileOrURLToMarkdownConverterFlash(
intermediate_dir="intermediate",
mineru_model_path="path/to/mineru/model",
batch_size=4,
replicas=1,
num_gpus_per_replica=1,
engine_gpu_util_rate_to_ray_cap=0.9,
)
See DataFlow's MinerU operators for full parameter documentation.
Pipeline details
The extraction pipeline runs six steps:
- PDF Merging (PDF_Merger): If multiple PDFs are provided, merges them into one.
- Document Layout Parsing (FileOrURLToMarkdownConverterAPI): Calls the MinerU API to produce structured JSON layout tokens and page images.
- Layout Preprocessing (MinerU2LLMInputOperator): Flattens list items and re-indexes IDs to prepare LLM-ready input.
- LLM Extraction (ChunkedPromptedGenerator): Chunks the layout JSON (max 128k tokens per chunk) and calls the LLM with QAExtractPrompt to extract QA pairs as structured XML.
- Output Parsing (LLMOutputParser): Parses the XML response into JSONL and copies images to vqa_images/.
- QA Merging (QA_Merger): For separated question/answer PDFs, matches question and answer blocks by chapter title and question number. This operator includes a strict_title_match parameter: when set to True, the operator performs an exact string match on chapter titles. Otherwise, the operator attempts to extract Chinese or English sequence numbers from the titles for matching.
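To illustrate the two chapter-title matching modes described for QA_Merger, here is a rough sketch. It is not the operator's actual code: the function names are hypothetical, "English sequence numbers" is assumed to mean Arabic digits, only Chinese numerals below 100 are handled, and the fallback to exact matching when no number is found is an assumption.

```python
import re
from typing import Optional

_CN_DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
              "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
_CN_PATTERN = re.compile("[零一二两三四五六七八九十]+")


def chinese_to_int(numeral: str) -> Optional[int]:
    """Convert a Chinese numeral below 100 (e.g. 三, 十二, 二十三) to an int."""
    try:
        if "十" in numeral:
            tens, _, ones = numeral.partition("十")
            return (_CN_DIGITS[tens] if tens else 1) * 10 + (_CN_DIGITS[ones] if ones else 0)
        return _CN_DIGITS[numeral]
    except KeyError:
        return None  # outside the range this sketch handles


def extract_sequence_number(title: str) -> Optional[int]:
    """Return the first Arabic or Chinese sequence number in a title, if any."""
    arabic = re.search(r"\d+", title)
    if arabic:
        return int(arabic.group())
    chinese = _CN_PATTERN.search(title)
    return chinese_to_int(chinese.group()) if chinese else None


def titles_match(q_title: str, a_title: str, strict_title_match: bool = False) -> bool:
    """Compare chapter titles exactly, or by extracted sequence number."""
    if strict_title_match:
        return q_title == a_title
    q_num = extract_sequence_number(q_title)
    a_num = extract_sequence_number(a_title)
    if q_num is None or a_num is None:
        return q_title == a_title  # assumption: fall back to exact match
    return q_num == a_num
```

With strict matching disabled, a question chapter titled "Chapter 1: Quadratic Equations" would match an answer chapter titled "Chapter 1 Answers", which is useful when the answer PDF abbreviates its headings.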
python -m pipelines.curate_data \
--input_file ./output/raw_vqa.jsonl \
--api_url https://api.openai.com/v1 \
--model gpt-5-mini
Output is saved as curated_vqa.jsonl in the same directory as --input_file.
Pipeline details
Four sequential steps:
1. Sub-question Splitting
Questions with multiple independent parts (e.g. (a), (b), (c)) are split into separate items. Each sub-question is paired with its corresponding sub-answer and sub-solution. An item is discarded if its question is empty, or if both its answer and its solution are empty.
Sub-questions that are context-sensitive (e.g. (b) uses the result of (a)) will not be split into separate items.
Adds field: split_qa
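The discard rule for this step can be read as a simple predicate: drop an item when the question is empty, or when the answer and the solution are both empty. The sketch below assumes that reading; `should_discard` is a hypothetical name, and treating whitespace-only strings as empty is an assumption.

```python
def should_discard(item: dict) -> bool:
    """Return True if the item lacks a question, or lacks both answer and solution."""

    def is_empty(key: str) -> bool:
        # Missing keys, None, and whitespace-only strings all count as empty here.
        return not str(item.get(key) or "").strip()

    return is_empty("question") or (is_empty("answer") and is_empty("solution"))
```

An item with a question and only a solution is kept, since the later Answer Extraction step can fill in the answer from the solution.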
2. Question Type Classification
Each question is classified as one of: Calculation, Proof, Explanation, Fill-in, Multiple-choice, Sketching, Other.
By default, only Calculation, Fill-in, and Multiple-choice are retained. To change this, edit the filter_rules list in DataCurationPipeline.__init__.
Adds fields: type, type_reason
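The default retention rule amounts to a set-membership filter on the type field. A minimal sketch follows; the actual structure of filter_rules in DataCurationPipeline is not shown in this README, so `RETAINED_TYPES` and `filter_by_type` are hypothetical stand-ins.

```python
# Assumed default: the three types this README says are retained.
RETAINED_TYPES = frozenset({"Calculation", "Fill-in", "Multiple-choice"})


def filter_by_type(items, retained=RETAINED_TYPES):
    """Keep only items whose classified type is in the retained set."""
    return [item for item in items if item.get("type") in retained]


if __name__ == "__main__":
    sample = [
        {"question": "Compute 2 + 2.", "type": "Calculation"},
        {"question": "Prove that sqrt(2) is irrational.", "type": "Proof"},
    ]
    print(filter_by_type(sample))
```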
3. Answer Extraction
Extracts a concise final answer from the solution field and writes it to answer. Items that already have a non-empty answer are skipped (set overwrite=True in AnswerExtractionOperator to override).
4. QA Filtering
Removes items based on the following criteria:
- The question must pose a clear, specific problem suitable for an exam. Examples, statements without questions, and open-ended discussions are rejected.
- The answer must directly address the question.
- The question and answer must be self-contained, without relying on external references or omitted context.
Adds fields: filter_result, filter_reason
The answer model and judge model can use different API endpoints and API keys, which is useful when the answer model is a self-hosted open-source VLM (e.g. Qwen3-VL served via vLLM) and the judge model is a commercial API.
Use --answer_api_key_env / --judge_api_key_env to specify which environment variable holds the API key for each model (default: DF_API_KEY for both).
# Example: self-hosted Qwen3-VL for answers, OpenAI for judging
export VLLM_API_KEY="token-xxxx" # or leave empty if your vLLM server needs no key
export DF_API_KEY="sk-xxxx"
python -m pipelines.generate_cot \
--input_file ./output/curated_vqa.jsonl \
--max_retries 5 \
--answer_api_url https://your-vllm-server/v1 \
--answer_model qwen3-vl-235b-thinking \
--answer_api_key_env VLLM_API_KEY \
--judge_api_url https://api.openai.com/v1 \
--judge_model gpt-5-mini \
--judge_api_key_env DF_API_KEY
Output is saved as curated_vqa_with_cot.jsonl in the same directory as --input_file.
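The --answer_api_key_env and --judge_api_key_env options take the name of an environment variable, not the key itself. The indirection can be sketched as follows; `resolve_api_key` is a hypothetical helper, and returning an empty string for an unset variable is an assumption that mirrors the keyless vLLM case in the example above.

```python
import os


def resolve_api_key(env_name: str = "DF_API_KEY") -> str:
    """Look up the API key stored in the environment variable named env_name."""
    # Assumption: an unset variable means "no key required" and yields "".
    return os.environ.get(env_name, "")


if __name__ == "__main__":
    answer_key = resolve_api_key("VLLM_API_KEY")
    judge_key = resolve_api_key("DF_API_KEY")
    print(bool(answer_key), bool(judge_key))
```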
Pipeline details
Uses rejection sampling over up to max_retries rounds:
1. Answer Generation (VQAReasoningAnswerGenerator)
The LLM generates a step-by-step answer. Set skip_text_only=True in RejectSamplingPipeline to process only VQA items (questions containing images); set it to False to process all items. The generated answer is stored in generated_cot.
2. Thinking Cleanup
Strips <think>...</think> content from the generated answer to reduce verification cost. The cleaned answer is stored in llm_short_answer. Assumes the model outputs <think>THINK</think>ANSWER or THINK</think>ANSWER.
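Both assumed output shapes can be handled by keeping only the text after the last closing tag. This is a minimal sketch of that idea rather than the pipeline's own code; `strip_thinking` is a hypothetical name.

```python
def strip_thinking(generated: str) -> str:
    """Drop everything up to and including the last </think> tag.

    Covers both <think>THINK</think>ANSWER and THINK</think>ANSWER.
    Text without a closing tag is returned unchanged (apart from trimming).
    """
    return generated.rsplit("</think>", 1)[-1].strip()


if __name__ == "__main__":
    print(strip_thinking("<think>Factor the quadratic.</think>x = 1 or x = -1"))
```

Note that a truncated response containing an opening tag but no closing tag passes through untouched under this approach, so such outputs would still carry their reasoning text into verification.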
3. Answer Verification (BenchDatasetEvaluatorQuestion)
Compares llm_short_answer against the ground truth answer using semantic LLM evaluation (with 5% numerical tolerance). Items that pass are marked answer_match_result = True and skipped in subsequent rounds.
Set support_subquestions=True to evaluate each sub-question independently; answer_match_result is False if any sub-question is wrong.
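The retry loop formed by the steps above can be sketched as follows. This is an illustration under stated assumptions, not the RejectSamplingPipeline implementation: `generate` and `verify` are hypothetical callables standing in for the answer model and the LLM judge, the thinking-cleanup step is omitted, and `within_tolerance` is a purely numeric stand-in for the 5% tolerance, measured relative to the ground truth.

```python
from typing import Callable, Dict, List


def within_tolerance(predicted: float, truth: float, rel_tol: float = 0.05) -> bool:
    """Numeric stand-in for the judge: accept values within rel_tol of the truth."""
    if truth == 0:
        return predicted == 0
    return abs(predicted - truth) <= rel_tol * abs(truth)


def reject_sample(
    items: List[Dict],
    generate: Callable[[Dict], float],
    verify: Callable[[float, float], bool],
    max_retries: int = 5,
) -> List[Dict]:
    """Regenerate answers for unverified items for at most max_retries rounds."""
    for _ in range(max_retries):
        # Items that already passed verification are skipped in later rounds.
        pending = [item for item in items if not item.get("answer_match_result")]
        if not pending:
            break
        for item in pending:
            item["generated_cot"] = generate(item)
            item["answer_match_result"] = verify(item["generated_cot"], item["answer"])
    return items
```

Items that never pass keep answer_match_result = False after the final round, so they can be filtered out downstream.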
Evaluation statistics (overall accuracy, sub-question accuracy) are saved to ./cot_cache/eval_results.jsonl:
{
"total_samples": 23584,
"matched_samples": 12281,
"accuracy": 0.521,
"total_subquestions": 26380,
"correct_subquestions": 13807,
"subquestion_accuracy": 0.523
}
Sample PDFs and input JSONL are provided in examples/VQA/:
examples/VQA/
├── vqa_extract_test.jsonl # Example input for Stage 1
├── questionextract_test.pdf # Single PDF with interleaved Q&A
├── math_question.pdf # Questions PDF (for separated Q&A demo)
└── math_answer.pdf # Answers PDF (for separated Q&A demo)
To run the full pipeline on the examples:
# Stage 1: Extract
python -m pipelines.vqa_extract_optimized_pipeline \
--input_file ./examples/VQA/vqa_extract_test.jsonl \
--output_dir ./output \
--api_url https://generativelanguage.googleapis.com/v1beta/openai/ \
--model gemini-2.5-pro
# Stage 2: Curate
python -m pipelines.curate_data \
--input_file ./output/raw_vqa.jsonl \
--api_url https://api.openai.com/v1 \
--model gpt-5-mini
# Stage 3: Generate CoT
# Example: self-hosted Qwen3-VL for answers, OpenAI for judging
export VLLM_API_KEY="token-xxxx" # or leave empty if your vLLM server needs no key
export DF_API_KEY="sk-xxxx"
python -m pipelines.generate_cot \
--input_file ./output/curated_vqa.jsonl \
--max_retries 5 \
--answer_api_url https://your-vllm-server/v1 \
--answer_model qwen3-vl-235b-thinking \
--answer_api_key_env VLLM_API_KEY \
--judge_api_url https://api.openai.com/v1 \
--judge_model gpt-5-mini \
--judge_api_key_env DF_API_KEY
The implementation in this repository is intended only for small-scale demos. To run the pipeline on a large number of books, you will probably need the Checkpoint Resume and Batched Inference features.
This project is licensed under the Apache License 2.0.