An evaluation dataset for testing Retrieval-Augmented Generation (RAG) systems on neural networks and deep learning content. It contains 20 human-verified questions with ground-truth answers, sourced from 4 educational YouTube videos.
flowchart LR
subgraph Phase 1
V[4 YouTube Videos] --> R[Raw Transcripts]
R --> T[Translate Hindi→English]
T --> P[Clean & Chunk]
end
subgraph Phase 2
P --> G[Gemini Topic Extraction]
G --> HR[Human Review]
HR --> TM[Topic Map - 71 topics]
end
subgraph Phase 3
TM --> TS[Select 20 Topics]
TS --> QG[Gemini Question Generation]
QG --> QR[Human Review]
QR --> QJ[20 Questions]
end
subgraph Phase 4
QJ --> AG[Gemini Draft Answers]
AG --> CC[Confidence Cross-Check]
CC --> HV[Human Verification]
HV --> GT[Ground Truth]
end
subgraph Phase 5-6
GT --> VD[Validate Dataset]
VD --> FD[eval_dataset_v1.0.json]
end
| Metric | Value |
|---|---|
| Total Questions | 20 |
| Source Videos | 4 (2 English, 2 Hindi) |
| Factual / Conceptual / Comparative / Application | 6 / 6 / 4 / 4 |
| Easy / Medium / Hard | 6 / 9 / 5 |
| Questions per Video | 5 each |
# Install dependencies
pip install -r requirements.txt
# Validate the dataset
python scripts/validate_dataset.py validate
# Dry-run (score ground truth against itself)
python scripts/validate_dataset.py dry-run
# Score your RAG system's predictions
python scripts/validate_dataset.py score --input predictions.json

The `--input` file is a JSON object with a `predictions` array, one entry per question:

{
  "predictions": [
    {"question_id": "Q001", "answer": "Your RAG system's answer here..."}
  ]
}

| Video | Creator | Language | Duration |
|---|---|---|---|
| But what is a Neural Network? | 3Blue1Brown | English | ~19 min |
| Transformers, the tech behind LLMs | 3Blue1Brown | English | ~27 min |
| What is Deep Learning? | CampusX | Hindi | ~67 min |
| All About ML & Deep Learning | CodeWithHarry | Hindi | ~15 min |
vidqa-rag/
├── deliverables/
│ └── eval_dataset_v1.0.json # ← THE DELIVERABLE
├── data/
│ ├── ground_truth.json # Build artifact (Phase 4)
│ ├── questions.json # Build artifact (Phase 3)
│ ├── topic_map.json # Build artifact (Phase 2)
│ ├── *_review.csv # Human review artifacts
│ ├── raw/ # Raw transcripts
│ ├── processed/ # Cleaned transcripts
│ └── translated/ # Hindi→English translations
├── scripts/
│ ├── extract_transcripts.py # Phase 1: YouTube → raw JSON
│ ├── translate_transcripts.py # Phase 1: Hindi → English
│ ├── clean_transcripts.py # Phase 1: Sentence chunking
│ ├── validate_schema.py # Phase 1: Transcript validation
│ ├── hybrid_topic_extractor.py # Phase 2: Topic extraction
│ ├── question_generator.py # Phase 3: Question generation
│ ├── ground_truth_generator.py # Phase 4: Answer generation
│ └── validate_dataset.py # Phase 5: Validation + scoring
├── guide/
│ └── EVAL_GUIDE.md # How to use the dataset
├── config/settings.yaml
├── schemas/transcript_schema.json
├── results/ # Scoring output
├── requirements.txt
└── README.md
graph TD
A[Phase 1: Transcript Collection] -->|4 videos, 128 min| B[Phase 2: Topic Extraction]
B -->|71 topics via Gemini + human review| C[Phase 3: Question Design]
C -->|20 questions, balanced allocation| D[Phase 4: Ground Truth]
D -->|LLM-drafted, human-verified answers| E[Phase 5: Validation]
E -->|Schema checks + scoring dry-run| F[Phase 6: Packaging]
style A fill:#e1f5fe
style B fill:#e8f5e9
style C fill:#fff3e0
style D fill:#fce4ec
style E fill:#f3e5f5
style F fill:#e0f2f1
Each phase uses the same pattern: automate with Gemini → export CSV → human review → finalize JSON. See docs/ for detailed phase documentation.
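The export-review-finalize loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the code in `scripts/`: the column names (`id`, `text`, `approved`, `reviewer_notes`) and both function names are hypothetical, and the Gemini step is assumed to have already produced the `items` list.

```python
import csv
import json
from pathlib import Path

# Hypothetical review columns; the real *_review.csv files may differ.
FIELDNAMES = ["id", "text", "approved", "reviewer_notes"]


def export_for_review(items: list[dict], csv_path: Path) -> None:
    """Write machine-generated items to CSV with empty reviewer columns."""
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        for item in items:
            writer.writerow(
                {"id": item["id"], "text": item["text"],
                 "approved": "", "reviewer_notes": ""}
            )


def finalize_from_review(csv_path: Path, json_path: Path) -> list[dict]:
    """Keep only rows a reviewer marked as approved and write them as JSON."""
    with csv_path.open(newline="", encoding="utf-8") as f:
        approved = [
            {"id": row["id"], "text": row["text"]}
            for row in csv.DictReader(f)
            if row["approved"].strip().lower() in {"y", "yes", "true", "1"}
        ]
    json_path.write_text(
        json.dumps(approved, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return approved
```

A reviewer would fill in the `approved` column in a spreadsheet between the two calls. Rows left blank are dropped, so nothing reaches the final JSON without an explicit sign-off.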
The scoring script computes:
| Metric | Description |
|---|---|
| Exact Match | Normalized prediction matches the ground-truth answer or one of its accepted variations verbatim |
| Token F1 | Harmonic mean of word-level precision and recall |
| Keyword Recall | Fraction of ground-truth keywords that appear in the prediction |
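To make the three metrics concrete, here is a rough sketch of how they could be computed. It is not the implementation in `scripts/validate_dataset.py`: the normalization (lowercasing, stripping ASCII punctuation, collapsing whitespace) is an assumption borrowed from common QA scoring practice, and keyword matching by substring is a simplification.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed rules)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction: str, references: list[str]) -> float:
    """1.0 if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return float(any(pred == normalize(ref) for ref in references))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of word-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared word at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def keyword_recall(prediction: str, keywords: list[str]) -> float:
    """Fraction of keywords found (as substrings) in the normalized prediction."""
    if not keywords:
        return 0.0
    pred = normalize(prediction)
    hits = sum(1 for kw in keywords if normalize(kw) in pred)
    return hits / len(keywords)
```

For a single prediction, `exact_match` would be called with the ground-truth answer plus its variations as `references`, while `token_f1` and `keyword_recall` give partial credit when the wording differs. Exact Match alone is usually too strict for free-form RAG answers, which is presumably why the softer metrics are reported alongside it.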
- Python 3.12+
- Dependencies: `pip install -r requirements.txt`
- Gemini API key in `.env` (only needed to regenerate data, not for scoring)