NeuralNet-RAG-Eval

An evaluation dataset for testing Retrieval-Augmented Generation (RAG) systems on neural networks and deep learning content. It contains 20 human-verified questions with ground-truth answers, sourced from 4 educational YouTube videos.

Pipeline

flowchart LR
    subgraph Phase 1
        V[4 YouTube Videos] --> R[Raw Transcripts]
        R --> T[Translate Hindi→English]
        T --> P[Clean & Chunk]
    end
    subgraph Phase 2
        P --> G[Gemini Topic Extraction]
        G --> HR[Human Review]
        HR --> TM[Topic Map - 71 topics]
    end
    subgraph Phase 3
        TM --> TS[Select 20 Topics]
        TS --> QG[Gemini Question Generation]
        QG --> QR[Human Review]
        QR --> QJ[20 Questions]
    end
    subgraph Phase 4
        QJ --> AG[Gemini Draft Answers]
        AG --> CC[Confidence Cross-Check]
        CC --> HV[Human Verification]
        HV --> GT[Ground Truth]
    end
    subgraph Phase 5-6
        GT --> VD[Validate Dataset]
        VD --> FD[eval_dataset_v1.0.json]
    end

Dataset Stats

Metric	Value
Total Questions	20
Source Videos	4 (2 English, 2 Hindi)
Factual / Conceptual / Comparative / Application	6 / 6 / 4 / 4
Easy / Medium / Hard	6 / 9 / 5
Questions per Video	5 each

Quickstart

# Install dependencies
pip install -r requirements.txt

# Validate the dataset
python scripts/validate_dataset.py validate

# Dry-run (score ground truth against itself)
python scripts/validate_dataset.py dry-run

# Score your RAG system's predictions
python scripts/validate_dataset.py score --input predictions.json

Prediction Format

{
  "predictions": [
    {"question_id": "Q001", "answer": "Your RAG system's answer here..."}
  ]
}
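A minimal sketch of producing a file in this format, assuming your RAG system can be called as a Python function that maps a question string to an answer string. The helper names `build_predictions` and `save_predictions` are hypothetical and not part of this repository; only the `predictions`, `question_id`, and `answer` keys come from the format above.

```python
import json
from typing import Callable


def build_predictions(questions: dict[str, str],
                      answer_fn: Callable[[str], str]) -> dict:
    """Map {question_id: question_text} to the prediction format shown above.

    `answer_fn` is a stand-in for your own RAG pipeline (an assumption here).
    """
    return {
        "predictions": [
            {"question_id": qid, "answer": answer_fn(text)}
            for qid, text in questions.items()
        ]
    }


def save_predictions(payload: dict, path: str = "predictions.json") -> None:
    """Write the payload as UTF-8 JSON so the scoring script can read it."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, ensure_ascii=False, indent=2)
```

Calling `save_predictions(build_predictions(questions, my_rag))` with your own `questions` mapping and `my_rag` function would then write a `predictions.json` that can be passed to the `score` command.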

Source Videos

Video	Creator	Language	Duration
But what is a Neural Network?	3Blue1Brown	English	~19 min
Transformers, the tech behind LLMs	3Blue1Brown	English	~27 min
What is Deep Learning?	CampusX	Hindi	~67 min
All About ML & Deep Learning	CodeWithHarry	Hindi	~15 min

Repository Structure

vidqa-rag/
├── deliverables/
│   └── eval_dataset_v1.0.json       # ← THE DELIVERABLE
├── data/
│   ├── ground_truth.json            # Build artifact (Phase 4)
│   ├── questions.json               # Build artifact (Phase 3)
│   ├── topic_map.json               # Build artifact (Phase 2)
│   ├── *_review.csv                 # Human review artifacts
│   ├── raw/                         # Raw transcripts
│   ├── processed/                   # Cleaned transcripts
│   └── translated/                  # Hindi→English translations
├── scripts/
│   ├── extract_transcripts.py       # Phase 1: YouTube → raw JSON
│   ├── translate_transcripts.py     # Phase 1: Hindi → English
│   ├── clean_transcripts.py         # Phase 1: Sentence chunking
│   ├── validate_schema.py           # Phase 1: Transcript validation
│   ├── hybrid_topic_extractor.py    # Phase 2: Topic extraction
│   ├── question_generator.py        # Phase 3: Question generation
│   ├── ground_truth_generator.py    # Phase 4: Answer generation
│   └── validate_dataset.py          # Phase 5: Validation + scoring
├── guides/
│   └── EVAL_GUIDE.md                # How to use the dataset
├── config/settings.yaml
├── schemas/transcript_schema.json
├── results/                         # Scoring output
├── requirements.txt
└── README.md

How It Was Built

graph TD
    A[Phase 1: Transcript Collection] -->|4 videos, 128 min| B[Phase 2: Topic Extraction]
    B -->|71 topics via Gemini + human review| C[Phase 3: Question Design]
    C -->|20 questions, balanced allocation| D[Phase 4: Ground Truth]
    D -->|LLM-drafted, human-verified answers| E[Phase 5: Validation]
    E -->|Schema checks + scoring dry-run| F[Phase 6: Packaging]
    
    style A fill:#e1f5fe
    style B fill:#e8f5e9
    style C fill:#fff3e0
    style D fill:#fce4ec
    style E fill:#f3e5f5
    style F fill:#e0f2f1

Phases 2 through 4 follow the same pattern: automate with Gemini → export CSV → human review → finalize JSON. See docs/ for detailed phase documentation.
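The export, review, and finalize steps of that pattern can be sketched as below. This is an illustration only: the column names (`id`, `draft`, `approved`, `corrected`) and both function names are hypothetical, and the repository's actual review CSV layout may differ.

```python
import csv
import io

# Hypothetical review columns; the real *_review.csv files may use others.
FIELDS = ["id", "draft", "approved", "corrected"]


def export_for_review(drafts: list[dict], fh) -> None:
    """Write LLM drafts to CSV, leaving empty columns for the human reviewer."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    for d in drafts:
        writer.writerow({"id": d["id"], "draft": d["draft"],
                         "approved": "", "corrected": ""})


def finalize(fh) -> list[dict]:
    """Keep only approved rows, preferring the reviewer's corrected text."""
    final = []
    for row in csv.DictReader(fh):
        if row["approved"].strip().lower() == "yes":
            text = row["corrected"].strip() or row["draft"]
            final.append({"id": row["id"], "text": text})
    return final
```

The list returned by `finalize` can then be written out with `json.dump` to produce the finalized JSON for the phase. Rejected rows are simply dropped in this sketch; a real workflow might instead send them back for regeneration.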

Metrics

The scoring script computes:

Metric	Description
Exact Match	Normalized verbatim match against the ground-truth answer or any of its accepted variations
Token F1	Harmonic mean of word-level precision and recall
Keyword Recall	Fraction of ground-truth keywords that appear in the prediction
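For intuition, the three metrics can be sketched as follows. This is not the code in scripts/validate_dataset.py: the normalization (lowercasing, stripping ASCII punctuation, collapsing whitespace) and the substring-based keyword check are assumptions, so scores from the real script may differ.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop ASCII punctuation, collapse spaces."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction: str, references: list[str]) -> float:
    """1.0 if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return float(any(pred == normalize(ref) for ref in references))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of word-level precision and recall (bag-of-words overlap)."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def keyword_recall(prediction: str, keywords: list[str]) -> float:
    """Fraction of keywords found as substrings of the normalized prediction."""
    if not keywords:
        return 0.0
    pred = normalize(prediction)
    return sum(normalize(kw) in pred for kw in keywords) / len(keywords)
```

Exact Match takes a list of references so that the ground-truth answer and its variations can be checked together, while Token F1 compares against a single reference at a time.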

Requirements

Python 3.12+
Dependencies: pip install -r requirements.txt
Gemini API key in .env (only needed to regenerate data, not for scoring)
