sidyr6002/vidqa-rag

NeuralNet-RAG-Eval

An evaluation dataset for testing Retrieval-Augmented Generation (RAG) systems on neural networks and deep learning content. It contains 20 human-verified questions with ground truth answers, sourced from 4 educational YouTube videos.

Pipeline

flowchart LR
    subgraph Phase 1
        V[4 YouTube Videos] --> R[Raw Transcripts]
        R --> T[Translate Hindi→English]
        T --> P[Clean & Chunk]
    end
    subgraph Phase 2
        P --> G[Gemini Topic Extraction]
        G --> HR[Human Review]
        HR --> TM[Topic Map - 71 topics]
    end
    subgraph Phase 3
        TM --> TS[Select 20 Topics]
        TS --> QG[Gemini Question Generation]
        QG --> QR[Human Review]
        QR --> QJ[20 Questions]
    end
    subgraph Phase 4
        QJ --> AG[Gemini Draft Answers]
        AG --> CC[Confidence Cross-Check]
        CC --> HV[Human Verification]
        HV --> GT[Ground Truth]
    end
    subgraph Phase 5-6
        GT --> VD[Validate Dataset]
        VD --> FD[eval_dataset_v1.0.json]
    end

Dataset Stats

| Metric | Value |
| --- | --- |
| Total Questions | 20 |
| Source Videos | 4 (2 English, 2 Hindi) |
| Factual / Conceptual / Comparative / Application | 6 / 6 / 4 / 4 |
| Easy / Medium / Hard | 6 / 9 / 5 |
| Questions per Video | 5 each |

Quickstart

# Install dependencies
pip install -r requirements.txt

# Validate the dataset
python scripts/validate_dataset.py validate

# Dry-run (score ground truth against itself)
python scripts/validate_dataset.py dry-run

# Score your RAG system's predictions
python scripts/validate_dataset.py score --input predictions.json

Prediction Format

{
  "predictions": [
    {"question_id": "Q001", "answer": "Your RAG system's answer here..."}
  ]
}
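As a rough sketch, a predictions file in the format above could be produced like this. The helper names `build_predictions` and `write_predictions` are hypothetical and not part of this repository; only the `predictions`, `question_id`, and `answer` keys come from the format shown above.

```python
import json
from pathlib import Path


def build_predictions(answers: dict[str, str]) -> dict:
    """Wrap {question_id: answer} pairs in the prediction format shown above."""
    return {
        "predictions": [
            {"question_id": qid, "answer": text.strip()}
            for qid, text in sorted(answers.items())
        ]
    }


def write_predictions(answers: dict[str, str], path: str = "predictions.json") -> Path:
    """Serialize the predictions to a UTF-8 JSON file and return its path."""
    out = Path(path)
    payload = build_predictions(answers)
    out.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
    return out


if __name__ == "__main__":
    # Placeholder answers; replace these with your RAG system's output.
    demo = {"Q002": "Second answer.", "Q001": " First answer. "}
    print(json.dumps(build_predictions(demo), indent=2))
```

The resulting file can then be passed to the `score` command from the Quickstart via `--input`.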

Source Videos

| Video | Creator | Language | Duration |
| --- | --- | --- | --- |
| But what is a Neural Network? | 3Blue1Brown | English | ~19 min |
| Transformers, the tech behind LLMs | 3Blue1Brown | English | ~27 min |
| What is Deep Learning? | CampusX | Hindi | ~67 min |
| All About ML & Deep Learning | CodeWithHarry | Hindi | ~15 min |

Repository Structure

vidqa-rag/
├── deliverables/
│   └── eval_dataset_v1.0.json       # ← THE DELIVERABLE
├── data/
│   ├── ground_truth.json            # Build artifact (Phase 4)
│   ├── questions.json               # Build artifact (Phase 3)
│   ├── topic_map.json               # Build artifact (Phase 2)
│   ├── *_review.csv                 # Human review artifacts
│   ├── raw/                         # Raw transcripts
│   ├── processed/                   # Cleaned transcripts
│   └── translated/                  # Hindi→English translations
├── scripts/
│   ├── extract_transcripts.py       # Phase 1: YouTube → raw JSON
│   ├── translate_transcripts.py     # Phase 1: Hindi → English
│   ├── clean_transcripts.py         # Phase 1: Sentence chunking
│   ├── validate_schema.py           # Phase 1: Transcript validation
│   ├── hybrid_topic_extractor.py    # Phase 2: Topic extraction
│   ├── question_generator.py        # Phase 3: Question generation
│   ├── ground_truth_generator.py    # Phase 4: Answer generation
│   └── validate_dataset.py          # Phase 5: Validation + scoring
├── guide/
│   └── EVAL_GUIDE.md                # How to use the dataset
├── config/settings.yaml
├── schemas/transcript_schema.json
├── results/                         # Scoring output
├── requirements.txt
└── README.md

How It Was Built

graph TD
    A[Phase 1: Transcript Collection] -->|4 videos, 128 min| B[Phase 2: Topic Extraction]
    B -->|71 topics via Gemini + human review| C[Phase 3: Question Design]
    C -->|20 questions, balanced allocation| D[Phase 4: Ground Truth]
    D -->|LLM-drafted, human-verified answers| E[Phase 5: Validation]
    E -->|Schema checks + scoring dry-run| F[Phase 6: Packaging]
    
    style A fill:#e1f5fe
    style B fill:#e8f5e9
    style C fill:#fff3e0
    style D fill:#fce4ec
    style E fill:#f3e5f5
    style F fill:#e0f2f1

Each phase uses the same pattern: automate with Gemini → export CSV → human review → finalize JSON. See docs/ for detailed phase documentation.
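To illustrate the export, review, finalize loop described above, here is a minimal sketch using only the standard library. It is not the repository's implementation: the function names and the `approved` and `reviewer_note` columns are assumptions made for illustration, and the Gemini step is represented simply by a list of draft items.

```python
import csv
import io
import json


def export_for_review(items: list[dict], fields: list[str]) -> str:
    """Render draft items as CSV text with empty review columns appended."""
    buf = io.StringIO()
    # "approved" and "reviewer_note" are hypothetical review columns.
    writer = csv.DictWriter(buf, fieldnames=[*fields, "approved", "reviewer_note"])
    writer.writeheader()
    for item in items:
        row = {field: item.get(field, "") for field in fields}
        row["approved"] = ""
        row["reviewer_note"] = ""
        writer.writerow(row)
    return buf.getvalue()


def finalize(reviewed_csv: str, fields: list[str]) -> str:
    """Keep only rows a reviewer marked 'yes' and return them as JSON text."""
    reader = csv.DictReader(io.StringIO(reviewed_csv))
    kept = [
        {field: row[field] for field in fields}
        for row in reader
        if row["approved"].strip().lower() == "yes"
    ]
    return json.dumps(kept, indent=2, ensure_ascii=False)
```

In this sketch the reviewer edits the CSV in a spreadsheet between the two calls, so the JSON only ever contains rows a human has explicitly approved.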

Metrics

The scoring script computes:

| Metric | Description |
| --- | --- |
| Exact Match | Verbatim match (normalized) against answer + variations |
| Token F1 | Harmonic mean of word-level precision and recall |
| Keyword Recall | Fraction of ground truth keywords in prediction |
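The three metrics can be sketched as follows. This is an illustrative approximation, not the code in `scripts/validate_dataset.py`: the normalization (lowercasing, stripping ASCII punctuation, collapsing whitespace) and the substring-based keyword check are assumptions, and the function names are hypothetical.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed rules)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction: str, answer: str, variations: tuple[str, ...] = ()) -> float:
    """1.0 if the normalized prediction equals the answer or any variation."""
    pred = normalize(prediction)
    return float(any(pred == normalize(ref) for ref in (answer, *variations)))


def token_f1(prediction: str, answer: str) -> float:
    """Harmonic mean of word-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(answer).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared word at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def keyword_recall(prediction: str, keywords: list[str]) -> float:
    """Fraction of keywords found as substrings of the normalized prediction."""
    if not keywords:
        return 0.0
    pred = normalize(prediction)
    hits = sum(1 for keyword in keywords if normalize(keyword) in pred)
    return hits / len(keywords)
```

Under these definitions, scoring a ground truth answer against itself yields 1.0 for Exact Match and Token F1, which is what a dry-run is meant to confirm.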

Requirements

  • Python 3.12+
  • Dependencies: pip install -r requirements.txt
  • Gemini API key in .env (only needed to regenerate data, not for scoring)
