# Chat with Your Video Library
Team:
- Sachin Adlakha (sa9082)
- Subhan Akhtar (sa8580)
We built this Jupyter notebook (`final_submission.ipynb`) to walk through a full RAG pipeline over our course videos, from the raw downloads to a live Gradio chatbot powered by retrieval and generation.
Example prompts from the project description, together with screenshots of their outputs, are included.
A video demonstration can be seen at this link.
Repository layout:

```
final_submission.ipynb    # The main notebook to submit
data/raw/videos/          # Extracted video folders (+ VTT/JSON captions)
clips/                    # Output folder for trimmed MP4 clips
EXAMPLES.md
```
Prerequisites:

- Python 3.10+ and system ffmpeg installed
- Hugging Face access token for dataset downloads
- Qdrant Cloud cluster URL & API key
- OpenRouter (or compatible LLM) API key
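These keys need to be available inside the notebook before the setup cells run. A minimal sketch, assuming the secrets are exported as environment variables (the names `QDRANT_URL`, `QDRANT_API_KEY`, and `COLLECTION_NAME` match variables used later in this README; `OPENROUTER_API_KEY` and the default collection name are our own assumptions):

```python
import os

# Assumed configuration pattern: read secrets from environment variables.
QDRANT_URL = os.environ["QDRANT_URL"]                  # Qdrant Cloud cluster URL
QDRANT_API_KEY = os.environ["QDRANT_API_KEY"]          # Qdrant Cloud API key
COLLECTION_NAME = os.environ.get("COLLECTION_NAME", "video_sentences")  # assumed default name
OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]  # OpenRouter (or compatible LLM) key
```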
We install all required packages directly in the notebook:
```python
!pip install -q \
    webvtt-py spacy==3.7.3 bertopic[visualization] \
    sentence-transformers qdrant-client==1.8.0 ffmpeg-python \
    git+https://github.com/openai/CLIP.git
!python -m spacy download en_core_web_sm
!pip install -q "huggingface_hub[hf_xet]"
```

Before launching the Gradio app, we also install:

```python
!pip install -q gradio ffmpeg-python qdrant-client \
    sentence-transformers transformers torch bitsandbytes
```

We pull down and unpack the webdataset from the Hugging Face Hub:
```python
from huggingface_hub import login, hf_hub_download
from pathlib import Path
import tarfile

login(token="<YOUR_HF_TOKEN>")
dl_path = hf_hub_download(
    repo_id="aegean-ai/ai-lectures-spring-24",
    filename="youtube_dataset.tar",
    repo_type="dataset",
)
Path("data/raw").mkdir(parents=True, exist_ok=True)
tarfile.open(dl_path).extractall("data/raw")
```

After extraction, `data/raw/videos/` contains one folder per YouTube video, each with an `.mp4` and an associated `.vtt` or `.json` caption file.
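As a quick sanity check (our own addition, not part of the notebook's required flow), you can confirm that each extracted folder holds a video plus a caption file:

```python
from pathlib import Path

# List a few extracted folders and count their video/caption files.
for vid_dir in sorted(Path("data/raw/videos").iterdir())[:5]:
    mp4s = list(vid_dir.glob("*.mp4"))
    caps = list(vid_dir.glob("*.vtt")) + list(vid_dir.glob("*.json"))
    print(f"{vid_dir.name}: {len(mp4s)} video(s), {len(caps)} caption file(s)")
```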
We define a helper `load_captions(path: Path)` that yields `(text, start, end)` tuples from either WebVTT or Whisper JSON. Then we iterate through every video directory:
```python
from pathlib import Path
from webvtt import WebVTT
import json, pandas as pd

def load_captions(path: Path):
    """Yield (text, start, end) tuples from a WebVTT or Whisper-JSON caption file."""
    if path.suffix == ".vtt":
        for cue in WebVTT().read(path):
            yield cue.text.strip(), cue.start_in_seconds, cue.end_in_seconds
    else:
        with open(path) as f:
            data = json.load(f)
        for seg in data.get("segments", []):
            yield seg["text"].strip(), seg["start"], seg["end"]

rows = []
for vid_dir in Path("data/raw/videos").iterdir():
    # the glob pattern matches both .vtt and .json caption files
    for cap_file in vid_dir.glob("*.[vj][sot]*"):
        for text, start, end in load_captions(cap_file):
            rows.append({"video": vid_dir.name, "text": text, "start": start, "end": end})
df = pd.DataFrame(rows)
```

We merge cues separated by less than 1 second into blobs, then drop exact duplicates:
```python
# merge consecutive cues into blobs
merged = []
# (see notebook code for merging logic)
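# Illustrative sketch (our assumption; the notebook's exact merging may differ):
# walk the cues of each video in time order and glue together any cue that
# starts less than 1 second after the previous one ended.
for vid, grp in df.sort_values(["video", "start"]).groupby("video", sort=False):
    cur_text = cur_start = cur_end = None
    for row in grp.itertuples(index=False):
        if cur_text is not None and row.start - cur_end < 1.0:
            cur_text += " " + row.text   # extend the current blob
            cur_end = row.end
        else:
            if cur_text is not None:
                merged.append((vid, cur_text, cur_start, cur_end))
            cur_text, cur_start, cur_end = row.text, row.start, row.end
    if cur_text is not None:             # flush the last blob of this video
        merged.append((vid, cur_text, cur_start, cur_end))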
df_long = pd.DataFrame(merged, columns=["video", "blob", "start", "end"]) \
            .drop_duplicates("blob")
```

Using spaCy (`en_core_web_sm`), we split each blob into sentences and assign each sentence a proportional slice of the blob's timestamp range:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
records = []
for vid, blob, t0, t1 in df_long.itertuples(index=False):
    doc = nlp(blob)
    spans = list(doc.sents)
    span_dur = (t1 - t0) / max(1, len(spans))
    for i, sent in enumerate(spans):
        records.append({
            "video": vid,
            "sentence": sent.text.strip(),
            "start": t0 + i * span_dur,
            "end": t0 + (i + 1) * span_dur,
        })
sent_df = pd.DataFrame(records)
```

To validate our text pipeline, we sample 30 sentences and run a small UMAP + BERTopic model:
```python
from umap import UMAP
from bertopic import BERTopic

sample = sent_df.sample(30, random_state=42)
umap_small = UMAP(n_neighbors=10, n_components=5, metric="cosine", random_state=42)
topic_model = BERTopic(umap_model=umap_small, calculate_probabilities=False, verbose=False)
topics, _ = topic_model.fit_transform(sample.sentence.tolist())
```

If this step succeeds, we move on to full-scale embedding.
We switch to CLIP-based embeddings for richer semantic features:
```python
from sentence_transformers import SentenceTransformer

clip_model = SentenceTransformer("clip-ViT-B-32")

def truncate(text, max_tok=75):
    # CLIP's text encoder accepts at most 77 tokens, so we clip each sentence first
    ids = clip_model.tokenizer(text)["input_ids"][:max_tok]
    return clip_model.tokenizer.decode(ids, skip_special_tokens=True)

short_txts = [truncate(t) for t in sent_df.sentence]

topic_model = BERTopic(
    embedding_model=clip_model,
    language="multilingual",
    min_topic_size=5,
    verbose=True,
)
topics, probs = topic_model.fit_transform(short_txts)
```

We encode all sentences, build a metadata DataFrame, then push it to our Qdrant Cloud cluster:
```python
from qdrant_client import QdrantClient, models

embeddings = clip_model.encode(short_txts, show_progress_bar=True, convert_to_numpy=True)
meta = pd.DataFrame({
    "video": sent_df.video,
    "start": sent_df.start,
    "end": sent_df.end,
    "sentence": short_txts,
    "topic": topics,
})

client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
client.recreate_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=models.VectorParams(size=embeddings.shape[1], distance=models.Distance.COSINE),
)
client.upload_collection(
    collection_name=COLLECTION_NAME,
    vectors=embeddings.tolist(),
    payload=meta.to_dict("records"),
    ids=list(range(len(embeddings))),
)
print(f"✅ Uploaded {len(embeddings)} vectors to Qdrant collection '{COLLECTION_NAME}'")
```

We test a sample query, merge the hit time ranges by video, trim clips via ffmpeg-python, and display them:
```python
from IPython.display import Video, display

results = client.search(
    collection_name=COLLECTION_NAME,
    query_vector=clip_model.encode([QUESTION])[0],
    limit=5,
)
# merge hits per video → (t0, t1)
# ffmpeg trim into clips/ folder
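# Illustrative sketch (our assumption; see the notebook for the exact logic):
# group the hits by video, take the min/max timestamps per video, and cut
# that range out of the source .mp4 with ffmpeg-python.
from collections import defaultdict
from pathlib import Path
import ffmpeg

spans = defaultdict(list)
for hit in results:
    spans[hit.payload["video"]].append((hit.payload["start"], hit.payload["end"]))

Path("clips").mkdir(exist_ok=True)
for video, ts in spans.items():
    t0, t1 = min(s for s, _ in ts), max(e for _, e in ts)
    src = next((Path("data/raw/videos") / video).glob("*.mp4"))
    out_fp = Path("clips") / f"{video}_{int(t0)}_{int(t1)}.mp4"
    (
        ffmpeg
        .input(str(src), ss=t0, t=t1 - t0)
        .output(str(out_fp), c="copy")   # stream copy keeps trimming fast
        .overwrite_output()
        .run(quiet=True)
    )
# out_fp now holds the path of the last trimmed clip; the notebook displays each clip.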
display(Video(str(out_fp), embed=True))
```

Finally, we assemble a live Gradio interface:
```python
import gradio as gr

with gr.Blocks(title="Video Q&A with Multi-Video View") as demo:
    gr.Markdown("# Video Q&A with Multi-Video View")
    question = gr.Textbox(label="Your Question", placeholder="e.g., ...", lines=2)
    examples = gr.Examples([...], inputs=question)
    gallery = gr.Gallery(label="Top Relevant Clips", type="filepath", columns=1)
    analysis = gr.Markdown(label="Analysis & Response")
    submit = gr.Button("Submit", variant="primary")
    submit.click(answer_query, inputs=question, outputs=[gallery, analysis])

if __name__ == "__main__":
    demo.launch(share=True)
```

The `answer_query` callback defined in the notebook ties retrieval, clip trimming, and LLM analysis together; a rough sketch of it is given at the end of this README.

To run the project:

- Execute all cells in order: dependencies → ETL → featurization → retrieval → Gradio app.
- Open the Gradio link, ask your questions, and watch clips plus AI analysis.
- Inspect `clips/` for the trimmed MP4 segments.
Don't forget to set your `<YOUR_HF_TOKEN>`, Qdrant credentials (`QDRANT_URL`, `QDRANT_API_KEY`), and LLM API key before running the notebook.
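For reference, the `answer_query` callback wired into the Gradio app above could look roughly like the sketch below. This is our own illustration, not the notebook's exact implementation: it reuses the CLIP/Qdrant retrieval and ffmpeg trimming shown earlier, and leaves the OpenRouter LLM call as a placeholder.

```python
from collections import defaultdict
from pathlib import Path
import ffmpeg

def answer_query(question: str):
    """Hypothetical sketch of the Gradio callback: retrieve, trim, analyse."""
    # 1) Retrieve the top segments for the question (same CLIP + Qdrant search as above).
    hits = client.search(
        collection_name=COLLECTION_NAME,
        query_vector=clip_model.encode([question])[0],
        limit=5,
    )

    # 2) Merge hit time ranges per video and trim one clip per video with ffmpeg-python.
    spans = defaultdict(list)
    for hit in hits:
        spans[hit.payload["video"]].append((hit.payload["start"], hit.payload["end"]))
    Path("clips").mkdir(exist_ok=True)
    clip_paths = []
    for video, ts in spans.items():
        t0, t1 = min(s for s, _ in ts), max(e for _, e in ts)
        src = next((Path("data/raw/videos") / video).glob("*.mp4"))
        out_fp = Path("clips") / f"{video}_{int(t0)}_{int(t1)}.mp4"
        ffmpeg.input(str(src), ss=t0, t=t1 - t0).output(str(out_fp), c="copy").overwrite_output().run(quiet=True)
        clip_paths.append(str(out_fp))

    # 3) Build the analysis text. The notebook sends the retrieved sentences to the
    #    LLM (OpenRouter); here we only show the retrieved context as a placeholder.
    retrieved = "\n".join(f"- {h.payload['sentence']}" for h in hits)
    analysis = f"**Question:** {question}\n\n**Retrieved context:**\n{retrieved}"

    # Gradio expects (items for the Gallery, markdown for the Analysis panel).
    return clip_paths, analysis
```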