Sachin1801/Course-RAG-Assistant

# Chat with Your Video Library

Team:

  • Sachin Adlakha (sa9082)
  • Subhan Akhtar (sa8580)

We built this Jupyter notebook (final_submission.ipynb) to walk through a full RAG pipeline over our course videos, from raw downloads to a live Gradio chatbot powered by retrieval and generation.


Example Prompts

Per the project description, we include example prompts with screenshots attached (see EXAMPLES.md).

The video demonstration can be seen at this link.


Repository Layout

final_submission.ipynb   # The main notebook to submit
data/raw/videos/         # Extracted video folders (+ VTT/JSON captions)
clips/                   # Output folder for trimmed MP4 clips
EXAMPLES.md              # Example prompts and screenshots

Prerequisites

  • Python 3.10+ and system ffmpeg installed
  • Hugging Face access token for dataset downloads
  • Qdrant Cloud cluster URL & API key
  • OpenRouter (or compatible LLM) API key

1) Environment & Dependencies

We install all required packages directly in the notebook:

!pip install -q \
    webvtt-py spacy==3.7.3 bertopic[visualization] \
    sentence-transformers qdrant-client==1.8.0 ffmpeg-python \
    git+https://github.com/openai/CLIP.git
!python -m spacy download en_core_web_sm
!pip install -q "huggingface_hub[hf_xet]"

Before launching the Gradio app, we also install:

!pip install -q gradio ffmpeg-python qdrant-client \
    sentence-transformers transformers torch bitsandbytes

2) Download & Extract Videos

We pull down and unpack the webdataset from Hugging Face Hub:

from huggingface_hub import login, hf_hub_download
from pathlib import Path
import tarfile

login(token="<YOUR_HF_TOKEN>")

dl_path = hf_hub_download(
    repo_id="aegean-ai/ai-lectures-spring-24",
    filename="youtube_dataset.tar",
    repo_type="dataset"
)
Path("data/raw").mkdir(parents=True, exist_ok=True)
tarfile.open(dl_path).extractall("data/raw")

After extraction, data/raw/videos/ contains one folder per YouTube video, each with an .mp4 and associated .vtt or .json caption file.


3) Load Captions

We define a helper load_captions(path: Path) that yields (text, start, end) tuples from either WebVTT or Whisper JSON. Then we iterate through every video directory:

from webvtt import WebVTT
import json, pandas as pd

def load_captions(path: Path):
    """Yield (text, start, end) tuples from a WebVTT or Whisper-style JSON file."""
    if path.suffix == ".vtt":
        for cue in WebVTT().read(str(path)):
            yield cue.text.strip(), cue.start_in_seconds, cue.end_in_seconds
    else:
        with open(path) as f:
            data = json.load(f)
        for seg in data.get("segments", []):
            yield seg["text"].strip(), seg["start"], seg["end"]

rows = []
for vid_dir in Path("data/raw/videos").iterdir():
    # pick up both .vtt and .json caption files
    for cap_file in sorted(vid_dir.glob("*.vtt")) + sorted(vid_dir.glob("*.json")):
        for text, start, end in load_captions(cap_file):
            rows.append({"video": vid_dir.name, "text": text, "start": start, "end": end})
df = pd.DataFrame(rows)

4) Merge & De-duplicate Captions

We merge cues separated by less than 1 second into blobs, then drop exact duplicates:

# merge consecutive cues into blobs
merged = []
# (see notebook code for merging logic)
df_long = pd.DataFrame(merged, columns=["video","blob","start","end"]) \
    .drop_duplicates("blob")
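The elided merging logic can be sketched roughly as follows (a minimal illustration of the 1-second rule described above; `merge_cues` is a hypothetical helper, not the notebook's exact code):

```python
def merge_cues(cues, gap=1.0):
    """Merge consecutive (text, start, end) cues whose silence gap is < `gap` seconds."""
    merged = []
    for text, start, end in cues:
        if merged and start - merged[-1][2] < gap:
            # Close enough to the previous cue: extend that blob instead of starting a new one
            prev_text, prev_start, _ = merged[-1]
            merged[-1] = (prev_text + " " + text, prev_start, end)
        else:
            merged.append((text, start, end))
    return merged

# Applied per video, e.g.:
# for vid, group in df.sort_values(["video", "start"]).groupby("video"):
#     for blob, t0, t1 in merge_cues(group[["text", "start", "end"]].itertuples(index=False, name=None)):
#         merged.append((vid, blob, t0, t1))
```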

5) Sentence Segmentation

Using spaCy (en_core_web_sm), we split each blob into sentences, and assign each sentence a proportional slice of the blob's timestamp range:

import spacy
nlp = spacy.load("en_core_web_sm")
records = []
for vid, blob, t0, t1 in df_long.itertuples(index=False):
    doc = nlp(blob)
    spans = list(doc.sents)
    span_dur = (t1 - t0) / max(1, len(spans))
    for i, sent in enumerate(spans):
        records.append({
            "video": vid,
            "sentence": sent.text.strip(),
            "start": t0 + i*span_dur,
            "end": t0 + (i+1)*span_dur
        })
sent_df = pd.DataFrame(records)

6) Quick BERTopic Sanity Check

To validate our text pipeline, we sample 30 sentences and run a small UMAP+BERTopic model:

from umap import UMAP
from bertopic import BERTopic
sample = sent_df.sample(30, random_state=42)
umap_small = UMAP(n_neighbors=10, n_components=5, metric="cosine", random_state=42)
topic_model = BERTopic(umap_model=umap_small, calculate_probabilities=False, verbose=False)
topics, _ = topic_model.fit_transform(sample.sentence.tolist())

If this step succeeds, we move on to full-scale embedding.


7) Full BERTopic with CLIP Embeddings

We switch to CLIP-based embeddings for richer semantic features:

from sentence_transformers import SentenceTransformer
clip_model = SentenceTransformer("clip-ViT-B-32")

def truncate(text, max_tok=75):
    # CLIP's text encoder accepts at most 77 tokens; keep a small safety margin
    ids = clip_model.tokenizer(text)["input_ids"][:max_tok]
    return clip_model.tokenizer.decode(ids, skip_special_tokens=True)

short_txts = [truncate(t) for t in sent_df.sentence]
topic_model = BERTopic(
    embedding_model=clip_model,
    language="multilingual",
    min_topic_size=5,
    verbose=True
)
topics, probs = topic_model.fit_transform(short_txts)

8) Store Vectors in Qdrant

We encode all sentences, build a metadata DataFrame, then push to our Qdrant Cloud cluster:

embeddings = clip_model.encode(short_txts, show_progress_bar=True, convert_to_numpy=True)
meta = pd.DataFrame({
    "video": sent_df.video,
    "start": sent_df.start,
    "end": sent_df.end,
    "sentence": short_txts,
    "topic": topics
})

from qdrant_client import QdrantClient, models
client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
client.recreate_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=models.VectorParams(size=embeddings.shape[1], distance=models.Distance.COSINE)
)
client.upload_collection(
    collection_name=COLLECTION_NAME,
    vectors=embeddings.tolist(),
    payload=meta.to_dict("records"),
    ids=list(range(len(embeddings)))
)
print(f"✅ Uploaded {len(embeddings)} vectors to Qdrant collection '{COLLECTION_NAME}'")

9) Query, Trim & Display Clips

We test a sample query, merge time ranges by video, trim clips via ffmpeg-python, and display them:

from IPython.display import Video, display
results = client.search(
    collection_name=COLLECTION_NAME,
    query_vector=clip_model.encode([QUESTION])[0].tolist(),
    limit=5,
)
# merge hits per video → (t0, t1)
# ffmpeg trim into clips/ folder
display(Video(str(out_fp), embed=True))
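The two commented-out steps can be sketched like this (illustrative only; `merge_windows`, `pad`, and `join_gap` are assumptions, and the ffmpeg call shows the usual ffmpeg-python idiom rather than the notebook's exact code):

```python
def merge_windows(hits, pad=2.0, join_gap=5.0):
    """Group hit payloads by video and merge overlapping/nearby (start, end) ranges."""
    by_video = {}
    for h in hits:  # each payload carries "video", "start", "end"
        by_video.setdefault(h["video"], []).append((h["start"] - pad, h["end"] + pad))
    windows = {}
    for video, spans in by_video.items():
        spans.sort()
        merged = [list(spans[0])]
        for t0, t1 in spans[1:]:
            if t0 - merged[-1][1] <= join_gap:
                merged[-1][1] = max(merged[-1][1], t1)  # extend the current window
            else:
                merged.append([t0, t1])
        windows[video] = [(max(0.0, a), b) for a, b in merged]  # clamp starts at 0
    return windows

# Trimming one (t0, t1) window with ffmpeg-python would then look roughly like:
# import ffmpeg
# ffmpeg.input(str(src_mp4), ss=t0, to=t1).output(str(out_fp), c="copy").run(overwrite_output=True)
```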

10) Gradio App

Finally, we assemble a live Gradio interface:

import gradio as gr
with gr.Blocks(title="Video Q&A with Multi-Video View") as demo:
    gr.Markdown("# Video Q&A with Multi-Video View")
    question = gr.Textbox(label="Your Question", placeholder="e.g., ...", lines=2)
    examples = gr.Examples([...], inputs=question)
    gallery = gr.Gallery(label="Top Relevant Clips", type="filepath", columns=1)
    analysis = gr.Markdown(label="Analysis & Response")
    submit = gr.Button("Submit", variant="primary")
    submit.click(answer_query, inputs=question, outputs=[gallery, analysis])
if __name__ == "__main__":
    demo.launch(share=True)
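The `answer_query` callback wired into the Submit button is defined earlier in the notebook; its shape is roughly the following (a hypothetical sketch with injected `retrieve`/`generate` functions standing in for the Qdrant search and the OpenRouter LLM call):

```python
def answer_query(question, retrieve=None, generate=None):
    """Return (clip_paths, markdown_analysis) for the two Gradio outputs.

    retrieve(question) -> list of (clip_path, sentence) pairs from the vector store,
    generate(prompt)   -> answer text from the LLM.
    """
    hits = retrieve(question)
    clips = [path for path, _ in hits]
    # Concatenate the retrieved sentences as context for the LLM prompt
    context = "\n".join(sentence for _, sentence in hits)
    answer = generate(f"Context:\n{context}\n\nQuestion: {question}")
    return clips, f"### Analysis & Response\n\n{answer}"
```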

How to Run

  1. Execute all cells in order: dependencies → ETL → featurization → retrieval → Gradio app.
  2. Open the Gradio link, ask your questions, and watch clips plus AI analysis.
  3. Inspect clips/ for the trimmed MP4 segments.

Don't forget to set your <YOUR_HF_TOKEN>, QDRANT_URL, QDRANT_API_KEY, COLLECTION_NAME, and OpenRouter API key before running.

About

A custom RAG agent that saves you time on long lectures by finding the relevant videos, and the relevant information within them, for you.
