# Chat with Your Video Library
Team:
- Sachin Adlakha (sa9082)
- Subhan Akhtar (sa8580)
We built this Jupyter notebook (`final_submission.ipynb`) to walk through a full RAG pipeline over our course videos, from the raw downloads to a live Gradio chatbot powered by retrieval and generation.
Example prompts from the project description, together with screenshots of their outputs, are included.
A video demonstration can be seen at this link.
Repository layout:

```
final_submission.ipynb    # The main notebook to submit
data/raw/videos/          # Extracted video folders (+ VTT/JSON captions)
clips/                    # Output folder for trimmed MP4 clips
EXAMPLES.md
```
Prerequisites:

- Python 3.10+ and system ffmpeg installed
- Hugging Face access token for dataset downloads
- Qdrant Cloud cluster URL & API key
- OpenRouter (or compatible LLM) API key
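These keys need to be available inside the notebook before the setup cells run. A minimal sketch, assuming the secrets are exported as environment variables (the names `QDRANT_URL`, `QDRANT_API_KEY`, and `COLLECTION_NAME` match variables used later in this README; `OPENROUTER_API_KEY` and the default collection name are our own assumptions):

```python
import os

# Assumed configuration pattern: read secrets from environment variables.
QDRANT_URL = os.environ["QDRANT_URL"]                  # Qdrant Cloud cluster URL
QDRANT_API_KEY = os.environ["QDRANT_API_KEY"]          # Qdrant Cloud API key
COLLECTION_NAME = os.environ.get("COLLECTION_NAME", "video_sentences")  # assumed default name
OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]  # OpenRouter (or compatible LLM) key
```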
We install all required packages directly in the notebook:
```python
!pip install -q \
    webvtt-py spacy==3.7.3 bertopic[visualization] \
    sentence-transformers qdrant-client==1.8.0 ffmpeg-python \
    git+https://github.com/openai/CLIP.git
!python -m spacy download en_core_web_sm
!pip install -q "huggingface_hub[hf_xet]"
```

Before launching the Gradio app, we also install:

```python
!pip install -q gradio ffmpeg-python qdrant-client \
    sentence-transformers transformers torch bitsandbytes
```

We pull down and unpack the webdataset from the Hugging Face Hub:
```python
from huggingface_hub import login, hf_hub_download
from pathlib import Path
import tarfile

login(token="<YOUR_HF_TOKEN>")
dl_path = hf_hub_download(
    repo_id="aegean-ai/ai-lectures-spring-24",
    filename="youtube_dataset.tar",
    repo_type="dataset",
)
Path("data/raw").mkdir(parents=True, exist_ok=True)
tarfile.open(dl_path).extractall("data/raw")
```

After extraction, `data/raw/videos/` contains one folder per YouTube video, each with an `.mp4` and an associated `.vtt` or `.json` caption file.
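As a quick sanity check (our own addition, not part of the notebook's required flow), you can confirm that each extracted folder holds a video plus a caption file:

```python
from pathlib import Path

# List a few extracted folders and count their video/caption files.
for vid_dir in sorted(Path("data/raw/videos").iterdir())[:5]:
    mp4s = list(vid_dir.glob("*.mp4"))
    caps = list(vid_dir.glob("*.vtt")) + list(vid_dir.glob("*.json"))
    print(f"{vid_dir.name}: {len(mp4s)} video(s), {len(caps)} caption file(s)")
```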
We define a helper `load_captions(path: Path)` that yields `(text, start, end)` tuples from either WebVTT or Whisper JSON. Then we iterate through every video directory:
```python
from pathlib import Path
from webvtt import WebVTT
import json, pandas as pd

def load_captions(path: Path):
    """Yield (text, start, end) tuples from a WebVTT or Whisper-JSON caption file."""
    if path.suffix == ".vtt":
        for cue in WebVTT().read(path):
            yield cue.text.strip(), cue.start_in_seconds, cue.end_in_seconds
    else:
        with open(path) as f:
            data = json.load(f)
        for seg in data.get("segments", []):
            yield seg["text"].strip(), seg["start"], seg["end"]

rows = []
for vid_dir in Path("data/raw/videos").iterdir():
    # the glob pattern matches both .vtt and .json caption files
    for cap_file in vid_dir.glob("*.[vj][sot]*"):
        for text, start, end in load_captions(cap_file):
            rows.append({"video": vid_dir.name, "text": text, "start": start, "end": end})
df = pd.DataFrame(rows)
```

We merge cues separated by less than 1 second into blobs, then drop exact duplicates:
```python
# merge consecutive cues into blobs
merged = []
# (see notebook code for merging logic)
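# Illustrative sketch (our assumption; the notebook's exact merging may differ):
# walk the cues of each video in time order and glue together any cue that
# starts less than 1 second after the previous one ended.
for vid, grp in df.sort_values(["video", "start"]).groupby("video", sort=False):
    cur_text = cur_start = cur_end = None
    for row in grp.itertuples(index=False):
        if cur_text is not None and row.start - cur_end < 1.0:
            cur_text += " " + row.text   # extend the current blob
            cur_end = row.end
        else:
            if cur_text is not None:
                merged.append((vid, cur_text, cur_start, cur_end))
            cur_text, cur_start, cur_end = row.text, row.start, row.end
    if cur_text is not None:             # flush the last blob of this video
        merged.append((vid, cur_text, cur_start, cur_end))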
df_long = pd.DataFrame(merged, columns=["video", "blob", "start", "end"]) \
            .drop_duplicates("blob")
```

Using spaCy (`en_core_web_sm`), we split each blob into sentences and assign each sentence a proportional slice of the blob's timestamp range:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
records = []
for vid, blob, t0, t1 in df_long.itertuples(index=False):
    doc = nlp(blob)
    spans = list(doc.sents)
    span_dur = (t1 - t0) / max(1, len(spans))
    for i, sent in enumerate(spans):
        records.append({
            "video": vid,
            "sentence": sent.text.strip(),
            "start": t0 + i * span_dur,
            "end": t0 + (i + 1) * span_dur,
        })
sent_df = pd.DataFrame(records)
```

To validate our text pipeline, we sample 30 sentences and run a small UMAP + BERTopic model:
```python
from umap import UMAP
from bertopic import BERTopic

sample = sent_df.sample(30, random_state=42)
umap_small = UMAP(n_neighbors=10, n_components=5, metric="cosine", random_state=42)
topic_model = BERTopic(umap_model=umap_small, calculate_probabilities=False, verbose=False)
topics, _ = topic_model.fit_transform(sample.sentence.tolist())
```

If this step succeeds, we move on to full-scale embedding.
We switch to CLIP-based embeddings for richer semantic features:
```python
from sentence_transformers import SentenceTransformer

clip_model = SentenceTransformer("clip-ViT-B-32")

def truncate(text, max_tok=75):
    # CLIP's text encoder accepts at most 77 tokens, so we clip each sentence first
    ids = clip_model.tokenizer(text)["input_ids"][:max_tok]
    return clip_model.tokenizer.decode(ids, skip_special_tokens=True)

short_txts = [truncate(t) for t in sent_df.sentence]

topic_model = BERTopic(
    embedding_model=clip_model,
    language="multilingual",
    min_topic_size=5,
    verbose=True,
)
topics, probs = topic_model.fit_transform(short_txts)
```

We encode all sentences, build a metadata DataFrame, then push it to our Qdrant Cloud cluster:
```python
from qdrant_client import QdrantClient, models

embeddings = clip_model.encode(short_txts, show_progress_bar=True, convert_to_numpy=True)
meta = pd.DataFrame({
    "video": sent_df.video,
    "start": sent_df.start,
    "end": sent_df.end,
    "sentence": short_txts,
    "topic": topics,
})

client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)
client.recreate_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=models.VectorParams(size=embeddings.shape[1], distance=models.Distance.COSINE),
)
client.upload_collection(
    collection_name=COLLECTION_NAME,
    vectors=embeddings.tolist(),
    payload=meta.to_dict("records"),
    ids=list(range(len(embeddings))),
)
print(f"✅ Uploaded {len(embeddings)} vectors to Qdrant collection '{COLLECTION_NAME}'")
```

We test a sample query, merge the hit time ranges by video, trim clips via ffmpeg-python, and display them:
```python
from IPython.display import Video, display

results = client.search(
    collection_name=COLLECTION_NAME,
    query_vector=clip_model.encode([QUESTION])[0],
    limit=5,
)
# merge hits per video → (t0, t1)
# ffmpeg trim into clips/ folder
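# Illustrative sketch (our assumption; see the notebook for the exact logic):
# group the hits by video, take the min/max timestamps per video, and cut
# that range out of the source .mp4 with ffmpeg-python.
from collections import defaultdict
from pathlib import Path
import ffmpeg

spans = defaultdict(list)
for hit in results:
    spans[hit.payload["video"]].append((hit.payload["start"], hit.payload["end"]))

Path("clips").mkdir(exist_ok=True)
for video, ts in spans.items():
    t0, t1 = min(s for s, _ in ts), max(e for _, e in ts)
    src = next((Path("data/raw/videos") / video).glob("*.mp4"))
    out_fp = Path("clips") / f"{video}_{int(t0)}_{int(t1)}.mp4"
    (
        ffmpeg
        .input(str(src), ss=t0, t=t1 - t0)
        .output(str(out_fp), c="copy")   # stream copy keeps trimming fast
        .overwrite_output()
        .run(quiet=True)
    )
# out_fp now holds the path of the last trimmed clip; the notebook displays each clip.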
display(Video(str(out_fp), embed=True))
```

Finally, we assemble a live Gradio interface:
```python
import gradio as gr

with gr.Blocks(title="Video Q&A with Multi-Video View") as demo:
    gr.Markdown("# Video Q&A with Multi-Video View")
    question = gr.Textbox(label="Your Question", placeholder="e.g., ...", lines=2)
    examples = gr.Examples([...], inputs=question)
    gallery = gr.Gallery(label="Top Relevant Clips", type="filepath", columns=1)
    analysis = gr.Markdown(label="Analysis & Response")
    submit = gr.Button("Submit", variant="primary")
    submit.click(answer_query, inputs=question, outputs=[gallery, analysis])

if __name__ == "__main__":
    demo.launch(share=True)
```

The `answer_query` callback defined in the notebook ties retrieval, clip trimming, and LLM analysis together; a rough sketch of it is given at the end of this README.

To run the project:

- Execute all cells in order: dependencies → ETL → featurization → retrieval → Gradio app.
- Open the Gradio link, ask your questions, and watch clips plus AI analysis.
- Inspect `clips/` for the trimmed MP4 segments.
Don't forget to set your `<YOUR_HF_TOKEN>`, Qdrant credentials (`QDRANT_URL`, `QDRANT_API_KEY`), and LLM API key before running the notebook.
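For reference, the `answer_query` callback wired into the Gradio app above could look roughly like the sketch below. This is our own illustration, not the notebook's exact implementation: it reuses the CLIP/Qdrant retrieval and ffmpeg trimming shown earlier, and leaves the OpenRouter LLM call as a placeholder.

```python
from collections import defaultdict
from pathlib import Path
import ffmpeg

def answer_query(question: str):
    """Hypothetical sketch of the Gradio callback: retrieve, trim, analyse."""
    # 1) Retrieve the top segments for the question (same CLIP + Qdrant search as above).
    hits = client.search(
        collection_name=COLLECTION_NAME,
        query_vector=clip_model.encode([question])[0],
        limit=5,
    )

    # 2) Merge hit time ranges per video and trim one clip per video with ffmpeg-python.
    spans = defaultdict(list)
    for hit in hits:
        spans[hit.payload["video"]].append((hit.payload["start"], hit.payload["end"]))
    Path("clips").mkdir(exist_ok=True)
    clip_paths = []
    for video, ts in spans.items():
        t0, t1 = min(s for s, _ in ts), max(e for _, e in ts)
        src = next((Path("data/raw/videos") / video).glob("*.mp4"))
        out_fp = Path("clips") / f"{video}_{int(t0)}_{int(t1)}.mp4"
        ffmpeg.input(str(src), ss=t0, t=t1 - t0).output(str(out_fp), c="copy").overwrite_output().run(quiet=True)
        clip_paths.append(str(out_fp))

    # 3) Build the analysis text. The notebook sends the retrieved sentences to the
    #    LLM (OpenRouter); here we only show the retrieved context as a placeholder.
    retrieved = "\n".join(f"- {h.payload['sentence']}" for h in hits)
    analysis = f"**Question:** {question}\n\n**Retrieved context:**\n{retrieved}"

    # Gradio expects (items for the Gallery, markdown for the Analysis panel).
    return clip_paths, analysis
```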