mayo3030/edx-scraper

edX Video Scraper & Downloader

A Python toolkit for downloading edX course videos as MP4 files with their real names.

Features

  • Capture video URLs - Automatically navigates through course pages and captures HLS stream URLs
  • Download as MP4 - Parallel downloads with real video/module names
  • Session management - Save browser login state for authenticated access
  • Course scraping - Extract course content, structure, and text

Requirements

  • Python 3.10+
  • Playwright
  • yt-dlp

Installation

# Clone the repository
git clone https://github.com/mayo3030/edx-scraper.git
cd edx-scraper

# Install dependencies
pip install -r requirements.txt

# Install Playwright browser
playwright install chromium

Quick Start

Step 1: Login & Save Session

python login.py

A browser will open. Log into edX, then press Enter to save your session.
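
Saving a login with Playwright usually means exporting the browser context's storage state (cookies and local storage) to a JSON file. The sketch below shows that general pattern and is not the code in login.py. The function names save_session and context_options, and the login_url parameter, are hypothetical; only the file name edx_storage_state.json comes from the Notes section.

```python
from pathlib import Path

# File name taken from the Notes section; everything else here is illustrative.
STATE_FILE = Path("edx_storage_state.json")


def save_session(login_url: str, state_file: Path = STATE_FILE) -> None:
    """Open a visible browser, wait for a manual login, then persist the session."""
    # Imported lazily so this module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()
        page = context.new_page()
        page.goto(login_url)
        input("Log in to edX in the browser window, then press Enter here... ")
        # Writes cookies and local storage to a JSON file.
        context.storage_state(path=str(state_file))
        browser.close()


def context_options(state_file: Path = STATE_FILE) -> dict:
    """Keyword arguments for browser.new_context() that reuse a saved session, if any."""
    return {"storage_state": str(state_file)} if state_file.exists() else {}
```

A later script could then create an authenticated context with browser.new_context(**context_options()), falling back to a fresh context when no saved state exists.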

Step 2: Capture Video URLs

python capture_videos.py --auto --out output

This will:

  • Open a browser and navigate through all course modules
  • Capture video stream URLs with their real names
  • Save to output/captured_videos.json
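
One plausible way to capture stream URLs is to listen to the page's network requests and keep those whose path ends in .m3u8, the usual extension for HLS playlists. The sketch below assumes that approach; capture_videos.py may work differently. StreamCollector, is_hls_url, and the "name" and "url" keys are hypothetical, since the actual layout of captured_videos.json is not documented here.

```python
from urllib.parse import urlparse


def is_hls_url(url: str) -> bool:
    """Return True if the URL path looks like an HLS playlist (.m3u8)."""
    # Only the path is checked, so query strings such as tokens are ignored.
    return urlparse(url).path.lower().endswith(".m3u8")


class StreamCollector:
    """Collects unique HLS playlist URLs, tagged with the current page name."""

    def __init__(self) -> None:
        self.videos: list[dict[str, str]] = []
        self._seen: set[str] = set()
        self.current_name = "untitled"

    def on_request(self, request) -> None:
        # Playwright passes a Request object; only its .url attribute is used.
        url = request.url
        if is_hls_url(url) and url not in self._seen:
            self._seen.add(url)
            self.videos.append({"name": self.current_name, "url": url})


# Assumed hookup inside a Playwright session:
#   collector = StreamCollector()
#   page.on("request", collector.on_request)
#   collector.current_name = page.title()
```

Deduplicating by URL matters because a player typically requests the same playlist more than once while a page is open.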

Step 3: Download Videos

python download_videos.py --urls output/captured_videos.json --out output/videos --workers 5

Videos will be saved with real names like:

001_Module_1_Introduction.mp4
002_Week_2_Classification.mp4
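
Names of this shape can be produced from a position and a module title. The helper below is a minimal sketch of one way to do it, not the project's actual naming code; make_filename is a hypothetical name, and the rule of keeping only ASCII letters, digits, and hyphens is an assumption.

```python
import re


def make_filename(index: int, title: str, ext: str = "mp4") -> str:
    """Build a name like 001_Module_1_Introduction.mp4 from a position and a title."""
    # Replace each run of characters other than ASCII letters, digits, or hyphens with "_".
    slug = re.sub(r"[^A-Za-z0-9-]+", "_", title).strip("_")
    # Fall back to a generic name if nothing usable is left in the title.
    return f"{index:03d}_{slug or 'video'}.{ext}"


print(make_filename(1, "Module 1: Introduction"))  # 001_Module_1_Introduction.mp4
```

The zero-padded prefix keeps files in course order when sorted by name. Titles with non-ASCII characters would lose them under this rule, so a real implementation may want a more permissive pattern.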

Alternative: Use Batch Files (Windows)

  1. Double-click RUN_CAPTURE.bat
  2. Wait for capture to complete
  3. Double-click RUN_DOWNLOAD.bat

Project Structure

edx-scraper/
├── login.py              # Save browser session
├── capture_videos.py     # Capture video URLs from course
├── download_videos.py    # Download videos as MP4
├── scrape.py             # Scrape course text content
├── edx_scraper/          # Core library
│   ├── __init__.py
│   └── scraper.py
├── RUN_CAPTURE.bat       # Windows batch for capture
├── RUN_DOWNLOAD.bat      # Windows batch for download
├── requirements.txt
└── README.md

Command Options

capture_videos.py

--auto          Auto-navigate through modules
--out           Output directory (default: output)
--course-url    Course URL to scrape
--max-pages     Maximum pages to visit (default: 100)

download_videos.py

--urls          URL file (json or txt)
--out           Output directory for videos
--workers       Parallel downloads (default: 5)
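
The --workers option suggests a pool of concurrent downloads. The sketch below shows one way such a pool could be built, with one yt-dlp process per video and a thread pool to cap concurrency; it is not the code in download_videos.py. The function names, the "name" and "url" keys, and the assumption that names are already filesystem-safe are all hypothetical. The only yt-dlp option used is -o, which sets the output path.

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_entries(path: Path) -> list[dict]:
    """Read a JSON list of {"name": ..., "url": ...} entries (assumed layout)."""
    return json.loads(path.read_text(encoding="utf-8"))


def build_command(url: str, dest: Path) -> list[str]:
    """yt-dlp invocation that downloads one stream to the given output path."""
    return ["yt-dlp", "-o", str(dest), url]


def download_all(entries: list[dict], out_dir: Path, workers: int = 5) -> list[int]:
    """Run one yt-dlp process per entry, at most `workers` at a time; return exit codes."""
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = [
        # Names are assumed to be filesystem-safe already.
        build_command(e["url"], out_dir / f"{i:03d}_{e['name']}.mp4")
        for i, e in enumerate(entries, start=1)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Threads are sufficient here: each one only waits on an external process.
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
```

A call such as download_all(load_entries(Path("output/captured_videos.json")), Path("output/videos")) would mirror the Quick Start command, and a nonzero entry in the returned list marks a download that failed.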

Notes

  • Your login session (edx_storage_state.json) is gitignored - never commit this file
  • Video URLs expire after some time, so download soon after capturing
  • Some courses may have DRM protection that prevents downloading

License

MIT License - For educational purposes only.

About

Python scraper for online course platforms and education data collection.
