A sophisticated AI-powered web application that translates video content across multiple languages using advanced machine learning models. The platform leverages Whisper for speech recognition, NLLB-200 for translation, and RAG (Retrieval-Augmented Generation) for context-aware processing.
- Video Upload & Processing - Support for multiple video formats (MP4, AVI, MOV, MKV, WAV)
- Multilingual Translation - Translate videos to and from numerous languages
- AI-Powered Intelligence - Summarize and analyze video transcripts with Google Gemini
- Automatic Subtitle Generation - Generate and embed translated subtitles directly into videos
- Fast Processing - Efficient algorithms for rapid video translation
- Content Generation - Instantly get summaries of the videos using Generative AI
- Easy Download - Get your translated videos with embedded subtitles
| Component | Technology |
|---|---|
| Backend Framework | Flask (Python) |
| Speech Recognition | OpenAI Whisper |
| Translation Model | Facebook NLLB-200 |
| Content Summary | Google Gemini API (gemini-2.5-flash) |
| Video Processing | FFmpeg |
| Frontend | HTML5, CSS3, JavaScript, Bootstrap 5 |
| ML Libraries | PyTorch, Transformers, HuggingFace |
┌─────────────┐
│  Frontend   │  User uploads video
│  (Browser)  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Flask API  │  Receives video file
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Whisper   │  Extracts audio & transcribes
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Gemini API  │  Context analysis & summarization
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  NLLB-200   │  Translates content
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   FFmpeg    │  Embeds subtitles & generates video
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    User     │  Downloads translated video & reads summary
└─────────────┘
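The data flow in the diagram can be sketched as a small orchestration function. This is an illustrative outline, not the code in `app.py`: the `Segment` and `PipelineResult` types and the `transcribe`, `summarize`, and `translate` callables are hypothetical stand-ins for the Whisper, Gemini, and NLLB-200 stages. It also assumes the summary is produced from the original-language transcript, as the stage order in the diagram suggests.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """One timed piece of speech (hypothetical type for this sketch)."""
    start: float
    end: float
    text: str


@dataclass
class PipelineResult:
    segments: List[Segment]  # translated segments, original timing preserved
    summary: str             # summary of the original transcript


def run_pipeline(
    video_path: str,
    transcribe: Callable[[str], List[Segment]],
    summarize: Callable[[str], str],
    translate: Callable[[str], str],
) -> PipelineResult:
    """Chain the stages in the order shown in the diagram."""
    segments = transcribe(video_path)                 # Whisper stage
    transcript = " ".join(s.text for s in segments)
    summary = summarize(transcript)                   # Gemini stage
    translated = [
        Segment(s.start, s.end, translate(s.text))    # NLLB-200 stage
        for s in segments
    ]
    return PipelineResult(translated, summary)
```

Passing the stages in as callables keeps the sketch testable without the ML models installed; the real stages would be swapped in where the stubs are.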
- Python 3.8 or higher
- FFmpeg installed on your system
- 8GB+ RAM (16GB recommended for optimal performance)
- 5GB+ free disk space for ML models
- Internet connection (for initial model download)
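Two of the prerequisites above (the Python version and FFmpeg) can be checked before launching. The following is a minimal sketch using only the standard library; `check_prerequisites` is a hypothetical helper, not part of the repository.

```python
import shutil
import sys
from typing import List, Tuple


def check_prerequisites(min_python: Tuple[int, int] = (3, 8)) -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if sys.version_info < min_python:
        found = sys.version.split()[0]
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ is required, found {found}"
        )
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```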
macOS / Linux:
chmod +x run.sh
./run.sh
Windows:
run.bat
# Clone the repository
git clone https://github.com/raghulpranxsh/CrossLingualAI.git
cd CrossLingualAI
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
# Create necessary directories
mkdir -p uploads outputs
# Run the application
python app.py
On the first run, the application will automatically download required ML models:
- Whisper Base Model (~150MB)
- NLLB-200 Translation Model (~1.2GB)
Note: This initial download may take 10-15 minutes depending on your internet speed. Models are cached locally for subsequent runs.
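To show what loading and calling the translation model might look like, here is a hedged sketch using the Hugging Face Transformers API. The checkpoint name `facebook/nllb-200-distilled-600M`, the `NLLB_CODES` mapping, and both helper functions are assumptions made for illustration; the project may load a different NLLB-200 variant and map languages differently. NLLB-200 expects FLORES-200 language codes such as `eng_Latn` rather than short codes like `en`.

```python
from typing import Dict

# Assumed mapping from short codes to FLORES-200 codes; extend as needed.
NLLB_CODES: Dict[str, str] = {
    "en": "eng_Latn",
    "fr": "fra_Latn",
    "es": "spa_Latn",
    "de": "deu_Latn",
    "hi": "hin_Deva",
    "ta": "tam_Taml",
}

# Assumed checkpoint; not confirmed by this README.
MODEL_ID = "facebook/nllb-200-distilled-600M"


def to_nllb_code(short_code: str) -> str:
    """Map a short language code to its FLORES-200 code."""
    try:
        return NLLB_CODES[short_code.lower()]
    except KeyError:
        raise ValueError(f"Unsupported language code: {short_code!r}") from None


def translate(text: str, source: str, target: str) -> str:
    """Translate text with NLLB-200; the model is downloaded on first use."""
    # Imported lazily so the helpers above work without the ML stack installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_ID, src_lang=to_nllb_code(source)
    )
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        # Force the decoder to start in the target language.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(to_nllb_code(target)),
        max_length=512,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

In a server, the tokenizer and model would normally be loaded once at startup rather than on every call; they are loaded inside `translate` here only to keep the sketch short.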
- Start the Server
  `python app.py`
- Access the Application
  - Open your browser and navigate to: http://localhost:5001
- Upload Video
  - Click "Choose File" or drag and drop your video
  - Supported formats: MP4, AVI, MOV, MKV, WAV
  - Maximum file size: 500MB
- Configure Translation
  - Select source language (or use Auto-detect)
  - Select target language
  - Click "Process Translation"
- Download Result
  - Wait for processing to complete
  - Download your translated video with embedded subtitles
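The upload limits listed above (supported formats and the 500MB cap) could be enforced with a check along these lines. This is a sketch under assumptions: the names `ALLOWED_EXTENSIONS`, `MAX_FILE_SIZE`, `is_allowed_file`, and `validate_upload` are hypothetical, and 500MB is interpreted as 500 × 1024 × 1024 bytes.

```python
import os
from typing import Tuple

ALLOWED_EXTENSIONS = {"mp4", "avi", "mov", "mkv", "wav"}
MAX_FILE_SIZE = 500 * 1024 * 1024  # 500MB, read here as mebibytes (assumption)


def is_allowed_file(filename: str) -> bool:
    """Check the extension case-insensitively against the supported formats."""
    ext = os.path.splitext(filename)[1].lower().lstrip(".")
    return ext in ALLOWED_EXTENSIONS


def validate_upload(filename: str, size_bytes: int) -> Tuple[bool, str]:
    """Return (ok, message) for a candidate upload."""
    if not is_allowed_file(filename):
        return False, "Unsupported file type"
    if size_bytes > MAX_FILE_SIZE:
        return False, "File exceeds the 500MB limit"
    return True, "OK"
```

An extension check only filters obvious mistakes; it does not prove the file is a valid video.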
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Main web interface |
| GET | `/api/health` | Health check endpoint |
| POST | `/api/upload` | Upload and process video |
| GET | `/api/download/<filename>` | Download processed video |
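The endpoints in the table can also be reached from Python with the standard library alone. The helpers below are a hypothetical client sketch: the function names are made up for illustration, and since the response body of `/api/health` is not documented here, only the HTTP status is checked.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:5001"


def download_url(filename: str, base_url: str = BASE_URL) -> str:
    """Build the download URL, percent-encoding the filename."""
    return f"{base_url}/api/download/{quote(filename)}"


def is_healthy(base_url: str = BASE_URL, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses: server down or non-2xx reply.
        return False
```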
# Health check
curl http://localhost:5001/api/health
# Upload video (using curl)
curl -X POST -F "file=@video.mp4" \
-F "sourceLanguage=auto" \
-F "targetLanguage=en" \
    http://localhost:5001/api/upload
- Audio Extraction: Video file is processed to extract audio track
- Speech Recognition: Whisper transcribes audio to text in the original language
- Language Detection: Automatic detection of source language (if not specified)
- Summary Generation: Transcript is sent to Google Gemini to get a concise summary
- Translation: NLLB-200 translates the transcribed text to target language
- Subtitle Generation: SRT file is created with translated subtitles and timestamps
- Video Processing: FFmpeg embeds subtitles into the original video
- Delivery: User receives the translated video with embedded subtitles and a textual summary
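The subtitle generation and video processing steps above can be sketched as follows. The function names are hypothetical and this is not necessarily how `app.py` does it. The sketch assumes segments arrive as `(start, end, text)` tuples in seconds (Whisper's `transcribe()` result exposes `start`, `end`, and `text` for each segment) and that subtitles are burned into the frames with FFmpeg's `subtitles` filter, which requires an FFmpeg build with libass.

```python
from typing import Iterable, List, Tuple


def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp format HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start, end, text) tuples as numbered SRT blocks."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # Blocks are separated by a blank line, as the SRT format requires.
    return "\n".join(blocks)


def build_burn_in_command(video: str, srt: str, output: str) -> List[str]:
    """Build an FFmpeg argument list that renders the SRT onto the video."""
    return [
        "ffmpeg",
        "-y",                       # overwrite the output file if it exists
        "-i", video,                # input video
        "-vf", f"subtitles={srt}",  # burn subtitles in (assumed approach)
        "-c:a", "copy",             # keep the original audio stream as-is
        output,
    ]
```

The argument list can be passed to `subprocess.run(cmd, check=True)`. Paths containing colons, quotes, or backslashes need extra escaping inside the `subtitles=` filter argument, which this sketch does not handle.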
CrossLingualAI/
│
├── app.py              # Flask backend server
├── index.html          # Frontend web interface
├── requirements.txt    # Python dependencies
├── run.sh              # Setup script (macOS/Linux)
├── run.bat             # Setup script (Windows)
├── README.md           # Project documentation
│
├── uploads/            # Temporary upload directory
└── outputs/            # Processed video output directory
By default, the server runs on port 5001. To change this, modify app.py:
app.run(debug=True, host='0.0.0.0', port=5001)  # Change port here
Models are automatically downloaded on first run. To use different Whisper models:
whisper_model = whisper.load_model("base")  # Options: tiny, base, small, medium, large
Port Already in Use
# Kill process on port 5001
lsof -ti :5001 | xargs kill -9
FFmpeg Not Found
# macOS
brew install ffmpeg
# Linux
sudo apt-get install ffmpeg
# Windows
# Download from https://ffmpeg.org/download.html
Out of Memory
- Close other applications
- Use smaller Whisper model (tiny/base instead of large)
- Process shorter videos
Models Not Downloading
- Check internet connection
- Verify disk space (need ~2GB free)
- Check HuggingFace access
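The disk space check can be scripted before the first run. This is a small sketch with a hypothetical helper name; it uses the ~2GB figure from the list above and assumes the model caches live under the user's home directory, which is the default for both Whisper and Hugging Face.

```python
import os
import shutil

REQUIRED_BYTES = 2 * 1024**3  # ~2GB, taken from the troubleshooting note


def has_enough_disk_space(
    path: str = os.path.expanduser("~"),
    required_bytes: int = REQUIRED_BYTES,
) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes


if __name__ == "__main__":
    if not has_enough_disk_space():
        print("WARNING: less than ~2GB free; model downloads may fail")
```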
- Processing Time: ~2-5 minutes for a 1-minute video
- Memory Usage: ~3-4GB during processing
- Supported Languages: 20+ languages via NLLB-200
- Video Formats: MP4, AVI, MOV, MKV, WAV
- Uploaded files are temporarily stored and automatically deleted after processing
- No user data is permanently stored
- All processing happens server-side
- Maximum file size limit: 500MB
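One way to get the "temporarily stored and automatically deleted" behavior is a context manager that removes the file even when processing fails. This is an illustrative sketch; `temporary_upload` is a hypothetical helper and the project may handle cleanup differently.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_upload(data: bytes, suffix: str = ".mp4") -> Iterator[str]:
    """Write uploaded bytes to a temp file and delete it on exit."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path  # caller processes the file while it exists
    finally:
        # Runs on success and on exceptions, so no upload is left behind.
        if os.path.exists(path):
            os.remove(path)
```

Usage would look like `with temporary_upload(file_bytes) as path: process(path)`, where `process` stands for whatever handles the video.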
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is open-source and available under the MIT License.
Raghul Pranesh K V
- GitHub: @raghulpranxsh
- LinkedIn: raghulpraneshkv
- OpenAI for Whisper speech recognition model
- Facebook AI for NLLB-200 translation model
- HuggingFace for Transformers and Sentence Transformers
- Flask community for the excellent web framework