A simplified, general-purpose voice assistant backend you can use as a starting point to build your own custom voice-driven applications.
This project lets you ask questions by voice and get AI-generated audio replies, demonstrating streaming audio processing, speech-to-text, LLM-powered responses, and text-to-speech output.
Demo video: `demo.mp4`

## Features
- Accepts voice input and generates AI audio replies
- Streams audio for low latency
- Modular FastAPI backend
- Easy to extend for your own use case
## Use cases

- Personal voice assistants
- Customer support bots
- Interactive voice apps
- Smart home interfaces
## Setup

⚠️ This project requires Python 3.11.
Clone the repository:

```bash
git clone https://github.com/your-username/your-repo-name.git
cd your-repo-name
```

Make sure you're using Python 3.11:

```bash
python --version
# Should output: Python 3.11.x
```

If not, install it from [python.org](https://www.python.org/).

Create and activate a virtual environment, then install the dependencies:

```bash
python -m venv venv
source venv/bin/activate   # macOS/Linux
venv\Scripts\activate      # Windows

pip install --upgrade pip
pip install -r requirements.txt
```

This project needs ffmpeg for audio processing:
- macOS

  ```bash
  brew install ffmpeg
  ```

- Ubuntu/Linux

  ```bash
  sudo apt update
  sudo apt install ffmpeg
  ```

- Windows

  Download from [FFmpeg.org](https://ffmpeg.org/) and add it to your PATH.
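To verify that ffmpeg is actually reachable from Python before starting the server, a quick check like the following can help (the `ffmpeg_available` helper is illustrative, not part of this project):

```python
import shutil
import subprocess

def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on the PATH."""
    return shutil.which("ffmpeg") is not None

if ffmpeg_available():
    # Print the first line of the version banner, e.g. "ffmpeg version 6.x ..."
    out = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True)
    print(out.stdout.splitlines()[0])
else:
    print("ffmpeg not found - install it before starting the server")
```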
Get your Mistral API key by signing up and following their API Quickstart Guide.
Create a `.env` file in the project root:

```env
API_KEY=your_mistral_api_key_here
```
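The backend presumably loads this file with a library such as python-dotenv; for illustration, here is a stdlib-only sketch of how `.env` parsing works (the `load_env` helper is hypothetical, and it ignores quoting and multiline values):

```python
from pathlib import Path

def load_env(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines from a .env-style file."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and malformed lines
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

# Example: write a throwaway file and read the key back
Path(".env.example").write_text("API_KEY=your_mistral_api_key_here\n")
print(load_env(".env.example")["API_KEY"])
```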
This project uses eSpeak to enable additional Coqui TTS voice models.
💻 Windows

- Download and install eSpeak for Windows: 👉 https://espeak.sourceforge.net/
- After installation, add the `espeak/command-line` folder to your PATH environment variable so the `espeak` command is available in the terminal.
🐧 Linux (Ubuntu/Debian)

```bash
sudo apt update
sudo apt install espeak
```

🍎 macOS

```bash
brew install espeak
```

## Run the server

```bash
uvicorn src.main:app --reload --host 0.0.0.0 --port 8000
```

Access it in your browser or API client at:

```
http://localhost:8000
```
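To confirm the server is reachable from a script, a quick stdlib check works; nothing here assumes any endpoint beyond the root URL, and `server_up` is an illustrative helper, not part of this project:

```python
import urllib.request
import urllib.error

def server_up(url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if the backend answers an HTTP request at `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except (urllib.error.URLError, OSError):
        return False

print(server_up())  # False until uvicorn is running
```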
## Usage

- Visit the `/ui` endpoint in your browser: `http://localhost:8000/ui`
- Use the web interface to record your question.
- Hear an AI-generated audio reply instantly.
## How it works

- Gradio web UI streams audio to the FastAPI backend.
- Voice Activity Detection (VAD) finds when speech starts and stops.
- Detected speech is transcribed to text.
- Text is sent to the LLM (Mistral) to generate a response.
- The response is converted to speech with TTS.
- Audio reply streams back to the user in real time.
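The pipeline above can be sketched end to end. Everything in this snippet is illustrative: the function names (`detect_speech`, `transcribe`, `ask_llm`, `synthesize`) are hypothetical stand-ins for the real VAD, STT, Mistral, and Coqui TTS components, with dummy bodies so the flow is runnable:

```python
from typing import Iterator

def detect_speech(audio: bytes) -> bytes:
    """VAD stand-in: trim the stream down to the span containing speech."""
    return audio.strip(b"\x00")

def transcribe(speech: bytes) -> str:
    """STT stand-in: turn speech audio into text."""
    return "what is the weather"

def ask_llm(prompt: str) -> str:
    """LLM stand-in: the real backend calls Mistral here."""
    return f"Answer to: {prompt}"

def synthesize(text: str) -> bytes:
    """TTS stand-in: turn the answer text back into audio bytes."""
    return text.encode()

def stream_chunks(audio: bytes, size: int = 4) -> Iterator[bytes]:
    """Yield the reply in small chunks so playback can start early."""
    for i in range(0, len(audio), size):
        yield audio[i:i + size]

def answer(audio_in: bytes) -> bytes:
    speech = detect_speech(audio_in)   # 1. find speech
    text = transcribe(speech)          # 2. speech -> text
    reply = ask_llm(text)              # 3. text -> LLM response
    audio_out = synthesize(reply)      # 4. response -> audio
    return b"".join(stream_chunks(audio_out))  # 5. stream back

print(answer(b"\x00\x00raw-mic-bytes\x00"))
```

In the real backend, step 5 would return the generator to a streaming HTTP response rather than joining the chunks.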
## Extending

- Swap in your preferred LLM
- Customize prompts or dialogue logic
- Add authentication and logging
- Containerize with Docker
- Deploy on AWS Lambda, ECS, etc.
Built to help anyone bootstrap their own voice assistant with clean, minimal code.
This project is licensed under the MIT License. See the LICENSE file for details.
Open an issue or reach out on LinkedIn.