You must supply your own API keys for every service; instructions for obtaining and configuring them are included below.
Real-time voice AI assistant built with Python, LiveKit, Deepgram, GPT-4.1 & Cartesia TTS
Features autonomous actions, medical guidance, strong privacy, and fully customizable voice cloning technology
Features • Installation • Voice Cloning • Usage • Architecture • Contributing
YouTube demo video: https://youtu.be/sKE2JrOVLhs
Vienna is a cutting-edge voice AI assistant powered by the LiveKit Agents framework, designed specifically for medical interactions with Indian English support. Built by Avijit Shil, Vienna combines advanced speech recognition, natural language processing, and text-to-speech synthesis to deliver conversational, witty, and highly capable AI assistance.
Vienna features voice cloning technology that lets it speak in any voice you want, including your own. It is a next-generation multilingual voice assistant designed for real-world chaos. Using Cartesia's Sonic 3 TTS engine, Vienna can be customized to sound exactly like you, a loved one, or any voice profile you create. The current deployment uses the creator's own cloned voice for authentic, personalized interactions.
Perfect for agentic healthcare and productivity applications, deployable anywhere with enterprise-grade privacy and security. Vienna handles medical conversations in a human-like manner and responds faster than a doctor typically can. It also breaks the language barrier with support for Indian regional languages (Assamese, Bengali, Tamil, Telugu, Malayalam, and 10+ more) alongside international languages (English, Chinese, Spanish, Arabic, and 90+ more). With advanced Background Voice Cancellation (BVC), Vienna isolates your voice even in chaotic environments like crowded markets, busy streets, or noisy offices, making it a noise-resistant, hyper-realistic AI companion.
Perfect for global communication, healthcare, education & customer service, deployable anywhere with enterprise-grade privacy.
| Feature | Description |
|---|---|
| 🏥 Medical-Grade STT | Deepgram Nova 3 Medical model with 95%+ accuracy on medical terminology |
| 🇮🇳 Indian English Support | Optimized for Indian accents and dialects (en-IN) with regional variations |
| 🧠 GPT-4.1 Mini Intelligence | Fast, efficient responses with advanced reasoning and medical knowledge |
| 🎭 Revolutionary Voice Cloning | Clone ANY voice in 2-3 minutes - your own, family, friends, or custom personas |
| 🗣️ Natural Voice Synthesis | Cartesia Sonic 3 TTS with fully customizable voice options and emotions |
| 🛠️ Autonomous Actions | 10+ integrated tools: web search, email, weather, location services, and more |
| ⚡ High-Performance VAD | Silero VAD/TurnDetector with <50ms latency and 98%+ accuracy |
| 🎙️ Real-time Processing | End-to-end latency <500ms with preemptive generation technology |
| 🔇 Advanced Noise Cancellation | LiveKit BVC removes background noise, echo, and ambient sounds |
| 🔒 Enterprise Security | Zero data retention, encrypted WebRTC, HIPAA-ready architecture |
| Component | Technology | Description |
|---|---|---|
| Speech-to-Text | Deepgram Nova 3 Medical | Medical-grade transcription with specialized vocabulary for healthcare |
| Text-to-Speech | Cartesia Sonic 3 | Custom voice cloning - speaks in ANY voice you want |
| Voice Activity Detection | Silero VAD/TurnDetector | High-performance detection with minimal latency (<50ms) |
| Noise Cancellation | LiveKit BVC | Real-time background noise, echo, and ambient sound removal |
| Voice Customization | ✅ Fully Customizable | Clone any voice in 2-3 minutes or use preset professional voices |
| Language Support | Multi-language | Indian English (en-IN), US English (en-US), UK English (en-GB), and more |
Vienna (also known as Medpulse) supports 100+ languages with native pronunciation and cultural context awareness.
| Language | Native Name | Romanization | Status |
|---|---|---|---|
| Assamese | অসমীয়া | Ôxômiya | ✅ Full Support |
| Bengali | বাংলা | Bangla | ✅ Full Support |
| Gujarati | ગુજરાતી | Gujarātī | ✅ Full Support |
| Kannada | ಕನ್ನಡ | Kannaḍa | ✅ Full Support |
| Malayalam | മലയാളം | Malayāḷam | ✅ Full Support |
| Marathi | मराठी | Marāṭhī | ✅ Full Support |
| Nepali | नेपाली | Nepālī | ✅ Full Support |
| Punjabi | ਪੰਜਾਬੀ | Pañjābī | ✅ Full Support |
| Pashto | پښتو | Pax̌tō | ✅ Full Support |
| Sanskrit | संस्कृतम् | Saṃskṛtam | ✅ Full Support |
| Sindhi | سنڌي | Sindhī | ✅ Full Support |
| Sinhala | සිංහල | Siṁhala | ✅ Full Support |
| Tamil | தமிழ் | Tamiḻ | ✅ Full Support |
| Telugu | తెలుగు | Telugu | ✅ Full Support |
| Urdu | اردو | Urdū | ✅ Full Support |
Vienna currently uses the creator's own cloned voice for authentic interactions:
- Voice ID: abc6eacf-2626-4d87-902b-7b49c2dcae16
- Characteristics: Natural Indian English accent, warm tone, conversational style
- Use Case: Medical consultations, health advice, general assistance
- ✅ Conversational Intelligence - GPT-4.1 Mini delivers fast, accurate, contextually-aware responses
- ✅ Medical Knowledge - Specialized training on medical terminology, symptoms, and healthcare guidance
- ✅ Context Awareness - Maintains full conversation history and understands references to previous topics
- ✅ Preemptive Generation - Begins formulating response while user is still speaking for instant replies
- ✅ Dry Humor & Personality - Engaging, witty responses with a sophisticated sense of humor (never sarcastic or rude)
- ✅ Multi-turn Reasoning - Handles complex, multi-step queries with logical follow-through
- ✅ Privacy-First Design - All processing in real-time, zero data retention
Vienna can perform various autonomous actions through integrated function tools. All tools require user confirmation for sensitive operations.
✅ open_website(url)               # Open any website in browser
✅ search_web(query)               # Perform real-time web searches
✅ get_news(topic)                 # Fetch latest news headlines and articles
✅ get_stock_price(symbol)         # Check stock/crypto prices (e.g., AAPL, BTC)
✅ get_datetime()                  # Current date and time with timezone
✅ lookup_weather(location)        # Real-time weather information for any location
✅ send_email(to, subject, body)   # Send emails with confirmation
✅ read_emails(count)              # Read unread emails (privacy-protected)
✅ find_nearby_places(type)        # Find restaurants, ATMs, hospitals, pharmacies, etc.

Coming Soon:
- Calendar integration
- Reminders and alarms
- File management
- Smart home control
- Appointment scheduling
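The confirmation requirement for sensitive tools mentioned above could be sketched roughly as follows. This is a minimal illustration, not Vienna's actual implementation: `SENSITIVE_TOOLS`, `run_tool`, and the stand-in tool functions are hypothetical names, and the real agent registers its tools through LiveKit's function-tool mechanism instead of a plain dictionary.

```python
from typing import Callable, Dict, Set

# Hypothetical set of tools treated as sensitive (an assumption for this sketch).
SENSITIVE_TOOLS: Set[str] = {"send_email", "read_emails"}


def run_tool(
    name: str,
    tools: Dict[str, Callable[..., str]],
    confirm: Callable[[str], bool],
    **kwargs: str,
) -> str:
    """Run a tool by name, asking for confirmation first if it is sensitive."""
    if name not in tools:
        return f"Unknown tool: {name}"
    if name in SENSITIVE_TOOLS and not confirm(name):
        return f"Cancelled: {name} was not confirmed."
    return tools[name](**kwargs)


def fake_send_email(to: str, subject: str, body: str) -> str:
    # Stand-in for a real email tool; it only formats a message.
    return f"Email to {to} sent with subject '{subject}'."


def fake_get_datetime() -> str:
    # Stand-in for a non-sensitive tool.
    return "datetime requested"


TOOLS = {"send_email": fake_send_email, "get_datetime": fake_get_datetime}
```

Passing the `confirm` callback in as a parameter keeps the gate testable: in a voice session it would ask the user aloud, while a test can supply a simple lambda.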
┌─────────────────────────────────────────────────────────────────┐
│ VIENNA AI PIPELINE │
│ Real-time Voice Processing │
└─────────────────────────────────────────────────────────────────┘
User Voice Input (Microphone)
↓
┌─────────────────────────────────┐
│ Noise Cancellation (LiveKit BVC)│ ← Removes background noise, echo
└────────┬────────────────────────┘
↓
┌─────────────────────────────────┐
│ Voice Activity Detection (VAD) │ ← Silero VAD/TurnDetector
│ Latency: <50ms │ Detects speech start/end
└────────┬────────────────────────┘
↓
┌─────────────────────────────────┐
│ Speech-to-Text (Deepgram) │ ← Nova 3 Medical Model
│ Model: nova-3-medical │ Language: en-IN
│ Accuracy: 95%+ on medical terms │
└────────┬────────────────────────┘
↓
┌─────────────────────────────────┐
│ LLM Processing (GPT-4.1 Mini) │ ← Natural language understanding
│ + Function Tool Calling │ Context-aware reasoning
│ Latency: ~200ms │ Autonomous action selection
└────────┬────────────────────────┘
↓
┌─────────────────────────────────┐
│ Text-to-Speech (Cartesia) │ ← Sonic 3 Engine
│ Voice: Custom Cloned Voice │ Voice ID: abc6eacf-...
│ Latency: ~150ms │ Natural intonation
└────────┬────────────────────────┘
↓
Voice Output (Speakers/Headphones)
Total End-to-End Latency: <500ms
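As a rough sanity check on the figures in the diagram, the quoted per-stage latencies can be summed against the 500 ms target. The sketch below is simple arithmetic under the assumption that the stages run strictly one after another; in a streaming pipeline the stages overlap, so the sum is a conservative estimate. The function name is hypothetical.

```python
# Stage latencies quoted in the pipeline diagram, in milliseconds.
STAGE_LATENCY_MS = {
    "vad": 50,   # upper bound quoted for voice activity detection
    "llm": 200,  # approximate LLM latency
    "tts": 150,  # approximate TTS latency
}
TARGET_END_TO_END_MS = 500


def remaining_budget_ms(stages: dict, target_ms: int) -> int:
    """Return the time left for stages without a quoted figure (STT, network)."""
    return target_ms - sum(stages.values())


if __name__ == "__main__":
    print(remaining_budget_ms(STAGE_LATENCY_MS, TARGET_END_TO_END_MS))  # prints 100
```

Under these assumptions, roughly 100 ms remains for speech-to-text and network transit.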
- Agent Session Manager - Orchestrates the entire voice pipeline with LiveKit
- STT Engine - Deepgram Nova 3 Medical converts speech to text with medical vocabulary
- LLM Core - GPT-4.1 Mini processes queries and decides when to use autonomous tools
- Function Tools - 10+ autonomous action capabilities with safety guardrails
- TTS Engine - Cartesia Sonic 3 generates natural-sounding voice with cloned voice model
- Metrics Collector - Real-time monitoring of performance, usage, and quality metrics
- End-to-End Latency: <500ms (from speech to response)
- STT Accuracy: 95%+ on medical terminology
- VAD Latency: <50ms (voice activity detection)
- TTS Quality: Natural, human-like with emotional expressiveness
- Concurrent Users: Scales horizontally with LiveKit infrastructure
- Uptime: 99.9%+ with proper deployment
- Python 3.8 or higher (Python 3.10+ recommended)
- LiveKit Cloud account or self-hosted LiveKit server
- API keys for Deepgram, OpenAI, and Cartesia
git clone https://github.com/Avijitshil/Vienna.git
cd Vienna
pip install -r requirements.txt

Required packages:
livekit-agents>=0.8.0
livekit-plugins-deepgram
livekit-plugins-openai
livekit-plugins-cartesia
livekit-plugins-silero
python-dotenv

Or install manually:
pip install livekit-agents livekit-plugins-deepgram livekit-plugins-openai livekit-plugins-cartesia livekit-plugins-silero python-dotenv

Create a .env.local file in the project root:
# LiveKit Configuration
LIVEKIT_URL=wss://your-livekit-server.livekit.cloud
LIVEKIT_API_KEY=your_api_key_here
LIVEKIT_API_SECRET=your_api_secret_here
# AI Service API Keys
DEEPGRAM_API_KEY=your_deepgram_api_key
OPENAI_API_KEY=your_openai_api_key
CARTESIA_API_KEY=your_cartesia_api_key

Getting API Keys:
- LiveKit: Sign up at livekit.io → Create project → Get credentials
- Deepgram: Sign up at deepgram.com → Get API key (free tier available)
- OpenAI: Sign up at platform.openai.com → Create API key
- Cartesia: Sign up at cartesia.ai → Get API key
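Before starting the agent, it can help to confirm that every key from the `.env.local` template is actually set. The sketch below is a standard-library stand-in for what `python-dotenv` normally does; the function names `parse_env_text` and `missing_keys` are hypothetical, and the parser handles only plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path
from typing import Dict, List, Mapping

# Keys taken from the .env.local template in this README.
REQUIRED_KEYS = [
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "DEEPGRAM_API_KEY",
    "OPENAI_API_KEY",
    "CARTESIA_API_KEY",
]


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: Mapping[str, str]) -> List[str]:
    """Return required keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]


if __name__ == "__main__":
    path = Path(".env.local")
    file_values = parse_env_text(path.read_text()) if path.exists() else {}
    # Real environment variables take precedence over the file.
    merged = {**file_values, **os.environ}
    missing = missing_keys(merged)
    print("Missing keys:", ", ".join(missing) if missing else "none")
```

Note that this check only detects absent or empty values; a placeholder such as `your_api_key_here` would still pass.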
Start Vienna in console mode:
python src/agent.py console

You should see:
✅ Environment loaded successfully
✅ Connecting to LiveKit...
✅ Agent started and listening for voice input
🎤 Vienna is ready! Start speaking...
Voice cloning makes Vienna feel personal and authentic. Instead of a generic AI voice, Vienna can speak exactly like you, a trusted healthcare provider, or any voice that makes users comfortable.
Requirements:
- 30 seconds to 2 minutes of clear speech
- Quiet environment (minimal background noise)
- Natural speaking style (not robotic or overly formal)
- Good quality microphone (built-in laptop mic works fine)
Tips for Best Results:
- Speak naturally and conversationally
- Include variation in tone and emotion
- Read a paragraph from a book or article
- Don't pause too long between sentences
- Visit Cartesia Voice Lab
- Sign up / Log in to your account
- Click "Create Voice" or "Clone Voice"
- Upload your audio recording
- Wait 30-60 seconds for processing
- Preview the cloned voice
- Copy your Voice ID (format: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
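A copied Voice ID can be checked against this format before it is pasted into the configuration. The sketch below assumes that IDs are hexadecimal groups of 8-4-4-4-12 characters, as the ID shown earlier in this README suggests; `is_valid_voice_id` is a hypothetical helper, not part of Cartesia's SDK.

```python
import re

# Eight-four-four-four-twelve hexadecimal groups (assumed format).
VOICE_ID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def is_valid_voice_id(value: str) -> bool:
    """Return True when the value matches the expected Voice ID layout."""
    return VOICE_ID_PATTERN.fullmatch(value.strip()) is not None
```

A check like this also catches an unreplaced placeholder such as `your-new-voice-id-here`.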
Edit src/agent.py and update the TTS configuration:
# Find this section in the entrypoint() function:
tts=inference.TTS(
    model="cartesia/sonic-3",
    voice="abc6eacf-2626-4d87-902b-7b49c2dcae16",  # Replace with YOUR voice ID
)

Replace the voice ID with your new cloned voice ID:
tts=inference.TTS(
    model="cartesia/sonic-3",
    voice="your-new-voice-id-here",  # Your cloned voice ID from Cartesia
)

Restart Vienna:
python src/agent.py console

Vienna now speaks in YOUR voice! 🎉
Create different voices for different scenarios:
# Professional medical voice for consultations
MEDICAL_VOICE = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
# Friendly casual voice for general chat
CASUAL_VOICE = "yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy"
# Urgent/serious voice for emergencies
URGENT_VOICE = "zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz"
# Switch based on context
voice_id = MEDICAL_VOICE if is_medical_query else CASUAL_VOICE

Fine-tune voice characteristics:
tts=inference.TTS(
    model="cartesia/sonic-3",
    voice="your-voice-id",
    # Additional parameters (if supported by your SDK version):
    speed=1.0,          # 0.5 to 2.0 (1.0 = normal)
    pitch=1.0,          # 0.5 to 2.0 (1.0 = normal)
    emotion="neutral",  # Options: neutral, happy, sad, urgent, calm
)

If you don't want to clone a voice, use Cartesia's preset voices:
Available Categories:
- Medical Professional: Warm, authoritative, trustworthy
- Young Adult: Energetic, friendly, casual
- Elderly: Wise, calm, reassuring
- Gender-Neutral: Professional, clear, unbiased
Example Voice IDs:
# Professional female medical voice
voice="79a125e8-cd45-4c13-8a67-188112f4dd22"
# Friendly male casual voice
voice="a0e99841-438c-4a64-b679-ae501e7d6091"
# Neutral professional voice
voice="2ee87190-8f84-4925-97da-e52547f9462c"

Browse all voices at: cartesia.ai/voices
Important Guidelines:
✅ DO:
- Clone your own voice
- Clone voices with explicit written consent
- Use for personal, educational, or authorized medical purposes
- Respect voice ownership and intellectual property
❌ DON'T:
- Clone someone's voice without permission
- Impersonate others maliciously
- Use cloned voices for fraud or deception
- Share voice models without authorization
Vienna's Position: We strongly advocate for ethical AI voice cloning. Always obtain consent before cloning someone's voice, and use the technology responsibly for beneficial purposes.
- Run the agent in console mode:
  python src/agent.py console
- Connect to the LiveKit room using:
  - Web browser (LiveKit web client)
  - Mobile app (iOS/Android LiveKit app)
  - Desktop application
  - Custom integration
- Start speaking - Vienna will respond in real time with the cloned voice!
Medical Query:
User: "What are the symptoms of diabetes?"
Vienna: "Common symptoms of diabetes include increased thirst,
frequent urination, unexplained weight loss, extreme
fatigue, blurred vision, and slow-healing wounds.
Type 1 symptoms often appear suddenly, while Type 2
develops gradually. Would you like more details about
any specific symptom?"
Autonomous Action - Web Search:
User: "Search for nearby hospitals"
Vienna: "Sure, let me find nearby hospitals for you...
I've found several options: City General Hospital
is 2.3 km away with 4.5 star rating, Care Medical
Center is 3.1 km with emergency services, and
Apollo Hospital is 4.7 km with specialized cardiology.
Would you like directions to any of these?"
Weather Information:
User: "What's the weather like in Mumbai?"
Vienna: "The weather in Mumbai right now is sunny with
partly cloudy skies. Temperature is 27 degrees
Celsius with 65% humidity. There's a light breeze
from the west. It's a pleasant day overall!"
Complex Medical Reasoning:
User: "I've been having chest pain and shortness of breath"
Vienna: "⚠️ These symptoms can be serious. Chest pain combined
with shortness of breath could indicate several
conditions ranging from anxiety to cardiac issues.
I strongly recommend you seek immediate medical
attention or call emergency services right away.
Do you want me to help you find the nearest emergency
room?"
Email Assistance:
User: "Send an email to my doctor about tomorrow's appointment"
Vienna: "I can help you draft that email. What would you like
to say in the message? Should I mention a specific
time or any concerns you'd like to discuss?"
python src/agent.py console
- Direct terminal interaction
- Real-time logging and debugging
- Best for development and testing
python src/agent.py
- Runs as a persistent service
- Connects to LiveKit cloud
- Handles multiple concurrent users
- Enterprise-ready with metrics
docker build -t vienna-ai .
docker run -d --env-file .env.local vienna-ai

Vienna currently uses the creator's cloned voice for authentic, personalized interactions.
To use your own voice:
- Visit Cartesia Voice Lab
- Clone your voice (takes 2-3 minutes with 30 seconds of audio)
- Copy your custom voice ID
- Update in src/agent.py:
tts=inference.TTS(
    model="cartesia/sonic-3",
    voice="your-custom-voice-id",  # Replace with your Voice ID
)

Available Voice Options:
- ✅ Custom cloned voices (your own voice or anyone with consent)
- ✅ Preset professional voices (50+ options)
- ✅ Multiple accent options (Indian, American, British, Australian, etc.)
- ✅ Adjustable parameters (speed, pitch, emotion)
- ✅ Gender-neutral voices (inclusive and professional)
Change the speech recognition language:
stt=inference.STT(
    model="deepgram/nova-3-medical",
    language="en-US",  # Options: en-IN, en-US, en-GB, en-AU, etc.
)

Supported Languages:
- en-IN - Indian English (default for Vienna)
- en-US - American English
- en-GB - British English
- en-AU - Australian English
- And many more...
Change the AI model for different capabilities:
# GPT-4.1 Mini (default) - Fast and cost-effective
llm=inference.LLM(model="openai/gpt-4.1-mini")
# GPT-4 - More complex reasoning, higher cost
llm=inference.LLM(model="openai/gpt-4")
# GPT-4 Turbo - Balance of speed and capability
llm=inference.LLM(model="openai/gpt-4-turbo")

Configure VAD sensitivity:
# In the prewarm function, adjust VAD parameters (durations are in seconds)
vad = silero.VAD.load(
    min_speech_duration=0.1,      # Minimum speech duration (100 ms)
    prefix_padding_duration=0.3,  # Padding kept before detected speech (300 ms)
    activation_threshold=0.5,     # Detection threshold (0.0-1.0)
)

Customize which autonomous tools are available:
# Comment out tools you don't want to use
class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="...")

    # @function_tool  # Commented out, so send_email is disabled; uncomment to enable
    # async def send_email(self, ...):
    #     pass

Vienna automatically collects and logs comprehensive performance metrics:
- Latency Metrics
  - End-to-end response time (typically <500ms)
  - STT processing time
  - LLM inference time
  - TTS generation time
  - Network latency
- Quality Metrics
  - STT word error rate
  - VAD accuracy
  - TTS naturalness score
  - User satisfaction indicators
- Usage Metrics
  - Token consumption (LLM)
  - API call counts
  - Concurrent user sessions
  - Tool invocation frequency
- Cost Metrics
  - Per-session cost breakdown
  - Monthly spending projections
  - Cost per user interaction
Metrics are logged in real-time:
# Metrics are automatically collected
usage_collector = metrics.UsageCollector()

@session.on("metrics_collected")
def _on_metrics_collected(ev: MetricsCollectedEvent):
    metrics.log_metrics(ev.metrics)
    usage_collector.collect(ev.metrics)

# Get summary
summary = usage_collector.get_summary()
logger.info(f"Session Usage: {summary}")

Session Metrics:
- Duration: 5m 32s
- Total Interactions: 12
- Avg Response Time: 387ms
- STT Accuracy: 96.3%
- Token Usage: 2,847 tokens
- Estimated Cost: $0.08
- Tools Invoked: 3 (web_search, weather, datetime)
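A summary like the one above could be assembled from per-turn measurements. The sketch below is only an illustration: the `SessionStats` class and its field names are hypothetical and are not part of LiveKit's metrics API, which exposes its own `UsageCollector`.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List


@dataclass
class SessionStats:
    """Collects per-turn figures; all names here are hypothetical."""
    response_times_ms: List[float] = field(default_factory=list)
    tokens: int = 0
    tools: List[str] = field(default_factory=list)

    def record_turn(self, response_ms: float, tokens: int, tool: str = "") -> None:
        # Store one interaction: its latency, token count, and optional tool name.
        self.response_times_ms.append(response_ms)
        self.tokens += tokens
        if tool:
            self.tools.append(tool)

    def summary(self) -> dict:
        turns = len(self.response_times_ms)
        return {
            "interactions": turns,
            "avg_response_ms": round(mean(self.response_times_ms)) if turns else 0,
            "token_usage": self.tokens,
            "tools_invoked": len(self.tools),
        }
```

Cost estimation is left out on purpose, since it depends on provider pricing that this README does not state.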
Vienna is built with privacy-first design principles:
- ✅ Zero Data Retention - All conversations processed in real-time, nothing stored
- ✅ No Conversation Logs - No audio recordings or transcripts saved
- ✅ Ephemeral Processing - Data deleted immediately after response
- ✅ No Training Data - Your conversations never used to train AI models
- ✅ Encrypted Communication - End-to-end encryption via WebRTC
- ✅ Secure API Keys - Environment variables, never hardcoded
- ✅ User Confirmation - Required for sensitive actions (email, file access)
- ✅ HIPAA-Ready Architecture - Compliant with healthcare data regulations
- ✅ Rate Limiting - Prevents abuse and unauthorized access
- ✅ Audit Trails - Optional logging for compliance (disabled by default)
- HIPAA Compliance Ready - Architecture supports HIPAA requirements
- PHI Protection - No Protected Health Information stored or logged
- Consent-Based - Users control what information is shared
- Secure Integrations - Third-party tools require explicit authorization
- Never share API keys publicly or in version control
- Use environment variables for all sensitive configuration
- Enable rate limiting in production deployments
- Regular security audits of dependencies and configurations
- User education about what Vienna can and cannot access
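The rate-limiting recommendation above could be implemented in many ways; one common option is a token bucket. The sketch below is a generic, standard-library example and not Vienna's actual mechanism. The `TokenBucket` class and its `clock` parameter are assumptions introduced for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, capacity: int, rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In a deployment, one bucket per user or per API key would typically be kept, and a request rejected whenever `allow()` returns `False`.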
Contributions are welcome! We'd love your help making Vienna even better.
- Fork the repository
  git clone https://github.com/yourusername/Vienna.git
  cd Vienna
- Create a feature branch
  git checkout -b feature/AmazingFeature
- Make your changes
  - Write clean, documented code
  - Follow existing code style
  - Add tests if applicable
  - Update documentation
- Commit your changes
  git commit -m 'Add some AmazingFeature'
- Push to your branch
  git push origin feature/AmazingFeature
- Open a Pull Request
  - Describe your changes clearly
  - Reference any related issues
  - Wait for review and feedback
- Code Style: Follow PEP 8 for Python code
- Documentation: Update README for new features
- Testing: Add tests for new functionality
- Commits: Use clear, descriptive commit messages
- Issues: Open issues for bugs or feature requests
- 🌐 Language Support - Add support for more languages
- 🎤 Voice Models - Contribute new voice profiles
- 🛠️ New Tools - Add autonomous action capabilities
- 📚 Documentation - Improve guides and examples
- 🐛 Bug Fixes - Report and fix issues
- ⚡ Performance - Optimize latency and resource usage
All autonomous tools are documented below. Vienna automatically invokes these based on user requests.
Opens a specified website URL in the user's default browser.
Parameters:
- url (str): Website URL to open (must include protocol: http:// or https://)
Returns:
- Confirmation message indicating the website is being opened
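Because `url` must include a protocol while spoken requests usually omit it, a small normalization step could run before the tool is called. The sketch below is an assumption about how that might be done; `normalize_url` is a hypothetical helper and is not taken from Vienna's source.

```python
from urllib.parse import urlparse


def normalize_url(raw: str) -> str:
    """Prefix https:// when no protocol is given; reject anything else."""
    candidate = raw.strip()
    if "://" not in candidate:
        candidate = "https://" + candidate
    parsed = urlparse(candidate)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Unsupported or malformed URL: {raw!r}")
    return candidate
```

This only checks the scheme and the presence of a host; it is not a security filter.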
Example:
User: "Open google.com"
Vienna calls: open_website("https://google.com")
Vienna says: "Opening google.com for you now."

Performs a web search for the given query and returns relevant results.
Parameters:
- query (str): Search query string
Returns:
- Search results summary and top links
Example:
User: "Search for diabetes symptoms"
Vienna calls: search_web("diabetes symptoms")
Vienna says: "Here are the search results for diabetes symptoms..."

Retrieves the current date and time.
Parameters: None
Returns:
- Formatted datetime string (e.g., "02:30 PM on Thursday, November 13, 2025")
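A string in this shape can be produced with `strftime`. The sketch below shows one possible format string; `format_datetime` is a hypothetical helper, and the weekday and month names assume the default English locale.

```python
from datetime import datetime


def format_datetime(moment: datetime) -> str:
    """Format a datetime as time, weekday, month, day, and year."""
    return moment.strftime("%I:%M %p on %A, %B %d, %Y")


if __name__ == "__main__":
    # prints "02:30 PM on Thursday, November 13, 2025"
    print(format_datetime(datetime(2025, 11, 13, 14, 30)))
```

The format does not include a timezone; adding `%Z` would require a timezone-aware datetime.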
Example:
User: "What time is it?"
Vienna calls: get_datetime()
Vienna says: "It's 02:30 PM on Thursday, November 13, 2025."

Gets current weather information for a specified location.
Parameters:
- location (str): City or location name (e.g., "Mumbai", "New York")
Returns:
- Weather information including temperature, conditions, humidity
Example:
User: "What's the weather in Mumbai?"
Vienna calls: lookup_weather("Mumbai")
Vienna says: "The weather in Mumbai is sunny, 27°C with 65% humidity."

Fetches latest news headlines and articles about a specific topic.
Parameters:
- topic (str): News topic or category (e.g., "technology", "health", "sports")
Returns:
- List of recent news headlines and brief summaries
Example:
User: "Get me the latest health news"
Vienna calls: get_news("health")
Vienna says: "Here are the latest health headlines..."

Retrieves current stock or cryptocurrency prices.
Parameters:
- symbol (str): Stock ticker symbol (e.g., "AAPL", "GOOGL") or crypto symbol (e.g., "BTC", "ETH")
Returns:
- Current price, change, and percentage change
Example:
User: "What's Apple's stock price?"
Vienna calls: get_stock_price("AAPL")
Vienna says: "AAPL is currently trading at $188.40, up 2.3% today."