A real-time voice chat app powered by OpenAI's Realtime API with Slack integration for task handoff. Talk naturally, get instant voice responses, and delegate real work to an AI team in Slack.
```
┌────────────────┐    WebSocket    ┌──────────────┐    WebSocket    ┌──────────────────┐
│    Browser     │ ◄─────────────► │  Node Server │ ◄─────────────► │  OpenAI Realtime │
│ (AudioWorklet) │                 │  (Express)   │                 │  API (gpt-4o-mini│
│                │                 │              │                 │   -realtime)     │
└────────────────┘                 └──────┬───────┘                 └──────────────────┘
                                          │
                                      Slack API
                                          │
                                  ┌───────┴───────┐
                                  │  Slack Team   │
                                  │  (OpenClaw /  │
                                  │   Claude)     │
                                  └───────────────┘
```
Two-brain architecture:
- Voice AI (OpenAI Realtime) = "front desk" — instant conversational responses, zero tools
- Slack AI (Claude via OpenClaw) = "back office" — writes code, runs commands, does real work
- Voice transcripts are posted to Slack → Claude picks them up → results are polled back and read aloud
- 🎙️ Real-time voice conversation (~300ms latency)
- 💬 Text input for typing or pasting URLs
- 📎 Image upload (drag & drop or file picker) → posts to Slack
- 📰 Article reading — paste a URL, AI reads the full article aloud
- 🔄 Cross-device sync (theme, channel, cost, playback speed)
- 🎨 Three themes: Dark, Light, Neon
- 📱 PWA — Add to Home Screen with app icon
- ♾️ Infinite scroll chat history
- 💰 Real-time cost tracking
- 🔇 Mute/unmute AI voice
- ⏹️ Stop button to interrupt responses
- Node.js 18+
- OpenAI API key with Realtime API access
- Two Slack apps (see Slack Setup below)
- A Slack workspace with OpenClaw (or any bot) listening on channels
```bash
git clone https://github.com/youruser/clawd-voice-chat.git
cd clawd-voice-chat
npm install
cp .env.example .env
# Edit .env with your keys (see below)
node server.js
```

Create a `.env` file:
```bash
# Server
PORT=8470

# OpenAI — needs Realtime API access
OPENAI_API_KEY=sk-proj-your-key-here

# Slack Bot (your AI assistant's bot token)
SLACK_BOT_TOKEN=xoxb-your-bot-token

# Slack User (User app — posts voice transcripts as you)
SLACK_USER_TOKEN=xoxp-your-user-token

# Basic Auth (protects the web UI)
AUTH_USER=yourname
AUTH_PASS=your-secure-password
```

You need two separate Slack apps. This prevents the polling loop where the bot would read its own messages.
This app posts your voice transcripts to Slack so they appear as messages from you.
- Go to api.slack.com/apps → Create New App → From scratch
- Name it "Voice User" (or whatever), select your workspace
- Go to OAuth & Permissions
- Under User Token Scopes, add:
  - `chat:write` — post messages as you
  - `files:write` — upload images/files
  - `files:read` — read uploaded files
  - `channels:read` — list channels (for channel ID resolution)
  - `groups:read` — list private channels
  - `users:read` — resolve user info
  - `identify` — basic identity
- Click Install to Workspace → Authorize
- Copy the User OAuth Token (`xoxp-...`) → put in `.env` as `SLACK_USER_TOKEN`
This is your AI bot that does the actual work. If you're using OpenClaw, this is already set up.
- Create another Slack app (or use your existing bot)
- Go to OAuth & Permissions
- Under Bot Token Scopes, add:
  - `chat:write` — post bot messages
  - `channels:history` — read channel messages (for the poller)
  - `channels:read` — list channels
  - `groups:history` — read private channel messages
  - `groups:read` — list private channels
  - `users:read` — resolve bot user ID
  - `files:read` — read files (for image proxy)
  - `files:write` — upload files
- Install to Workspace → copy the Bot User OAuth Token (`xoxb-...`) → put in `.env` as `SLACK_BOT_TOKEN`
- Invite the bot to the channels you want to use (`/invite @Clawd`)
The server polls Slack for bot responses and reads them back via voice. If the voice transcripts were posted by the same bot, the poller would pick them up and create an infinite echo loop. The 🎙️ prefix on voice messages is an extra safety filter, but separate apps make it bulletproof.
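The filtering described above can be sketched as follows (a hypothetical helper, not the actual `server.js` code; the message shape mirrors Slack's `conversations.history` results):

```javascript
// Decide whether a polled Slack message should be relayed to the voice AI.
// Skips our own voice transcripts (🎙️ prefix) and anything not authored by
// the worker bot, which is what breaks the echo loop.
function shouldRelay(message, botUserId) {
  if (!message || typeof message.text !== 'string') return false;
  if (message.text.startsWith('🎙️')) return false; // our own voice transcript
  return message.user === botUserId;               // only the worker bot's replies
}
```

With two separate apps, `message.user` for a voice transcript is your user ID, never `botUserId`, so the prefix check is only a second line of defense.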
Edit `PROJECT_CONTEXTS` in `server.js` to map dropdown options to Slack channels:

```javascript
const PROJECT_CONTEXTS = {
  do: {
    name: '#do',
    slackChannel: '#do',
    context: 'General tasks and configuration.',
  },
  // Add more...
};
```

Each project provides:

- `name` — displayed in the dropdown
- `slackChannel` — where voice transcripts are posted and bot responses are polled
- `context` — additional system prompt context for the voice AI
```bash
# Install cloudflared, then create a tunnel
cloudflared tunnel create voice-chat
cloudflared tunnel route dns <tunnel-id> voice.yourdomain.com
```

Create `~/.cloudflared/config.yml`:

```yaml
tunnel: <tunnel-id>
credentials-file: ~/.cloudflared/<tunnel-id>.json
ingress:
  - hostname: voice.yourdomain.com
    service: http://localhost:8470
  - service: http_status:404
```

```bash
# Run it
cloudflared tunnel run voice-chat
```

Make sure to proxy WebSocket connections (Upgrade and Connection headers).
```bash
# Create systemd service
sudo tee /etc/systemd/system/clawd-voice-chat.service << 'EOF'
[Unit]
Description=Clawd Voice Chat
After=network.target

[Service]
Type=simple
User=your-user
WorkingDirectory=/path/to/clawd-voice-chat
ExecStart=/usr/bin/node server.js
Restart=always
RestartSec=3
Environment=NODE_ENV=production

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now clawd-voice-chat
```

The app includes a web manifest and service worker. On mobile:
- iOS: Share → Add to Home Screen
- Android: Three dots → Add to Home Screen
The service worker uses a network-first strategy — always fetches fresh code, falls back to cache if offline. Auto-checks for updates every 60 seconds.
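The network-first strategy can be sketched as a plain helper so the idea is visible outside a service worker (`fetchFn` and `cache` are stand-ins for the real `fetch` and Cache API objects; the actual `sw.js` may differ):

```javascript
// Network-first: always try the network and refresh the cache; only if the
// network fails, fall back to the last cached copy.
async function networkFirst(request, { fetchFn, cache }) {
  try {
    const fresh = await fetchFn(request);
    await cache.put(request, fresh); // keep the cached copy up to date
    return fresh;
  } catch {
    const cached = await cache.get(request);
    if (cached !== undefined) return cached;
    throw new Error('offline and not cached: ' + request);
  }
}
```

In a real service worker this logic lives inside a `fetch` event listener that calls `event.respondWith(...)`.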
- Browser captures mic audio via `AudioWorklet`
- Raw PCM → base64 → WebSocket → server → OpenAI Realtime API
- OpenAI's server-side VAD detects speech end → generates response
- Response audio streams back: OpenAI → server → browser → `AudioContext` playback
- Transcript of your speech is posted to Slack as you (🎙️ prefix)
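The PCM-to-base64 step above can be sketched as follows (Node's `Buffer` is used for brevity; the browser side does the same with `Int16Array`, `DataView`, and `btoa`):

```javascript
// Convert Float32 mic samples in [-1, 1] (as produced by an AudioWorklet)
// into little-endian 16-bit PCM, then base64 for the WebSocket payload.
function floatTo16BitPCMBase64(float32Samples) {
  const buf = Buffer.alloc(float32Samples.length * 2);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    buf.writeInt16LE(Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), i * 2);
  }
  return buf.toString('base64');
}
```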
- Your voice transcript appears in Slack channel
- Your AI bot (OpenClaw/Claude/etc.) sees it and does the work
- Bot posts response in the same channel
- Server polls every 3s, finds new bot messages
- Debounces 5s, truncates to 500 chars for voice
- Injects into OpenAI conversation → voice AI relays it back to you
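The debounce-and-truncate step above can be sketched as follows (helper names are assumptions, not the actual `server.js` code; the 5 s and 500-char constants mirror the text):

```javascript
const VOICE_MAX_CHARS = 500;

// Cut long bot replies down to something speakable, marking the truncation.
function truncateForVoice(text, max = VOICE_MAX_CHARS) {
  if (text.length <= max) return text;
  return text.slice(0, max - 1) + '…';
}

// Classic trailing debounce: wait until the bot has been quiet for `ms`
// before relaying, so multi-message replies arrive as one voice summary.
function debounce(fn, ms) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), ms);
  };
}

// e.g. const relayDebounced = debounce(relayToVoice, 5000);
```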
- Paste a URL in the text input
- Server fetches page → Mozilla Readability extracts clean text
- Text is chunked into 1500-char pieces
- Each chunk is sent sequentially to OpenAI (waits for `response.done` before the next)
- Previous chunk's conversation item is deleted to prevent context overflow
- Interruptible — speaking or hitting stop cancels remaining chunks
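The chunking step can be sketched as follows (a minimal splitter under the 1500-char limit from the text; the real one may split differently):

```javascript
// Split article text into ~1500-char chunks, preferring to break on the
// last space before the limit so words stay intact.
function chunkText(text, size = 1500) {
  const chunks = [];
  let rest = text.trim();
  while (rest.length > size) {
    let cut = rest.lastIndexOf(' ', size);
    if (cut <= 0) cut = size; // no space found: hard break
    chunks.push(rest.slice(0, cut));
    rest = rest.slice(cut).trimStart();
  }
  if (rest.length) chunks.push(rest);
  return chunks;
}
```

Each chunk would then be sent as one conversation item, awaiting the model's completion event before sending the next.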
- Drag & drop or 📎 button stages the image
- Type an optional message in the text input
- Send → Multer receives the file → Slack `files.uploadV2` API posts it
- Image appears in Slack with your message as the comment
- Slack poller detects the image → sends a `slack_image` event to the browser
- Browser renders a thumbnail with a click-to-zoom modal
```
clawd-voice-chat/
├── server.js            # Express server, WebSocket relay, Slack integration (~880 lines)
├── db.js                # SQLite database module (~100 lines)
├── package.json         # Dependencies
├── .env                 # API keys and config (not committed)
├── voice-chat.db        # SQLite database (auto-created)
└── public/
    ├── index.html       # Single-page app (~1100 lines)
    ├── manifest.json    # PWA manifest
    ├── sw.js            # Service worker
    ├── icon-192.png     # App icon (192x192)
    ├── icon-512.png     # App icon (512x512)
    └── avatar.jpg       # User avatar
```
```json
{
  "@mozilla/readability": "^0.6.0",
  "better-sqlite3": "^11.0.0",
  "dotenv": "^16.0.0",
  "express": "^4.18.0",
  "linkedom": "^0.16.0",
  "multer": "^1.4.0",
  "ws": "^8.18.0"
}
```

Record your screen and narrate bugs or feature requests — the AI analyzes the video and suggests fixes.
- Tap 🔴 next to the 📎 button
- Share your screen and talk through the issue
- Tap ⏹️ to stop — auto-uploads and analyzes
- GPT-4o Vision extracts frames + transcribes audio → identifies the bug → speaks the analysis back
Works on desktop Chrome. On mobile, falls back to video file upload (Android PWA doesn't support screen capture). Requires ffmpeg on the server.
Using gpt-4o-mini-realtime-preview:
- Input audio: $10/M tokens
- Output audio: $20/M tokens
- Text tokens are negligible
Typical conversation: $0.01-0.05 per back-and-forth. Article reading burns more ($0.10-0.50 per article due to audio output tokens). The cost tracker in the header shows daily spend in real-time.
The full gpt-4o-realtime-preview model is 10x more expensive. We chose mini for cost efficiency — the tradeoff is slightly less consistent accent/personality adherence.
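A quick sanity check of these numbers (a hypothetical helper mirroring the rates listed above, ignoring the negligible text tokens):

```javascript
// Estimate audio cost for gpt-4o-mini-realtime at the rates above:
// $10 per 1M input audio tokens, $20 per 1M output audio tokens.
function audioCostUSD({ inputTokens, outputTokens }) {
  return (inputTokens * 10 + outputTokens * 20) / 1_000_000;
}

// e.g. a short exchange of 1,000 input + 1,500 output audio tokens:
// audioCostUSD({ inputTokens: 1000, outputTokens: 1500 }) → 0.04
```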
Edit `BASE_INSTRUCTIONS` in `server.js` (~line 86). This is the system prompt for the voice AI.
CSS custom properties in `index.html`. Three themes are defined: dark, light, neon. Add more by creating a new `[data-theme="yourtheme"]` block and adding it to the `themeOrder` array in the JS.
In `server.js`, the `buildSessionConfig` function sets:

```javascript
turn_detection: {
  type: 'server_vad',
  threshold: 0.5,            // Speech detection sensitivity
  prefix_padding_ms: 300,    // Audio kept before speech detected
  silence_duration_ms: 800   // How long to wait after silence before responding
}
```

Lower `silence_duration_ms` = faster responses but more false triggers. Higher = more natural pauses but slower.
MIT