diff --git a/README.md b/README.md
index 6405c27..6a785c2 100644
--- a/README.md
+++ b/README.md
@@ -26,7 +26,7 @@ graph TB
     TTS["tts.py
ElevenLabs → OpenAI
→ Polly (fallback)
Cache (MD5)
Circuit breaker"]
     Owner["owner_channel.py
Signal notify
Signal poll (3s)
Slash commands
Instructions"]
     Contact["contact_lookup.py
contacts.json
Twilio CNAM
E.164 normalize
Lang from prefix"]
-    I18n["i18n.py
11+ languages
Signal templates
Polly voices
Twilio codes"]
+    I18n["i18n.py
13+ languages
Signal templates
Polly voices
Twilio codes"]
     end
     SignalCLI["signal-cli :8080
REST API
Native mode
Self-hosted"]
@@ -185,7 +185,7 @@ sequenceDiagram
 ```mermaid
 flowchart TD
-    Start([CALL START]) --> Prefix["Phone prefix detection
+41 → de-CH
+48 → pl-PL
+44 → en-GB
(52 prefixes)"]
+    Start([CALL START]) --> Prefix["Phone prefix detection
+41 → de-CH
+48 → pl-PL
+44 → en-GB
(54 prefixes)"]
     Prefix --> ContactCheck{Contact has
lang override?}
     ContactCheck -->|Yes| ContactLang["Use contact language
contacts.json
e.g. {name: ..., lang: pl}"]
diff --git a/ROADMAP.md b/ROADMAP.md
index 1e42eef..4642d50 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -8,10 +8,13 @@
 - [x] Core call flow: Twilio STT → GPT-4o → ElevenLabs TTS
 - [x] Signal integration: notifications, live updates, owner instructions mid-call
-- [x] Multilingual support: 8+ languages, auto-detection, mid-call switching
+- [x] Multilingual support: 13+ languages, auto-detection, mid-call switching
 - [x] Maya persona with full OWNER_CONTEXT customization
 - [x] TTS provider chain with circuit breaker (ElevenLabs → OpenAI → Polly)
 - [x] TTS disk cache (MD5-keyed, persistent)
+- [x] Switch to gpt-4o-mini for conversation
+- [x] Reduce max_tokens and system prompt
+- [x] Groq LLaMA 3 support
 - [x] Contact lookup (local JSON + Twilio CNAM)
 - [x] Streaming GPT-4o with first-sentence TTS pipelining
 - [x] Twilio signature validation + rate limiting
@@ -19,7 +22,7 @@
 - [x] Call recording (on/off via Signal `/recording-on`)
 - [x] API usage & cost tracking per call (GPT tokens, TTS chars, estimated cost)
 - [x] `/stats` with session costs and per-call cost history
-- [x] `speech_timeout` reduced to 2s (from 5s)
+- [x] `speech_timeout` reduced to 1s (from 5s)
 
 ---
 
@@ -31,20 +34,8 @@ _Nothing currently in progress._
 ## Short-term (quick wins)
 
-### Switch to gpt-4o-mini for conversation
-- [ ] Set `OPENAI_MODEL=gpt-4o-mini` (env var, no code changes)
-- [ ] Test quality for Polish/German/English phone conversations
-- [ ] Benchmark latency improvement (~0.5-0.8s vs ~2s for gpt-4o)
-- [ ] Keep gpt-4o for summarization only (separate model config)
-- **Impact**: ~1.2s latency reduction, 10x cheaper tokens
-
-### Reduce max_tokens and system prompt
-- [ ] Lower `max_tokens` from 350 to 150 (2-3 sentences is enough)
-- [ ] Trim system prompt (remove redundant rules, shorten examples)
-- **Impact**: ~0.3-0.5s latency reduction
-
 ### Pre-warm TTS cache on startup
-- [ ] Generate and cache all greetings (8 languages) at boot
+- [ ] Generate and cache all greetings (13 languages) at boot
 - [ ] Cache all clarification phrases and no-input prompts
 - **Impact**: eliminates 1-2s TTS delay on first call per language
@@ -52,14 +43,6 @@ _Nothing currently in progress._
 ## Medium-term (new capabilities)
 
-### Groq LLaMA 3 support
-- [ ] Add Groq as alternative LLM backend (OpenAI-compatible API)
-- [ ] Make LLM provider configurable via env var (`LLM_PROVIDER=openai|groq`)
-- [ ] Test Polish/German quality on LLaMA 3 70B/405B
-- [ ] Benchmark: expected ~0.2-0.3s inference time
-- **Impact**: fastest possible LLM response, free tier available
-- **Risk**: weaker multilingual support vs GPT
-
 ### Configurable speech_timeout
 - [ ] Make `speech_timeout` configurable via env var
 - [ ] Consider per-language tuning (some languages have longer pauses)
@@ -102,12 +85,12 @@ Current response time breakdown (user stops speaking → hears response):
 | Stage | Current | With gpt-4o-mini | With Groq | With Realtime API |
 |-------|---------|-------------------|-----------|-------------------|
-| speech_timeout | 2.0s | 2.0s | 2.0s | N/A |
+| speech_timeout | 1.0s | 1.0s | 1.0s | N/A |
 | Twilio STT | 0.5s | 0.5s | 0.5s | N/A |
 | LLM inference | 2.0s | 0.7s | 0.3s | — |
 | TTS generation | 1.5s | 1.5s | 1.5s | — |
 | Network/playback | 0.5s | 0.5s | 0.5s | — |
-| **Total** | **~6.5s** | **~5.2s** | **~4.8s** | **~1-2s** |
+| **Total** | **~5.5s** | **~4.2s** | **~3.8s** | **~1-2s** |
 
 _With complementary optimizations (shorter prompt, lower max_tokens, pre-warm cache): subtract ~0.5-1s._
diff --git a/docs/INSTALL_EN.md b/docs/INSTALL_EN.md
index e746913..e82b9d7 100644
--- a/docs/INSTALL_EN.md
+++ b/docs/INSTALL_EN.md
@@ -657,7 +657,7 @@ AVA includes the following security mechanisms:
 │ │                                                     │ │
 │ │ owner_channel.py ─── contact_lookup.py ─── i18n.py  │ │
 │ │                                                     │ │
-│ │ Signal notify        contacts.json       11+ langs  │ │
+│ │ Signal notify        contacts.json       13+ langs  │ │
 │ │ Signal poll (3s)     CNAM lookup         Signal     │ │
 │ │ Slash commands       Lang from prefix    templates  │ │
 │ │ Owner instructions   Per-contact lang               │ │
diff --git a/docs/INSTALL_PL.md b/docs/INSTALL_PL.md
index fbe8d98..68206df 100644
--- a/docs/INSTALL_PL.md
+++ b/docs/INSTALL_PL.md
@@ -655,7 +655,7 @@ AVA posiada nastepujace mechanizmy bezpieczenstwa:
 │ │                                                      │ │
 │ │ owner_channel.py ─── contact_lookup.py ─── i18n.py   │ │
 │ │                                                      │ │
-│ │ Signal powiad.       contacts.json     11+ jezykow   │ │
+│ │ Signal powiad.       contacts.json     13+ jezykow   │ │
 │ │ Signal poll (3s)     CNAM lookup       Signal        │ │
 │ │ Slash komendy        Jezyk z prefiksu  szablony      │ │
 │ │ Instrukcje           Per-kontakt jezyk               │ │
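
The README flowchart this patch touches (contact `lang` override first, then phone-prefix detection, then a default) can be sketched as below. This is an illustrative sketch, not AVA's actual code: `resolve_language`, `PREFIX_LANG`, and `DEFAULT_LANG` are hypothetical names, and the table is a three-entry subset of the 54 supported prefixes.

```python
# Hypothetical sketch of the call-language resolution described in the
# README flowchart. Names and the tiny prefix table are illustrative.

PREFIX_LANG = {          # subset of the 54 supported prefixes
    "+41": "de-CH",
    "+48": "pl-PL",
    "+44": "en-GB",
}
DEFAULT_LANG = "en-US"   # assumed fallback for unknown prefixes


def resolve_language(caller_e164, contact=None):
    """Contact-level override beats prefix detection, per the flowchart."""
    if contact and contact.get("lang"):
        return contact["lang"]
    # Longest match first, so a more specific prefix wins over e.g. "+4".
    for length in (4, 3, 2):
        lang = PREFIX_LANG.get(caller_e164[:length])
        if lang:
            return lang
    return DEFAULT_LANG


print(resolve_language("+41791234567", None))                           # de-CH (prefix)
print(resolve_language("+48601234567", {"name": "Ana", "lang": "pl"}))  # pl (override)
```

Checking longer prefixes before shorter ones matters because E.164 country codes vary in length; a flat dict plus longest-match-first lookup keeps the table trivial to extend.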