Base URL: `http://127.0.0.1:8000`

Kokoro WebUI provides two API groups:

- Native API - Full-featured endpoints under `/api/*` and `ws://.../ws/*`
- OpenAI-compatible API - Drop-in replacement for OpenAI's `/v1/audio/speech`

| Endpoint | Method | Description |
|---|---|---|
| `/api/health` | GET | Check server status and queue |
| `/api/capabilities` | GET | List voices, formats, and limits |
| `/api/speak` | POST | Generate single audio file |
| `/api/speak-stream` | POST | Stream audio as NDJSON |
| `/ws/speak-stream` | WS | Stream audio over WebSocket |
| `/v1/audio/speech` | POST | OpenAI-compatible endpoint |

For practical examples, see examples/README.md.
Authentication is optional. When `KOKORO_REQUIRE_AUTH=1` is set:

- HTTP routes accept `Authorization: Bearer <key>` or `X-API-Key: <key>`
- WebSocket connections accept the bearer header during the handshake
- The built-in Web UI uses short-lived session tokens instead of the API key
- API keys in the query string are not supported, for security reasons

Failed auth attempts are rate-limited per client. After too many failures, you'll get 429 Too Many Requests with a Retry-After header.
See CONFIGURATION.md for details on setting up authentication.
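
A client could apply the rules above roughly as in this minimal sketch. The helper names `auth_headers` and `retry_after_seconds` are hypothetical and not part of Kokoro WebUI; only the header names come from the text. It also assumes `Retry-After` carries a number of seconds rather than an HTTP date.

```python
from typing import Mapping, Optional


def auth_headers(api_key: Optional[str], use_bearer: bool = True) -> dict:
    """Build auth headers for HTTP routes; empty when no key is configured."""
    if not api_key:
        return {}
    if use_bearer:
        return {"Authorization": f"Bearer {api_key}"}
    return {"X-API-Key": api_key}


def retry_after_seconds(
    status: int, headers: Mapping[str, str], default: float = 1.0
) -> Optional[float]:
    """Return how long to wait after a 429, or None when no wait is needed."""
    if status != 429:
        return None
    # Assumption: Retry-After is expressed in seconds, not as an HTTP date.
    value = headers.get("Retry-After")
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        # Header missing or not numeric: fall back to a caller-chosen default.
        return default
```

Either header style should work for HTTP routes; for WebSocket connections only the bearer form is described above.
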
These fields work across most native synthesis endpoints:

| Field | Type | Required | Notes |
|---|---|---|---|
| `text` | string | yes | 1 to 2500 characters |
| `voice` | string | no | Default: `af_heart` |
| `speed` | number | no | 0.5 to 1.8, default 1.0 |
| `pitch` | number | no | -6.0 to 6.0 semitones, default 0.0 |
| `format` | string | no | `pcm`, `wav`, or `opus` (server-dependent) |
| `opus_bitrate` | string | no | `16k`, `24k`, `32k`, or `48k` |
| `wav_sample_rate` | string | no | `native`, `16000`, `22050`, `24000`, `44100`, `48000` |

Notes:

- `pitch: 0.0` skips pitch processing entirely (faster)
- Pitch shifting requires ffmpeg with the rubberband filter
- Check `/api/capabilities` to see which formats your server supports

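
The limits in the table can be checked client-side before sending a request. The sketch below is one possible way to do that; `validate_request` is a hypothetical helper, not something the server or its client libraries provide, and the server remains the authority on what it accepts.

```python
from numbers import Real

OPUS_BITRATES = ("16k", "24k", "32k", "48k")
WAV_SAMPLE_RATES = ("native", "16000", "22050", "24000", "44100", "48000")


def _in_range(value, low, high) -> bool:
    """True for real numbers (excluding bool) inside the closed interval."""
    return isinstance(value, Real) and not isinstance(value, bool) and low <= value <= high


def validate_request(payload: dict, formats=("pcm", "wav", "opus")) -> list:
    """Return a list of problems; an empty list means the payload looks valid."""
    problems = []
    text = payload.get("text")
    if not isinstance(text, str) or not 1 <= len(text) <= 2500:
        problems.append("text must be a string of 1 to 2500 characters")
    if not _in_range(payload.get("speed", 1.0), 0.5, 1.8):
        problems.append("speed must be between 0.5 and 1.8")
    if not _in_range(payload.get("pitch", 0.0), -6.0, 6.0):
        problems.append("pitch must be between -6.0 and 6.0 semitones")
    # Pass the formats reported by /api/capabilities to narrow this check.
    if "format" in payload and payload["format"] not in formats:
        problems.append("format must be one of: " + ", ".join(formats))
    if "opus_bitrate" in payload and payload["opus_bitrate"] not in OPUS_BITRATES:
        problems.append("opus_bitrate must be 16k, 24k, 32k, or 48k")
    if "wav_sample_rate" in payload and str(payload["wav_sample_rate"]) not in WAV_SAMPLE_RATES:
        problems.append("wav_sample_rate is not a supported value")
    return problems
```
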
`GET /api/health` returns server status, queue information, and runtime health.

Example response:

```json
{
  "ok": true,
  "missing": [],
  "active_provider": "CPUExecutionProvider",
  "gpu": {
    "available": false,
    "process_vram_used_mb": 0
  },
  "queue": {
    "worker_limit": 2,
    "queue_limit": 8,
    "active_jobs": 0,
    "queued_jobs": 0,
    "available_slots": 10
  }
}
```

Key fields:

- `ok` - `true` when everything is working
- `missing` - List of missing dependencies or assets
- `active_provider` - Current runtime (CPU, CUDA, etc.)
- `queue` - Current load and capacity

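
A caller might use the health response to decide whether to submit work. The following is a sketch only: `can_accept_work` and `fetch_health` are hypothetical names, and treating `available_slots` of 0 as "do not send" is an assumption about how the queue behaves, based on the field names in the example response.

```python
import json
import urllib.request


def can_accept_work(health: dict) -> bool:
    """True when the server reports ok and at least one free slot."""
    queue = health.get("queue") or {}
    return bool(health.get("ok")) and queue.get("available_slots", 0) > 0


def fetch_health(base_url: str = "http://127.0.0.1:8000") -> dict:
    """GET /api/health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/health") as response:
        return json.load(response)
```

For example, `can_accept_work(fetch_health())` could gate a batch job against a running server.
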
`GET /api/capabilities` returns detailed information about available features.

Example response:

```json
{
  "voices": ["af_heart", "af_sarah", "am_michael"],
  "formats": ["wav", "opus", "pcm"],
  "pitch_shifting": true,
  "synthesis_workers": 2,
  "websocket_streaming": true
}
```

Use this to populate UI controls or validate requests.

`POST /api/speak` generates a single audio file from text.

Request:

```json
{
  "text": "Hello world",
  "voice": "af_heart",
  "speed": 1.0,
  "pitch": 0.0,
  "format": "wav"
}
```

Response:

Returns raw audio bytes:

- `audio/pcm` for `format: pcm`
- `audio/wav` for `format: wav`
- `audio/ogg` for `format: opus`

Response headers include metadata:

| Header | Description |
|---|---|
| `X-Audio-Format` | Format used |
| `X-Sample-Rate` | Sample rate in Hz |
| `X-Audio-Duration` | Duration in seconds |

Errors:

- `400` - Invalid request (bad voice, format not supported, etc.)
- `503` - Server overloaded (queue full)

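
A request to `/api/speak` can be made with the Python standard library alone. This is a minimal sketch under the assumptions that the server runs at the base URL above and that bearer auth is used when a key is supplied; `build_speak_request` and `speak_to_file` are hypothetical helper names.

```python
import json
import urllib.request


def build_speak_request(text, base_url="http://127.0.0.1:8000", api_key=None, **options):
    """Build (but do not send) a POST request for /api/speak."""
    body = {"text": text, **options}  # options: voice, speed, pitch, format, ...
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/api/speak",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def speak_to_file(text, path, **kwargs):
    """Send the request, write the audio bytes to `path`, return metadata headers."""
    request = build_speak_request(text, **kwargs)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses such as 400 or 503.
    with urllib.request.urlopen(request) as response:
        audio = response.read()
        metadata = {
            name: response.headers.get(name)
            for name in ("X-Audio-Format", "X-Sample-Rate", "X-Audio-Duration")
        }
    with open(path, "wb") as out:
        out.write(audio)
    return metadata
```

With a server running, `speak_to_file("Hello world", "output.wav", voice="af_heart", format="wav")` would write the audio and return the three metadata headers.
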
`POST /api/speak-stream` streams audio chunks as they are generated, using NDJSON (Newline Delimited JSON).

Request:

Same as `/api/speak`, plus one optional field:

| Field | Type | Default | Description |
|---|---|---|---|
| `target_chunk_chars` | integer | 360 | Target characters per chunk (80-2000) |

Response:
Content-Type: application/x-ndjson
Each line is a JSON object with a type field:

Meta message (first):

```json
{
  "type": "meta",
  "total_chunks": 3,
  "format": "wav"
}
```

Chunk message (one per audio chunk):

```json
{
  "type": "chunk",
  "chunk_index": 0,
  "total_chunks": 3,
  "text": "First sentence.",
  "duration_sec": 1.28,
  "audio_base64": "..."
}
```

Error message (if a chunk fails):

```json
{
  "type": "error",
  "detail": "Pitch shifting failed",
  "chunk_index": 1
}
```

Done message (last):

```json
{
  "type": "done",
  "total_chunks": 3
}
```

`/ws/speak-stream` is a WebSocket endpoint for real-time streaming.

Usage:

- Connect to `ws://127.0.0.1:8000/ws/speak-stream`
- Send a JSON message with synthesis parameters
- Receive messages in the same format as NDJSON streaming

Example client payload:

```json
{
  "text": "WebSocket streaming example",
  "voice": "af_heart",
  "format": "wav",
  "target_chunk_chars": 360
}
```

WebSocket is ideal for real-time applications where you want to start playing audio before the full text is synthesized.

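
Because the NDJSON and WebSocket streams share one message sequence (meta, chunk, error, done), a single consumer can serve both. The sketch below is one way to write it; `iter_audio_chunks` is a hypothetical name, and stopping at the first `error` message is a design choice made here, not behavior the server requires.

```python
import base64
import json
from typing import Iterable, Iterator, Union


def iter_audio_chunks(messages: Iterable[Union[str, bytes]]) -> Iterator[bytes]:
    """Yield decoded audio bytes from a sequence of JSON messages, in arrival order."""
    for raw in messages:
        if not raw.strip():
            continue  # tolerate blank lines between NDJSON records
        message = json.loads(raw)
        kind = message.get("type")
        if kind == "chunk":
            yield base64.b64decode(message["audio_base64"])
        elif kind == "error":
            # Choice made in this sketch: abort on the first failed chunk.
            raise RuntimeError(
                f"chunk {message.get('chunk_index')}: {message.get('detail')}"
            )
        elif kind == "done":
            return
        # "meta" and any unrecognized types carry no audio and are skipped.
```

For NDJSON, the response object returned by `urllib.request.urlopen` can be passed directly, since iterating over it yields one line at a time; for WebSocket, pass the received text frames.
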
Kokoro WebUI implements the OpenAI audio speech API, making it a drop-in replacement for applications that use OpenAI's TTS.
Base URL: http://127.0.0.1:8000/v1
`POST /v1/audio/speech` provides OpenAI-compatible speech generation.

Request:

```json
{
  "model": "kokoro",
  "input": "Hello from the compatible endpoint",
  "voice": "af_heart",
  "response_format": "wav",
  "speed": 1.0
}
```

Fields:

| Field | Type | Required | Notes |
|---|---|---|---|
| `model` | string | yes | Ignored (accepted for compatibility) |
| `input` | string | yes | Text to synthesize (1-4096 chars) |
| `voice` | string | yes | Voice ID, optionally with a pitch suffix such as `af_heart+2.0` |
| `response_format` | string | no | `pcm`, `wav`, or `opus` |
| `speed` | number | no | OpenAI allows 0.25-4.0, but Kokoro only supports 0.5-1.8 |

Pitch with OpenAI API:

The OpenAI API doesn't have a pitch field, so we use voice suffixes:

- `af_heart` - Normal pitch (0.0)
- `af_heart+2.0` - Raise pitch 2 semitones
- `af_heart-1.5` - Lower pitch 1.5 semitones

Suffix must be between -6.0 and +6.0.

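
On the client side, the suffix can be built from a numeric pitch value. This is a small sketch with a hypothetical helper, `voice_with_pitch`; formatting to one decimal place mirrors the examples above and is an assumption about what the server expects.

```python
def voice_with_pitch(voice: str, semitones: float = 0.0) -> str:
    """Return a voice ID with a signed pitch suffix, e.g. 'af_heart+2.0'."""
    if not -6.0 <= semitones <= 6.0:
        raise ValueError("pitch must be between -6.0 and +6.0 semitones")
    rounded = round(semitones, 1)  # one decimal place, as in the examples above
    if rounded == 0:
        return voice  # no suffix means normal pitch
    return f"{voice}{rounded:+.1f}"
```

The result can be passed as the `voice` field of a `/v1/audio/speech` request.
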
Response:

Returns raw audio bytes:

- `wav` and `pcm` stream progressively (chunked transfer encoding)
- `opus` is returned only after the full render completes

Headers:

| Header | Description |
|---|---|
| `X-OpenAI-Compatible` | Always `kokoro` |
| `X-Audio-Format` | Format used |

Errors:

Returns OpenAI-compatible error format:

```json
{
  "error": {
    "message": "speed must be between 0.5 and 1.8 for Kokoro",
    "type": "invalid_request_error"
  }
}
```

Returns available models (just `kokoro`).

Response:

```json
{
  "object": "list",
  "data": [
    {
      "id": "kokoro",
      "object": "model",
      "owned_by": "kokoro-webui"
    }
  ]
}
```

Returns metadata for a specific model.

Response:

```json
{
  "id": "kokoro",
  "object": "model",
  "owned_by": "kokoro-webui"
}
```

Errors follow one consistent format within each API group:

Native API errors:

```json
{
  "detail": "Voice 'invalid_voice' not found"
}
```

OpenAI-compatible errors:

```json
{
  "error": {
    "message": "Voice 'invalid_voice' not found",
    "type": "invalid_request_error",
    "param": null,
    "code": null
  }
}
```

Common status codes:

| Code | Meaning |
|---|---|
| 400 | Bad request (invalid parameters) |
| 401 | Unauthorized (auth required but missing) |
| 429 | Too many requests (rate limited) |
| 503 | Service unavailable (queue full) |
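
A client that talks to both API groups may want one function that reads either error shape. The sketch below shows one approach; `error_message` and `should_retry` are hypothetical helpers, and treating 429 and 503 as retryable is an assumption drawn from their meanings in the table above.

```python
import json

RETRYABLE_STATUS = {429, 503}  # rate limited, queue full


def error_message(body: bytes) -> str:
    """Extract a human-readable message from either error format."""
    try:
        data = json.loads(body)
    except (ValueError, TypeError):
        return "unparseable error response"
    if isinstance(data, dict):
        error = data.get("error")
        if isinstance(error, dict) and "message" in error:
            return str(error["message"])  # OpenAI-compatible shape
        if "detail" in data:
            return str(data["detail"])  # native shape
    return "unknown error"


def should_retry(status: int) -> bool:
    """True for status codes that describe a temporary condition."""
    return status in RETRYABLE_STATUS
```
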

Python (OpenAI client):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="not-needed",  # Or your API key if auth is enabled
)

# Stream the response body straight to disk instead of buffering it in memory
with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    input="Hello world",
    voice="af_heart",
    response_format="wav",
) as response:
    response.stream_to_file("output.wav")
```

LangChain:

```python
# Check that your installed langchain-community version provides OpenAITTS
from langchain_community.tools import OpenAITTS

tts = OpenAITTS(
    base_url="http://127.0.0.1:8000/v1",
    api_key="not-needed",  # Or your API key if auth is enabled
)

audio = tts.invoke({
    "input": "Hello from LangChain",
    "voice": "af_heart",
})
```

For complete examples in multiple languages (curl, Python, JavaScript), see examples/README.md.

While we aim for compatibility, there are some differences:

- No SSE streaming - `stream_format: "sse"` returns an error
- Pitch via suffix - Use `voice+2.0` instead of a separate pitch field
- Speed limits - Kokoro only supports 0.5x to 1.8x speed
- Model ignored - The `model` field is accepted but ignored
- Native features missing - No access to `lang`, `opus_bitrate`, or `target_chunk_chars` via OpenAI endpoints

For full access to all features, use the native /api/* endpoints.