Health check endpoint.

Response:

```json
{
  "status": "ok",
  "version": "0.1.0",
  "is_jetson": false,
  "power_mode": null,
  "tts_backend": "qwen",
  "stt_backend": null
}
```

Get detailed server information.
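Assuming the server listens on `localhost:8080` (the host used by the WebSocket endpoints below) and exposes this information at a route such as `/info` (a hypothetical path, not confirmed by this document), a minimal stdlib-only fetch might look like:

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8080"  # host taken from the WebSocket endpoints below

def fetch_server_info(base_url: str = BASE_URL) -> dict:
    """Fetch the detailed server info document as a dict."""
    # "/info" is an assumed route name; substitute the server's real path.
    with urlopen(f"{base_url}/info", timeout=5) as resp:
        return json.load(resp)
```

Against a running server, `fetch_server_info()["tts"]["loaded"]` would tell you whether a TTS backend is ready.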
Response:

```json
{
  "is_jetson": false,
  "power_mode": null,
  "tts": {
    "name": "qwen",
    "loaded": true,
    "supports_streaming": true
  },
  "stt": {
    "loaded": false
  }
}
```

List available TTS backends.
Response:

```json
{
  "backends": [
    {
      "name": "qwen",
      "loaded": true,
      "supports_streaming": true,
      "supports_voice_cloning": true
    },
    {
      "name": "piper",
      "loaded": false,
      "supports_streaming": true,
      "supports_voice_cloning": false
    }
  ]
}
```

Load a TTS backend.
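A sketch of issuing this load request, assuming a JSON POST route such as `/tts/load` (the path is a guess; only the request body documented below is confirmed):

```python
import json
from urllib.request import Request, urlopen

def build_load_request(model_size: str = "0.6B", device: str = "cuda") -> dict:
    """Build the documented request body for loading a TTS backend."""
    return {"model_size": model_size, "device": device}

def load_tts_backend(model_size: str = "0.6B", device: str = "cuda") -> dict:
    # "/tts/load" is an assumed route name, not confirmed by this document.
    req = Request(
        "http://localhost:8080/tts/load",
        data=json.dumps(build_load_request(model_size, device)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Model loading can be slow, hence the generous timeout.
    with urlopen(req, timeout=300) as resp:
        return json.load(resp)
```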
Request:

```json
{
  "model_size": "0.6B",
  "device": "cuda"
}
```

Response:
```json
{
  "success": true,
  "message": "Loaded TTS backend: qwen",
  "backend": {
    "name": "qwen",
    "loaded": true
  }
}
```

Get available voices for current backend.
Response:

```json
{
  "voices": [
    {
      "id": "ryan",
      "name": "Ryan",
      "language": "multilingual",
      "gender": "male",
      "description": "Neutral (default)"
    },
    {
      "id": "serena",
      "name": "Serena",
      "language": "multilingual",
      "gender": "female",
      "description": "Warm"
    }
  ],
  "languages": ["English", "Chinese", "Japanese"]
}
```

Synthesize speech from text. Returns WAV audio.
Request:

```json
{
  "text": "Hello world",
  "voice": "ryan",
  "language": "English",
  "temperature": 1.0
}
```

Response Headers:
```
Content-Type: audio/wav
X-Duration: 1.5
X-Sample-Rate: 24000
X-Voice: ryan
```

Response Body: WAV audio binary data
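Putting the request and response together: a sketch that posts the documented body, saves the returned WAV bytes, and reads the `X-Duration` header. The `/tts/synthesize` path is an assumption; only the body and headers above are documented.

```python
import json
from urllib.request import Request, urlopen

def synthesize_to_file(text: str, out_path: str, voice: str = "ryan") -> float:
    """Save the returned WAV audio and report the X-Duration header value."""
    # "/tts/synthesize" is an assumed route name for this endpoint.
    req = Request(
        "http://localhost:8080/tts/synthesize",
        data=json.dumps({"text": text, "voice": voice,
                         "language": "English", "temperature": 1.0}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=60) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
        return float(resp.headers.get("X-Duration", "0"))
```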
Synthesize speech and return base64 audio.
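With this variant the audio arrives inline, so the only extra step is base64-decoding it back to WAV bytes. A sketch (the `/tts/synthesize_base64` path is a guess, not taken from this document):

```python
import base64
import json
from urllib.request import Request, urlopen

def decode_audio(payload: dict) -> bytes:
    """Recover the WAV bytes from the response's audio_base64 field."""
    return base64.b64decode(payload["audio_base64"])

def synthesize_base64(text: str, voice: str = "ryan") -> bytes:
    # "/tts/synthesize_base64" is an assumed route name for this variant.
    req = Request(
        "http://localhost:8080/tts/synthesize_base64",
        data=json.dumps({"text": text, "voice": voice}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=60) as resp:
        return decode_audio(json.load(resp))
```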
Request:

```json
{
  "text": "Hello world",
  "voice": "ryan"
}
```

Response:
```json
{
  "success": true,
  "duration": 1.5,
  "sample_rate": 24000,
  "voice": "ryan",
  "audio_base64": "UklGRi..."
}
```

List available STT backends.
Response:

```json
{
  "backends": [
    {
      "name": "whisper",
      "loaded": false,
      "supports_streaming": false
    }
  ]
}
```

Load an STT backend.
Request:

```json
{
  "model_size": "base",
  "device": "cuda"
}
```

Get supported languages.
Response:

```json
{
  "languages": ["en", "zh", "ja", "ko", "de", "fr", "es"]
}
```

Transcribe audio file.
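Uploading a file for transcription means assembling a multipart/form-data body with the two fields documented below. A stdlib-only sketch (the `/stt/transcribe` path is an assumption):

```python
import json
import uuid
from urllib.request import Request, urlopen

def build_multipart(filename: str, audio: bytes, language=None):
    """Assemble a multipart/form-data body with 'audio' and optional 'language' fields."""
    boundary = uuid.uuid4().hex
    parts = [
        (f'--{boundary}\r\nContent-Disposition: form-data; name="audio"; '
         f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n').encode()
        + audio + b"\r\n"
    ]
    if language:
        parts.append(
            (f'--{boundary}\r\nContent-Disposition: form-data; name="language"'
             f'\r\n\r\n{language}\r\n').encode()
        )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), boundary

def transcribe_file(path: str, language=None) -> dict:
    # "/stt/transcribe" is an assumed route name, not confirmed by this document.
    with open(path, "rb") as f:
        body, boundary = build_multipart(path, f.read(), language)
    req = Request(
        "http://localhost:8080/stt/transcribe",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urlopen(req, timeout=120) as resp:
        return json.load(resp)
```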
Request: multipart/form-data

- audio: Audio file (WAV, MP3, etc.)
- language: Language code (optional)
Response:

```json
{
  "success": true,
  "text": "Hello world",
  "language": "en",
  "duration": 1.5,
  "segments": [
    {
      "text": "Hello world",
      "start": 0.0,
      "end": 1.5,
      "confidence": 0.95
    }
  ]
}
```

Endpoint: ws://localhost:8080/tts/stream
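A client for this stream might collect the base64 audio chunks into a single buffer. A sketch using the third-party `websockets` package (imported lazily so the snippet loads without it); the message shapes follow the protocol documented below:

```python
import base64
import json

async def stream_tts(text: str, voice: str = "ryan", language: str = "English") -> bytes:
    """Concatenate the streamed audio chunks into a single byte string."""
    import websockets  # third-party: pip install websockets

    audio = b""
    async with websockets.connect("ws://localhost:8080/tts/stream") as ws:
        await ws.send(json.dumps({"text": text, "voice": voice, "language": language}))
        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "audio":
                audio += base64.b64decode(msg["data"])
            elif msg["type"] == "done":
                break
            elif msg["type"] == "error":
                raise RuntimeError(msg["error"])
    return audio
```

Against a live server this would be driven with `asyncio.run(stream_tts("Hello. This is streaming."))`.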
Client → Server:

```json
{
  "text": "Hello. This is streaming.",
  "voice": "ryan",
  "language": "English"
}
```

Server → Client (start):

```json
{
  "type": "start",
  "chunks": 2
}
```

Server → Client (audio):

```json
{
  "type": "audio",
  "chunk": 1,
  "data": "UklGRi...",
  "duration": 0.8
}
```

Server → Client (done):

```json
{
  "type": "done",
  "total_time": 2.5
}
```

Server → Client (error):

```json
{
  "type": "error",
  "error": "Model not loaded"
}
```

Endpoint: ws://localhost:8080/stt/stream
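A matching client sketch for the STT stream: perform the start handshake, send binary PCM frames, then stop and wait for the final transcript. The helper shows the 16-bit little-endian PCM layout the server expects; `websockets` is again a lazily imported third-party assumption.

```python
import json
import struct

def floats_to_pcm16(samples) -> bytes:
    """Pack floats in [-1.0, 1.0] as 16-bit little-endian PCM."""
    return b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )

async def stream_stt(pcm_chunks, language: str = "en", sample_rate: int = 16000) -> str:
    """Drive the start -> audio -> stop handshake and return the final transcript."""
    import websockets  # third-party: pip install websockets

    async with websockets.connect("ws://localhost:8080/stt/stream") as ws:
        await ws.send(json.dumps({"type": "start", "language": language,
                                  "sample_rate": sample_rate}))
        ready = json.loads(await ws.recv())
        if ready["type"] != "ready":
            raise RuntimeError(f"unexpected message: {ready}")
        for chunk in pcm_chunks:
            await ws.send(chunk)  # binary frame of 16-bit PCM
        await ws.send(json.dumps({"type": "stop"}))
        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "done":
                return msg["text"]
    return ""
```

Interim `segment` messages (with `is_final: false`) are simply skipped here; a live-captioning client would display them instead.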
Client → Server (start):

```json
{
  "type": "start",
  "language": "en",
  "sample_rate": 16000
}
```

Server → Client (ready):

```json
{
  "type": "ready"
}
```

Client → Server: Binary audio chunks (16-bit PCM)

Server → Client (segment):

```json
{
  "type": "segment",
  "text": "Hello",
  "is_final": false
}
```

Client → Server (stop):

```json
{
  "type": "stop"
}
```

Server → Client (done):

```json
{
  "type": "done",
  "text": "Hello world"
}
```

Python SDK usage:

```python
from jetson_assistant import Engine

# Create engine
engine = Engine()

# Load TTS backend
engine.load_tts_backend("qwen", model_size="0.6B")

# Synthesize
result = engine.synthesize("Hello world", voice="serena")
result.save("output.wav")

# Or play directly
engine.say("Hello world")
```

```python
# Load STT backend
engine.load_stt_backend("whisper", model_size="base")

# Transcribe file
result = engine.transcribe("audio.wav")
print(result.text)

# Transcribe numpy array
import numpy as np

audio = np.zeros(16000, dtype=np.int16)
result = engine.transcribe(audio, sample_rate=16000)
```

```python
# Stream synthesis
for chunk in engine.synthesize_stream("Long text here"):
    # Process each chunk
    print(f"Chunk: {chunk.duration}s")

# Stream and play
engine.say("Long text", stream=True)
```

```python
# Convert document to audio
result = engine.synthesize_file(
    "document.pdf",
    output="document.wav",
    voice="ryan",
)
```