A high-quality Flow-Matching based Text-to-Speech library using ONNX. This is a Python port of the Supertonic-2 web implementation.
- 10 Unique Voice Styles: Professional male and female voices.
- Auto-Downloader: Automatically fetches models from HuggingFace to a global cache (
~/.cache/ebook_reader_supertonic). - Word Timestamps: Heuristic estimation by default, with optional Vosk-based extraction (offline ASR) for better word timing.
- Adjustable Parameters: Control speed (0.9 - 1.4) and diffusion steps (3 - 14).
- Lightweight Inference: Runs on CPU/GPU via ONNX Runtime.
pip install ebook-reader-supertonicfrom ebook_reader_supertonic import SupertonicTTS, VOICE_STYLES, MIN_SPEED, MAX_SPEED
# 1. Initialize engine
# Models are automatically cached in ~/.cache/ebook_reader_supertonic
engine = SupertonicTTS()
# 2. Synthesize
# Returns:
# - audio: np.ndarray (float32, normalized -1 to 1)
# - sample_rate: int (44100)
# - word_timestamps: List[Dict] -> [{'word': str, 'start': float, 'end': float}]
audio, sr, word_timestamps = engine.synthesize(
text="Hello! Welcome to ebook-reader-supertonic.",
voice='F5',
speed=1.0,
steps=10,
# timestamps_backend="auto", # 'estimate' (default), 'vosk', or 'auto'
# vosk_model_path="path/to/vosk/model", # or set env VOSK_MODEL_PATH
)If you use timestamps_backend="auto" or "vosk" and no model path is configured, the package can auto-download the pinned model vosk-model-en-us-0.22-lgraph into ~/.cache/vosk.
Environment variables:
VOSK_MODEL_PATH: use an existing local model directory (disables download).VOSK_CACHE_DIR: override cache base (default~/.cache/vosk).VOSK_OFFLINE=1: forbid downloads (error for"vosk", fallback to estimate for"auto").EBOOK_READER_VOSK_AUTO_DOWNLOAD=0: disable auto-download behavior.
duration = len(audio) / sr print(f"Generated {duration:.2f}s of audio")
for segment in word_timestamps: print(f"{segment['word']}: {segment['start']}s -> {segment['end']}s")
engine.save_wav(audio, "output.wav")
## API Reference
### `SupertonicTTS.synthesize(text, voice='M3', steps=10, speed=1.0, lang=None)`
- **Parameters**:
- `text` (str): Text to synthesize.
- `voice` (str): Voice ID (`F1-F5`, `M1-M5`).
- `steps` (int): Diffusion steps (`MIN_STEPS=3` to `MAX_STEPS=14`).
- `speed` (float): Speed factor (`MIN_SPEED=0.9` to `MAX_SPEED=1.4`).
- `lang` (str): Manual language override (e.g., 'en', 'ko'). Auto-detects if None.
- **Returns**: `(audio_data, sample_rate, word_timestamps)`
### `VOICE_STYLES`
A list of Pydantic models containing voice metadata:
```python
voice = VOICE_STYLES[0]
print(voice.id) # 'F1'
print(voice.gender) # 'female'
print(voice.description) # 'Correct and natural...'
Izzet Sezer sezer@imsezer.com