# Magpie TTS 357M — CoreML Conversion

Convert [NVIDIA Magpie TTS Multilingual 357M](https://huggingface.co/nvidia/magpie_tts_multilingual_357m) to CoreML for on-device iOS/macOS inference.

## Architecture

Magpie TTS is an encoder-decoder transformer that autoregressively generates discrete audio codec tokens; a NanoCodec vocoder then synthesizes the waveform from those tokens.

```
Text → Text Encoder → Cross-Attention Conditioning
Speaker ID → Context Emb → Decoder (AR) → Codec Tokens → NanoCodec → Audio
```

**Key specs:**
- 357M parameters total
- 9 languages (en, es, de, fr, it, vi, zh, hi, ja)
- 5 built-in speakers (John, Sofia, Aria, Jason, Leo)
- 22kHz output audio
- NanoCodec: 8 codebooks, 2016 codes each, 21.5 fps
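These specs pin down the codec's frame arithmetic. A quick sanity check in plain Python (assuming the published 21.5 fps frame rate and 2016-entry codebooks; 8 × ⌈log₂ 2016⌉ = 88 bits per frame, matching NanoCodec's advertised ~1.89 kbps):

```python
import math

FPS = 21.5            # codec frames per second
N_CODEBOOKS = 8       # parallel codebooks per frame
CODEBOOK_SIZE = 2016  # codes per codebook
SAMPLE_RATE = 22050   # output sample rate (22 kHz family)

# 8 codebooks * ceil(log2(2016)) = 8 * 11 = 88 bits per frame
bits_per_frame = N_CODEBOOKS * math.ceil(math.log2(CODEBOOK_SIZE))
bitrate_bps = bits_per_frame * FPS  # 1892 bps, i.e. ~1.89 kbps

def frames_to_seconds(n_frames: int) -> float:
    """Approximate audio duration produced by n_frames codec frames."""
    return n_frames / FPS

print(bitrate_bps)             # 1892.0
print(frames_to_seconds(215))  # 10.0 seconds of audio
```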

## CoreML Model Split

| Model | Purpose | Called |
|-------|---------|-------|
| `text_encoder.mlpackage` | Encode text → conditioning | Once per utterance |
| `decoder_prefill.mlpackage` | Batch-prefill speaker context into KV cache | Once per utterance (optional) |
| `decoder_step.mlpackage` | Single AR step with KV cache | ~50-200× per utterance |
| `nanocodec_decoder.mlpackage` | Codec tokens → audio | Once per utterance |
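The call pattern above can be sketched as a plain-Python driver with the four models stubbed out (the real models are `coremltools` packages invoked via `predict`; the function signatures and stop condition here are illustrative assumptions, not the actual I/O contract):

```python
# Hypothetical driver showing the once-per-utterance vs. per-step split.
# Each "model" is a stub; real code would call MLModel.predict instead.
def run_utterance(text_encoder, prefill, step, vocoder, max_steps=200):
    cond = text_encoder()        # once: text -> cross-attention conditioning
    kv_cache = prefill()         # once (optional): speaker context -> KV cache
    tokens = []
    for _ in range(max_steps):   # autoregressive loop, ~50-200 iterations
        tok, kv_cache, done = step(cond, kv_cache)
        tokens.append(tok)
        if done:
            break
    return vocoder(tokens)       # once: codec tokens -> waveform

# Toy stubs that stop after three steps, just to exercise the control flow.
audio = run_utterance(
    text_encoder=lambda: "cond",
    prefill=lambda: [],
    step=lambda cond, kv: (len(kv), kv + [0], len(kv) >= 2),
    vocoder=lambda toks: toks,
)
print(audio)  # [0, 1, 2]
```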

## Setup

```bash
uv sync

# For NeMo (required for model extraction & tokenization):
uv sync --extra nemo

# For Mandarin pypinyin/jieba export:
uv pip install pypinyin jieba
```

## Generation Steps

All constants and models must be generated before running the iOS app or the Python generation pipeline. Run these in order:

```bash
# 1. Export constants (embeddings, speaker data, model config)
# Requires: NeMo
# Produces: constants/*.npy, constants/*.json
python export_constants.py

# 1b. Export local transformer weights as individual .npy files
# Requires: NeMo (or use --from-pt with previously extracted checkpoint)
# Produces: constants/local_transformer/*.npy
python export_local_transformer.py

# 2. Export tokenizer dictionaries (phoneme dicts, token2id, heteronyms)
# Requires: NeMo
# Produces: constants/*_token2id.json, *_phoneme_dict.json, *_heteronyms.json,
# mandarin_phoneme_*.json, japanese_phoneme_*.json,
# tokenizer_metadata.json, tokenizer_references.json
python export_tokenizers.py

# 3. Create English alias files expected by Swift EnglishTokenizer
# Requires: Step 2 complete
# Produces: english_phoneme_dict.json, english_token2id.json
python export_tokenizer_aliases.py

# 4. Export Mandarin pypinyin and jieba dictionaries
# Requires: pypinyin, jieba
# Produces: mandarin_pypinyin_char_dict.json, mandarin_pypinyin_phrase_dict.json,
# mandarin_jieba_dict.json
python extras/export_pypinyin.py

# 5. Convert models to CoreML (.mlpackage)
# Requires: NeMo, coremltools
# Produces: build/*.mlpackage
python convert/convert_text_encoder.py
python convert/convert_decoder_prefill.py
python convert/convert_decoder_step.py
python convert/convert_nanocodec.py

# 6. Compile for iOS/macOS (.mlpackage → .mlmodelc)
# Requires: Xcode command line tools
# Produces: compiled/*.mlmodelc
python compile_mlmodelc.py

# 7. Test generation on macOS
python generate_coreml.py "Hello, this is Magpie TTS running on CoreML." --speaker 1
```
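After running the steps, a quick check that the expected artifacts exist can save a confusing runtime failure later. A minimal sketch (the file list mirrors a few representative outputs of steps 1–6 and is deliberately not exhaustive):

```python
from pathlib import Path

# Representative outputs of the generation steps (not a complete list).
EXPECTED = [
    "constants/constants.json",
    "constants/speaker_info.json",
    "constants/tokenizer_metadata.json",
    "build/text_encoder.mlpackage",
    "compiled/text_encoder.mlmodelc",
]

def missing_artifacts(root: str, expected=EXPECTED) -> list[str]:
    """Return the expected artifacts that do not exist under root."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).exists()]

# On a fresh checkout this reports every path as missing.
print(missing_artifacts("."))
```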

### CoreML Models

| File | Required | Notes |
|------|----------|-------|
| `text_encoder.mlmodelc` | Yes | Falls back to `.mlpackage` if `.mlmodelc` absent |
| `decoder_step.mlmodelc` | Yes | Falls back to `.mlpackage` |
| `nanocodec_decoder.mlmodelc` | Yes | Falls back to `.mlpackage` |
| `decoder_prefill.mlmodelc` | No | If absent, prefill runs step-by-step (slower) |

### Core Constants (always loaded)

| File | Generated by |
|------|-------------|
| `constants.json` | `export_constants.py` |
| `speaker_info.json` | `export_constants.py` |
| `tokenizer_metadata.json` | `export_tokenizers.py` |
| `speaker_0.npy` .. `speaker_4.npy` | `export_constants.py` |
| `audio_embedding_0.npy` .. `audio_embedding_7.npy` | `export_constants.py` |

### Local Transformer Weights (always loaded)

All files in `constants/local_transformer/`, generated by `export_local_transformer.py`:

| File | Shape |
|------|-------|
| `in_proj_weight.npy` | (256, 768) |
| `in_proj_bias.npy` | (256,) |
| `pos_emb.npy` | (10, 256) |
| `norm1_weight.npy`, `norm2_weight.npy` | (256,) |
| `sa_qkv_weight.npy` | (768, 256) |
| `sa_o_weight.npy` | (256, 256) |
| `ffn_conv1_weight.npy` | (1024, 256, 1) |
| `ffn_conv2_weight.npy` | (256, 1024, 1) |
| `out_proj_{0-7}_weight.npy` | 8 x (2024, 256) |
| `out_proj_{0-7}_bias.npy` | 8 x (2024,) |
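Loading these weights blindly invites silent shape bugs, so a shape check against the table above is worth a few lines. A sketch in pure Python over `(name, shape)` pairs, so it works before any `.npy` file is actually read:

```python
# Expected shapes from the table above (per-codebook heads expanded).
EXPECTED_SHAPES = {
    "in_proj_weight": (256, 768),
    "in_proj_bias": (256,),
    "pos_emb": (10, 256),
    "norm1_weight": (256,),
    "norm2_weight": (256,),
    "sa_qkv_weight": (768, 256),
    "sa_o_weight": (256, 256),
    "ffn_conv1_weight": (1024, 256, 1),
    "ffn_conv2_weight": (256, 1024, 1),
}
EXPECTED_SHAPES.update({f"out_proj_{i}_weight": (2024, 256) for i in range(8)})
EXPECTED_SHAPES.update({f"out_proj_{i}_bias": (2024,) for i in range(8)})

def check_shapes(actual: dict) -> list[str]:
    """Return human-readable mismatches between actual and expected shapes."""
    errors = []
    for name, want in EXPECTED_SHAPES.items():
        got = actual.get(name)
        if got != want:
            errors.append(f"{name}: expected {want}, got {got}")
    return errors
```

In practice `actual` would be built as `{p.stem: np.load(p).shape for p in Path("constants/local_transformer").glob("*.npy")}`.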

### Language Tokenizer Data (lazy-loaded per language)

#### English
| File | Generated by |
|------|-------------|
| `english_phoneme_dict.json` | `export_tokenizer_aliases.py` (alias of `english_phoneme_phoneme_dict.json`) |
| `english_token2id.json` | `export_tokenizer_aliases.py` (alias of `english_phoneme_token2id.json`) |

#### Spanish
| File | Generated by |
|------|-------------|
| `spanish_phoneme_phoneme_dict.json` | `export_tokenizers.py` |
| `spanish_phoneme_token2id.json` | `export_tokenizers.py` |

#### German
| File | Generated by |
|------|-------------|
| `german_phoneme_phoneme_dict.json` | `export_tokenizers.py` |
| `german_phoneme_token2id.json` | `export_tokenizers.py` |
| `german_phoneme_heteronyms.json` | `export_tokenizers.py` (optional, graceful fallback) |

#### Hindi
| File | Generated by |
|------|-------------|
| `hindi_chartokenizer_token2id.json` | `export_tokenizers.py` |

#### Mandarin
| File | Generated by |
|------|-------------|
| `mandarin_jieba_dict.json` | `extras/export_pypinyin.py` |
| `mandarin_pypinyin_char_dict.json` | `extras/export_pypinyin.py` |
| `mandarin_pypinyin_phrase_dict.json` | `extras/export_pypinyin.py` |
| `mandarin_phoneme_pinyin_dict.json` | `export_tokenizers.py` |
| `mandarin_phoneme_tone_dict.json` | `export_tokenizers.py` |
| `mandarin_phoneme_ascii_letter_dict.json` | `export_tokenizers.py` |
| `mandarin_phoneme_token2id.json` | `export_tokenizers.py` |

#### French / Italian / Vietnamese
No data files needed — these use ByT5 byte-level encoding (purely algorithmic).
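Byte-level encoding needs no dictionary because UTF-8 bytes map directly to ids. In the standard ByT5 vocabulary the first three ids are reserved for special tokens (pad=0, eos=1, unk=2), so byte `b` becomes id `b + 3`; whether Magpie's tokenizer uses exactly this offset is an assumption here:

```python
BYT5_OFFSET = 3  # ids 0-2 reserved for pad/eos/unk in standard ByT5

def byt5_encode(text: str) -> list[int]:
    """Map each UTF-8 byte of the input to a token id (no dictionary)."""
    return [b + BYT5_OFFSET for b in text.encode("utf-8")]

print(byt5_encode("hi"))  # [107, 108]
print(byt5_encode("où"))  # three ids: 'o' plus the two UTF-8 bytes of 'ù'
```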

#### Japanese
| File | Generated by |
|------|-------------|
| `japanese_phoneme_token2id.json` | `export_tokenizers.py` |
| `japanese_phoneme_punctuation.json` | `export_tokenizers.py` |
| `japanese_phoneme_ascii_letter_dict.json` | `export_tokenizers.py` |
| `open_jtalk_dic/` | [OpenJTalk dictionary](https://sourceforge.net/projects/open-jtalk/files/Dictionary/open_jtalk_dic-1.11/) (UTF-8) |

Japanese G2P uses [OpenJTalk](https://github.com/r9y9/open_jtalk) compiled as a static library via XCFramework. See [Building OpenJTalk](#building-openjtalk) below.

## Python File Layout

### Pipeline Scripts (root)

| Script | Purpose |
|--------|---------|
| `export_constants.py` | Export embeddings, speaker data, model config |
| `export_local_transformer.py` | Export local transformer weights as individual .npy files |
| `export_tokenizers.py` | Export per-language tokenizer dictionaries from NeMo |
| `export_tokenizer_aliases.py` | Create short-name English files expected by Swift |
| `extract_models.py` | Inspect architecture or extract components from .nemo checkpoint |
| `generate_coreml.py` | End-to-end TTS generation using converted CoreML models |
| `compile_mlmodelc.py` | Compile `.mlpackage` → `.mlmodelc` via `xcrun coremlcompiler` |

### Conversion Scripts (`convert/`)

| Script | Purpose |
|--------|---------|
| `convert_text_encoder.py` | Convert text encoder to CoreML |
| `convert_decoder_prefill.py` | Convert batch context prefill to CoreML |
| `convert_decoder_step.py` | Convert single-step decoder (with KV cache) to CoreML |
| `convert_nanocodec.py` | Convert NanoCodec vocoder to CoreML |

### Traceable Wrappers (`traceable/`)

PyTorch `nn.Module` wrappers that make NeMo model components traceable for CoreML conversion. They replace implicit state (e.g. in-place KV cache updates) with explicit tensor I/O.
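The pattern itself is framework-independent: a module that mutates its cache in place is replaced by a pure step function that takes the cache as an input and returns the updated cache as an output. A toy illustration with plain Python lists standing in for tensors (the real wrappers apply this to the decoder's KV tensors):

```python
# Untraceable style: hidden, mutating state.
class StatefulDecoder:
    def __init__(self):
        self.kv_cache = []           # implicit state, updated in place
    def step(self, token):
        self.kv_cache.append(token)  # side effect a tracer cannot see
        return sum(self.kv_cache)

# Traceable style: state becomes explicit input/output.
def traceable_step(token, kv_cache):
    new_cache = kv_cache + [token]   # pure: returns the updated cache
    return sum(new_cache), new_cache

# Both compute the same values; only the second has a fixed I/O signature.
cache = []
out1, cache = traceable_step(1, cache)
out2, cache = traceable_step(2, cache)
print(out1, out2)  # 1 3
```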

## Building OpenJTalk

Japanese text input requires the OpenJTalk G2P library, compiled as a static XCFramework. The build script handles cloning, cross-compiling, and packaging.

### Prerequisites

- Xcode with iOS SDK installed
- CMake (`brew install cmake`)

### Build

```bash
./build_openjtalk.sh
```

This produces:
- `ios/OpenJTalk.xcframework/` — static libraries for iOS device, iOS simulator, and macOS (arm64)
- `ios/COpenJTalk/` — Swift module map (`module.modulemap` + header)

The XCFramework is linked manually via `LIBRARY_SEARCH_PATHS` and `SWIFT_INCLUDE_PATHS` in `project.yml` (not as an Xcode framework dependency) to avoid module map collisions with the NemoTextProcessing XCFramework.

### MeCab Dictionary

Japanese tokenization also requires the OpenJTalk MeCab dictionary. Download and extract to `ios/constants/open_jtalk_dic/`:

```bash
curl -L "https://sourceforge.net/projects/open-jtalk/files/Dictionary/open_jtalk_dic-1.11/open_jtalk_dic_utf_8-1.11.tar.gz/download" | tar xz
mv open_jtalk_dic_utf_8-1.11/* ios/constants/open_jtalk_dic/
```

The dictionary (~102MB) contains `sys.dic`, `matrix.bin`, `char.bin`, `unk.dic`, and related files needed by MeCab's morphological analyzer.
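A presence check for those files before bundling the app can catch a bad extraction early. A sketch (the four names come from the sentence above; other dictionary files may also be required):

```python
from pathlib import Path

REQUIRED_DICT_FILES = ["sys.dic", "matrix.bin", "char.bin", "unk.dic"]

def missing_dict_files(dic_dir: str) -> list[str]:
    """Return required MeCab dictionary files absent from dic_dir."""
    base = Path(dic_dir)
    return [f for f in REQUIRED_DICT_FILES if not (base / f).exists()]
```

After the extraction step above, `missing_dict_files("ios/constants/open_jtalk_dic")` should return an empty list.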

## Speakers

| Index | Name | Description |
|-------|------|-------------|
| 0 | John | Male (LibriVox, public domain) |
| 1 | Sofia | Female (proprietary) |
| 2 | Aria | Female (proprietary) |
| 3 | Jason | Male (proprietary) |
| 4 | Leo | Male (proprietary) |

## References

- [Magpie TTS Model Card](https://huggingface.co/nvidia/magpie_tts_multilingual_357m)
- [NanoCodec](https://huggingface.co/nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps)
- [NeMo Framework](https://github.com/NVIDIA/NeMo)
- [NVIDIA Open Model License](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf)