# Magpie TTS 357M — CoreML Conversion

Convert [NVIDIA Magpie TTS Multilingual 357M](https://huggingface.co/nvidia/magpie_tts_multilingual_357m) to CoreML for on-device iOS/macOS inference.

## Architecture

Magpie TTS is an encoder-decoder transformer that autoregressively generates discrete audio codec tokens; a NanoCodec vocoder then synthesizes the waveform from those tokens.

```
Text → Text Encoder → Cross-Attention Conditioning
Speaker ID → Context Emb → Decoder (AR) → Codec Tokens → NanoCodec → Audio
```

**Key specs:**
- 357M parameters total
- 9 languages (en, es, de, fr, it, vi, zh, hi, ja)
- 5 built-in speakers (John, Sofia, Aria, Jason, Leo)
- 22kHz output audio
- NanoCodec: 8 codebooks, 2016 codes each, 21.5 fps
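These specs pin down the codec's frame arithmetic. A quick sanity check in plain Python (assuming the published 21.5 fps frame rate and 2016-entry codebooks; 8 × ⌈log₂ 2016⌉ = 88 bits per frame, matching NanoCodec's advertised ~1.89 kbps):

```python
import math

FPS = 21.5            # codec frames per second
N_CODEBOOKS = 8       # parallel codebooks per frame
CODEBOOK_SIZE = 2016  # codes per codebook
SAMPLE_RATE = 22050   # output sample rate (22 kHz family)

# 8 codebooks * ceil(log2(2016)) = 8 * 11 = 88 bits per frame
bits_per_frame = N_CODEBOOKS * math.ceil(math.log2(CODEBOOK_SIZE))
bitrate_bps = bits_per_frame * FPS  # 1892 bps, i.e. ~1.89 kbps

def frames_to_seconds(n_frames: int) -> float:
    """Approximate audio duration produced by n_frames codec frames."""
    return n_frames / FPS

print(bitrate_bps)             # 1892.0
print(frames_to_seconds(215))  # 10.0 seconds of audio
```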

## CoreML Model Split

| Model | Purpose | Called |
|-------|---------|-------|
| `text_encoder.mlpackage` | Encode text → conditioning | Once per utterance |
| `decoder_prefill.mlpackage` | Batch-prefill speaker context into KV cache | Once per utterance (optional) |
| `decoder_step.mlpackage` | Single AR step with KV cache | ~50-200× per utterance |
| `nanocodec_decoder.mlpackage` | Codec tokens → audio | Once per utterance |
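The call pattern above can be sketched as a plain-Python driver with the four models stubbed out (the real models are `coremltools` packages invoked via `predict`; the function signatures and stop condition here are illustrative assumptions, not the actual I/O contract):

```python
# Hypothetical driver showing the once-per-utterance vs. per-step split.
# Each "model" is a stub; real code would call MLModel.predict instead.
def run_utterance(text_encoder, prefill, step, vocoder, max_steps=200):
    cond = text_encoder()        # once: text -> cross-attention conditioning
    kv_cache = prefill()         # once (optional): speaker context -> KV cache
    tokens = []
    for _ in range(max_steps):   # autoregressive loop, ~50-200 iterations
        tok, kv_cache, done = step(cond, kv_cache)
        tokens.append(tok)
        if done:
            break
    return vocoder(tokens)       # once: codec tokens -> waveform

# Toy stubs that stop after three steps, just to exercise the control flow.
audio = run_utterance(
    text_encoder=lambda: "cond",
    prefill=lambda: [],
    step=lambda cond, kv: (len(kv), kv + [0], len(kv) >= 2),
    vocoder=lambda toks: toks,
)
print(audio)  # [0, 1, 2]
```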

## Setup

```bash
uv sync

# For NeMo (required for model extraction & tokenization):
uv sync --extra nemo

# For Mandarin pypinyin/jieba export:
uv pip install pypinyin jieba
```

## Generation Steps

All constants and models must be generated before running the iOS app or the Python generation pipeline. Run these in order:

```bash
# 1. Export constants (embeddings, speaker data, model config)
# Requires: NeMo
# Produces: constants/*.npy, constants/*.json
python export_constants.py

# 1b. Export local transformer weights as individual .npy files
# Requires: NeMo (or use --from-pt with previously extracted checkpoint)
# Produces: constants/local_transformer/*.npy
python export_local_transformer.py

# 2. Export tokenizer dictionaries (phoneme dicts, token2id, heteronyms)
# Requires: NeMo
# Produces: constants/*_token2id.json, *_phoneme_dict.json, *_heteronyms.json,
# mandarin_phoneme_*.json, japanese_phoneme_*.json,
# tokenizer_metadata.json, tokenizer_references.json
python export_tokenizers.py

# 3. Create English alias files expected by Swift EnglishTokenizer
# Requires: Step 2 complete
# Produces: english_phoneme_dict.json, english_token2id.json
python export_tokenizer_aliases.py

# 4. Export Mandarin pypinyin and jieba dictionaries
# Requires: pypinyin, jieba
# Produces: mandarin_pypinyin_char_dict.json, mandarin_pypinyin_phrase_dict.json,
# mandarin_jieba_dict.json
python extras/export_pypinyin.py

# 5. Convert models to CoreML (.mlpackage)
# Requires: NeMo, coremltools
# Produces: build/*.mlpackage
python convert/convert_text_encoder.py
python convert/convert_decoder_prefill.py
python convert/convert_decoder_step.py
python convert/convert_nanocodec.py

# 6. Compile for iOS/macOS (.mlpackage → .mlmodelc)
# Requires: Xcode command line tools
# Produces: compiled/*.mlmodelc
python compile_mlmodelc.py

# 7. Test generation on macOS
python generate_coreml.py "Hello, this is Magpie TTS running on CoreML." --speaker 1
```
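After running the steps, a quick check that the expected artifacts exist can save a confusing runtime failure later. A minimal sketch (the file list mirrors a few representative outputs of steps 1–6 and is deliberately not exhaustive):

```python
from pathlib import Path

# Representative outputs of the generation steps (not a complete list).
EXPECTED = [
    "constants/constants.json",
    "constants/speaker_info.json",
    "constants/tokenizer_metadata.json",
    "build/text_encoder.mlpackage",
    "compiled/text_encoder.mlmodelc",
]

def missing_artifacts(root: str, expected=EXPECTED) -> list[str]:
    """Return the expected artifacts that do not exist under root."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).exists()]

# On a fresh checkout this reports every path as missing.
print(missing_artifacts("."))
```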

### CoreML Models

| File | Required | Notes |
|------|----------|-------|
| `text_encoder.mlmodelc` | Yes | Falls back to `.mlpackage` if `.mlmodelc` absent |
| `decoder_step.mlmodelc` | Yes | Falls back to `.mlpackage` |
| `nanocodec_decoder.mlmodelc` | Yes | Falls back to `.mlpackage` |
| `decoder_prefill.mlmodelc` | No | If absent, prefill runs step-by-step (slower) |

### Core Constants (always loaded)

| File | Generated by |
|------|-------------|
| `constants.json` | `export_constants.py` |
| `speaker_info.json` | `export_constants.py` |
| `tokenizer_metadata.json` | `export_tokenizers.py` |
| `speaker_0.npy` .. `speaker_4.npy` | `export_constants.py` |
| `audio_embedding_0.npy` .. `audio_embedding_7.npy` | `export_constants.py` |

### Local Transformer Weights (always loaded)

All files in `constants/local_transformer/`, generated by `export_local_transformer.py`:

| File | Shape |
|------|-------|
| `in_proj_weight.npy` | (256, 768) |
| `in_proj_bias.npy` | (256,) |
| `pos_emb.npy` | (10, 256) |
| `norm1_weight.npy`, `norm2_weight.npy` | (256,) |
| `sa_qkv_weight.npy` | (768, 256) |
| `sa_o_weight.npy` | (256, 256) |
| `ffn_conv1_weight.npy` | (1024, 256, 1) |
| `ffn_conv2_weight.npy` | (256, 1024, 1) |
| `out_proj_{0-7}_weight.npy` | 8 x (2024, 256) |
| `out_proj_{0-7}_bias.npy` | 8 x (2024,) |
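Loading these weights blindly invites silent shape bugs, so a shape check against the table above is worth a few lines. A sketch in pure Python over `(name, shape)` pairs, so it works before any `.npy` file is actually read:

```python
# Expected shapes from the table above (per-codebook heads expanded).
EXPECTED_SHAPES = {
    "in_proj_weight": (256, 768),
    "in_proj_bias": (256,),
    "pos_emb": (10, 256),
    "norm1_weight": (256,),
    "norm2_weight": (256,),
    "sa_qkv_weight": (768, 256),
    "sa_o_weight": (256, 256),
    "ffn_conv1_weight": (1024, 256, 1),
    "ffn_conv2_weight": (256, 1024, 1),
}
EXPECTED_SHAPES.update({f"out_proj_{i}_weight": (2024, 256) for i in range(8)})
EXPECTED_SHAPES.update({f"out_proj_{i}_bias": (2024,) for i in range(8)})

def check_shapes(actual: dict) -> list[str]:
    """Return human-readable mismatches between actual and expected shapes."""
    errors = []
    for name, want in EXPECTED_SHAPES.items():
        got = actual.get(name)
        if got != want:
            errors.append(f"{name}: expected {want}, got {got}")
    return errors
```

In practice `actual` would be built as `{p.stem: np.load(p).shape for p in Path("constants/local_transformer").glob("*.npy")}`.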

### Language Tokenizer Data (lazy-loaded per language)

#### English
| File | Generated by |
|------|-------------|
| `english_phoneme_dict.json` | `export_tokenizer_aliases.py` (alias of `english_phoneme_phoneme_dict.json`) |
| `english_token2id.json` | `export_tokenizer_aliases.py` (alias of `english_phoneme_token2id.json`) |

#### Spanish
| File | Generated by |
|------|-------------|
| `spanish_phoneme_phoneme_dict.json` | `export_tokenizers.py` |
| `spanish_phoneme_token2id.json` | `export_tokenizers.py` |

#### German
| File | Generated by |
|------|-------------|
| `german_phoneme_phoneme_dict.json` | `export_tokenizers.py` |
| `german_phoneme_token2id.json` | `export_tokenizers.py` |
| `german_phoneme_heteronyms.json` | `export_tokenizers.py` (optional, graceful fallback) |

#### Hindi
| File | Generated by |
|------|-------------|
| `hindi_chartokenizer_token2id.json` | `export_tokenizers.py` |

#### Mandarin
| File | Generated by |
|------|-------------|
| `mandarin_jieba_dict.json` | `extras/export_pypinyin.py` |
| `mandarin_pypinyin_char_dict.json` | `extras/export_pypinyin.py` |
| `mandarin_pypinyin_phrase_dict.json` | `extras/export_pypinyin.py` |
| `mandarin_phoneme_pinyin_dict.json` | `export_tokenizers.py` |
| `mandarin_phoneme_tone_dict.json` | `export_tokenizers.py` |
| `mandarin_phoneme_ascii_letter_dict.json` | `export_tokenizers.py` |
| `mandarin_phoneme_token2id.json` | `export_tokenizers.py` |

#### French / Italian / Vietnamese
No data files needed — these use ByT5 byte-level encoding (purely algorithmic).
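Byte-level encoding needs no dictionary because UTF-8 bytes map directly to ids. In the standard ByT5 vocabulary the first three ids are reserved for special tokens (pad=0, eos=1, unk=2), so byte `b` becomes id `b + 3`; whether Magpie's tokenizer uses exactly this offset is an assumption here:

```python
BYT5_OFFSET = 3  # ids 0-2 reserved for pad/eos/unk in standard ByT5

def byt5_encode(text: str) -> list[int]:
    """Map each UTF-8 byte of the input to a token id (no dictionary)."""
    return [b + BYT5_OFFSET for b in text.encode("utf-8")]

print(byt5_encode("hi"))  # [107, 108]
print(byt5_encode("où"))  # three ids: 'o' plus the two UTF-8 bytes of 'ù'
```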

#### Japanese
| File | Generated by |
|------|-------------|
| `japanese_phoneme_token2id.json` | `export_tokenizers.py` |
| `japanese_phoneme_punctuation.json` | `export_tokenizers.py` |
| `japanese_phoneme_ascii_letter_dict.json` | `export_tokenizers.py` |
| `open_jtalk_dic/` | [OpenJTalk dictionary](https://sourceforge.net/projects/open-jtalk/files/Dictionary/open_jtalk_dic-1.11/) (UTF-8) |

Japanese G2P uses [OpenJTalk](https://github.com/r9y9/open_jtalk) compiled as a static library via XCFramework. See [Building OpenJTalk](#building-openjtalk) below.

## Python File Layout

### Pipeline Scripts (root)

| Script | Purpose |
|--------|---------|
| `export_constants.py` | Export embeddings, speaker data, model config |
| `export_local_transformer.py` | Export local transformer weights as individual .npy files |
| `export_tokenizers.py` | Export per-language tokenizer dictionaries from NeMo |
| `export_tokenizer_aliases.py` | Create short-name English files expected by Swift |
| `extract_models.py` | Inspect architecture or extract components from .nemo checkpoint |
| `generate_coreml.py` | End-to-end TTS generation using converted CoreML models |
| `compile_mlmodelc.py` | Compile `.mlpackage` → `.mlmodelc` via `xcrun coremlcompiler` |

### Conversion Scripts (`convert/`)

| Script | Purpose |
|--------|---------|
| `convert_text_encoder.py` | Convert text encoder to CoreML |
| `convert_decoder_prefill.py` | Convert batch context prefill to CoreML |
| `convert_decoder_step.py` | Convert single-step decoder (with KV cache) to CoreML |
| `convert_nanocodec.py` | Convert NanoCodec vocoder to CoreML |

### Traceable Wrappers (`traceable/`)

PyTorch `nn.Module` wrappers that make NeMo model components traceable for CoreML conversion. They replace implicit state (e.g. in-place KV cache updates) with explicit tensor I/O.
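The pattern itself is framework-independent: a module that mutates its cache in place is replaced by a pure step function that takes the cache as an input and returns the updated cache as an output. A toy illustration with plain Python lists standing in for tensors (the real wrappers apply this to the decoder's KV tensors):

```python
# Untraceable style: hidden, mutating state.
class StatefulDecoder:
    def __init__(self):
        self.kv_cache = []           # implicit state, updated in place
    def step(self, token):
        self.kv_cache.append(token)  # side effect a tracer cannot see
        return sum(self.kv_cache)

# Traceable style: state becomes explicit input/output.
def traceable_step(token, kv_cache):
    new_cache = kv_cache + [token]   # pure: returns the updated cache
    return sum(new_cache), new_cache

# Both compute the same values; only the second has a fixed I/O signature.
cache = []
out1, cache = traceable_step(1, cache)
out2, cache = traceable_step(2, cache)
print(out1, out2)  # 1 3
```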

## Building OpenJTalk

Japanese text input requires the OpenJTalk G2P library, compiled as a static XCFramework. The build script handles cloning, cross-compiling, and packaging.

### Prerequisites

- Xcode with iOS SDK installed
- CMake (`brew install cmake`)

### Build

```bash
./build_openjtalk.sh
```

This produces:
- `ios/OpenJTalk.xcframework/` — static libraries for iOS device, iOS simulator, and macOS (arm64)
- `ios/COpenJTalk/` — Swift module map (`module.modulemap` + header)

The XCFramework is linked manually via `LIBRARY_SEARCH_PATHS` and `SWIFT_INCLUDE_PATHS` in `project.yml` (not as an Xcode framework dependency) to avoid module map collisions with the NemoTextProcessing XCFramework.

### MeCab Dictionary

Japanese tokenization also requires the OpenJTalk MeCab dictionary. Download and extract to `ios/constants/open_jtalk_dic/`:

```bash
curl -L "https://sourceforge.net/projects/open-jtalk/files/Dictionary/open_jtalk_dic-1.11/open_jtalk_dic_utf_8-1.11.tar.gz/download" | tar xz
mv open_jtalk_dic_utf_8-1.11/* ios/constants/open_jtalk_dic/
```

The dictionary (~102MB) contains `sys.dic`, `matrix.bin`, `char.bin`, `unk.dic`, and related files needed by MeCab's morphological analyzer.
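A presence check for those files before bundling the app can catch a bad extraction early. A sketch (the four names come from the sentence above; other dictionary files may also be required):

```python
from pathlib import Path

REQUIRED_DICT_FILES = ["sys.dic", "matrix.bin", "char.bin", "unk.dic"]

def missing_dict_files(dic_dir: str) -> list[str]:
    """Return required MeCab dictionary files absent from dic_dir."""
    base = Path(dic_dir)
    return [f for f in REQUIRED_DICT_FILES if not (base / f).exists()]
```

After the extraction step above, `missing_dict_files("ios/constants/open_jtalk_dic")` should return an empty list.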

## Speakers

| Index | Name | Description |
|-------|------|-------------|
| 0 | John | Male (LibriVox, public domain) |
| 1 | Sofia | Female (proprietary) |
| 2 | Aria | Female (proprietary) |
| 3 | Jason | Male (proprietary) |
| 4 | Leo | Male (proprietary) |

## References

- [Magpie TTS Model Card](https://huggingface.co/nvidia/magpie_tts_multilingual_357m)
- [NanoCodec](https://huggingface.co/nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps)
- [NeMo Framework](https://github.com/NVIDIA/NeMo)
- [NVIDIA Open Model License](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf)