FluidInference · JarbasAl · Mar 16, 2026
diff --git a/models/stt/parakeet-tdt-ctc-110m/coreml/README.md b/models/stt/parakeet-tdt-ctc-110m/coreml/README.md
@@ -1,17 +1,21 @@
-# Parakeet‑TDT CTC 110M — CoreML Export (Mel+Encoder+CTC)
+# Parakeet-TDT-CTC 110M — CoreML Export
 
-This target contains tools to export NVIDIA's `nvidia/parakeet-tdt_ctc-110m` hybrid RNNT/CTC model to CoreML, with an emphasis on the **CTC branch** used for keyword spotting and custom vocabulary support.
+This directory contains tools to export NVIDIA's `nvidia/parakeet-tdt_ctc-110m` hybrid RNNT/CTC model to CoreML.
 
-The goal is to produce a CoreML bundle that mirrors NeMo's preprocessing and CTC head, so that Swift-side CTC keyword spotting can rely on **model-correct** acoustics rather than the Argmax MelSpectrogram export.
+The hybrid model has two decoder heads sharing one encoder:
+- **TDT (Token Duration Transducer)**: Primary transcription head with duration prediction — used by FluidAudio for ASR
+- **CTC**: Auxiliary head for keyword spotting and custom vocabulary support
 
 ## Layout
 
 ```text
 mobius/models/stt/parakeet-tdt-ctc-110m/coreml
-├── README.md              # This file
-├── convert-coreml.py      # CLI for exporting fused Mel+Encoder+CTC to CoreML
-├── pyproject.toml         # Per-target Python environment (NeMo + coremltools)
-└── audio/                 # Optional trace audio (15s 16kHz) for export
+├── README.md                  # This file
+├── convert-tdt-coreml.py      # TDT export: fused mel+encoder, RNNT decoder, joint decision
+├── convert-coreml.py          # CTC export: fused mel+encoder+CTC head (for keyword spotting)
+├── individual_components.py   # Shared torch.nn.Module wrappers for CoreML tracing
+├── pyproject.toml             # Per-target Python environment (NeMo + coremltools)
+└── audio/                     # Optional trace audio (15s 16kHz) for export
 ```
 
 ## Environment
@@ -24,43 +28,75 @@ uv sync
 
 This will create/update a local environment pinned by `pyproject.toml` and `uv.lock` (Python 3.10.12, NeMo, coremltools, etc.).
 
-## Export CoreML (Mel+Encoder+CTC)
+## Export TDT (for FluidAudio ASR)
 
-The `convert-coreml.py` script exports a fused module:
+The `convert-tdt-coreml.py` script exports the TDT components used by FluidAudio:
 
 ```text
-audio_signal [1, S] (16kHz mono, 15s window)
-    ↓ preprocessor (NeMo)
-mel spectrogram
-    ↓ encoder (NeMo)
-encoder features
-    ↓ ctc_decoder (NeMo aux_ctc branch)
-log_probs [1, T, V+1] (CTC log-probabilities)
+Preprocessor.mlpackage     — fused waveform → mel → encoder features
+Decoder.mlpackage          — RNNT prediction network (LSTM)
+JointDecision.mlpackage    — joint network (full T×U grid, with TDT duration)
+JointDecisionSingleStep.mlpackage — single-step joint (for streaming)
+vocab.json                 — SentencePiece vocabulary (array format)
+metadata.json              — model dimensions and export configuration
 ```
 
 Usage:
 
 ```bash
-uv run python convert-coreml.py convert \
-  --model-id nvidia/parakeet-tdt_ctc-110m \
-  --output-dir parakeet_ctc_coreml
+# From pretrained (downloads from HuggingFace)
+uv run python convert-tdt-coreml.py \
+  --output-dir parakeet_tdt_coreml \
+  --audio-path audio/trace_15s.wav
+
+# From local .nemo checkpoint
+uv run python convert-tdt-coreml.py \
+  --nemo-path ./parakeet-tdt-ctc-110m.nemo \
+  --output-dir parakeet_tdt_coreml \
+  --audio-path audio/trace_15s.wav
+
+# Reuse a previously exported mel+encoder
+uv run python convert-tdt-coreml.py \
+  --reuse-encoder parakeet_tdt_coreml/Preprocessor.mlpackage \
+  --output-dir parakeet_tdt_coreml_v2
 ```
 
-Notes:
-- Export uses a fixed 15s window (audio_signal shape [1,240000]) to avoid runtime shape issues in CoreML preview.
-- Two-model export:
-  - `parakeet_ctc_mel_encoder.mlpackage` — waveform -> encoder, encoder_length
-  - `parakeet_ctc_decoder.mlpackage` — encoder -> log_probs
+Key differences from the 0.6B export:
+- **Fused frontend**: mel spectrogram + encoder are a single `Preprocessor.mlpackage` (0.6B has separate Preprocessor + Encoder)
+- **iOS 18 deployment target**: required for the integer ops in the encoder's positional encoding
+- **Smaller dimensions**: encoderDim=512, decoderHidden=640, decoderLayers=1, vocabSize=1024
 
-Outputs:
+### Using with FluidAudio
 
-- `parakeet_ctc_mel_encoder.mlpackage` — waveform → CTC log_probs + encoder_length
-- `metadata.json` — input/output shapes and export configuration
+After export, compile the `.mlpackage` files to `.mlmodelc`:
 
-## Next Steps (Swift Integration)
+```bash
+xcrun coremlcompiler compile Preprocessor.mlpackage output_dir/
+xcrun coremlcompiler compile Decoder.mlpackage output_dir/
+xcrun coremlcompiler compile JointDecisionSingleStep.mlpackage output_dir/
+# Rename to match FluidAudio's expected name
+mv output_dir/JointDecisionSingleStep.mlmodelc output_dir/JointDecision.mlmodelc
+cp vocab.json output_dir/
+```
+
+Then run with:
+
+```bash
+fluidaudiocli transcribe audio.wav --model-version tdt-ctc-110m --model-dir output_dir/
+```
+
+## Export CTC (for keyword spotting)
+
+The `convert-coreml.py` script exports the CTC branch for keyword spotting:
+
+```bash
+uv run python convert-coreml.py convert \
+  --model-id nvidia/parakeet-tdt_ctc-110m \
+  --output-dir parakeet_ctc_coreml
+```
 
-Once exported and validated:
+This produces:
+- `parakeet_ctc_mel_encoder.mlpackage` — waveform → encoder features
+- `parakeet_ctc_decoder.mlpackage` — encoder → CTC log-probabilities
 
-- Point FluidAudio's CTC loader at the new `parakeet_ctc_mel_encoder.mlpackage` instead of `argmaxinc/ctckit-pro`.
-- Re-run the CTC keyword spotter parity checks (NeMo Python vs CoreML) on Kokoro TTS clips.
-- When parity is acceptable, re-enable time-aligned CTC boosting in Swift for Argmax-style Custom Vocabulary.
+Note: The CTC head is trained with an auxiliary loss and produces blank-dominant outputs. It is not suitable for standalone greedy transcription; use the TDT export for that.
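
The note on blank-dominant CTC outputs can be illustrated with a toy greedy decoder. This is only a sketch and is not code from this PR: the function names, the 3-token vocabulary, and the choice of the last index as the blank id are assumptions made for illustration (the removed README text describes the output as `[1, T, V+1]`, which suggests one extra class for blank, but the actual blank index should be checked against the exported model).

```python
from typing import List, Optional, Sequence


def ctc_greedy_decode(log_probs: Sequence[Sequence[float]], blank_id: int) -> List[int]:
    """Greedy CTC decode: per-frame argmax, merge repeats, drop blanks."""
    tokens: List[int] = []
    prev: Optional[int] = None
    for frame in log_probs:
        best = max(range(len(frame)), key=lambda i: frame[i])
        # Emit only when the argmax changes and is not the blank symbol.
        if best != prev and best != blank_id:
            tokens.append(best)
        prev = best
    return tokens


def max_token_logprob(log_probs: Sequence[Sequence[float]], token_id: int) -> float:
    """Best per-frame log-probability of one token, ignoring the argmax."""
    return max(frame[token_id] for frame in log_probs)


# Hypothetical toy data: 3 frames, 3 tokens plus blank at index 3 (assumed).
BLANK_ID = 3
frames = [
    [-5.0, -6.0, -7.0, -0.1],
    [-1.2, -6.0, -7.0, -0.4],  # token 0 is plausible here, but blank still wins
    [-5.0, -6.0, -7.0, -0.1],
]

decoded = ctc_greedy_decode(frames, BLANK_ID)
token0_score = max_token_logprob(frames, 0)
```

In this toy input the blank wins every frame, so greedy decoding emits nothing, while the per-token score for token 0 still carries signal in the second frame. That is roughly why a blank-dominant head can remain useful for keyword spotting even though it is a poor standalone transcriber.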