feat: add TDT CoreML export for parakeet-tdt-ctc-110m #25
JarbasAl wants to merge 1 commit into FluidInference:main from
Conversation
Add convert-tdt-coreml.py, which exports the TDT decoder components (fused mel+encoder, RNNT decoder LSTM, joint decision with duration) instead of the CTC head. The CTC export only produces blank-dominant log-probabilities unsuitable for greedy transcription in hybrid models.

Components:
- convert-tdt-coreml.py: full TDT export pipeline (iOS 18 target)
- individual_components.py: shared torch.nn.Module wrappers for tracing
- Updated README.md: documents both TDT and CTC export paths
- Updated pyproject.toml: adds script entry point and includes
```python
):
    logits = self.joint(encoder_outputs, decoder_outputs)
    token_logits = logits[..., : self.vocab_with_blank]
    duration_logits = logits[..., -self.num_extra :]
```
🔴 `-0:` slice returns all logits when `num_extra == 0`, producing incorrect duration outputs
When num_extra is 0 (plain RNNT model without TDT duration head), logits[..., -self.num_extra :] evaluates to logits[..., -0:] which in Python is equivalent to logits[..., 0:] — returning all logits instead of an empty tensor. This means duration_logits would contain the full joint output (vocab + blank), and torch.argmax(duration_logits, dim=-1) would produce meaningless duration values based on token logits rather than duration bins.
The same issue exists in both JointDecisionWrapper (individual_components.py:146) and JointDecisionSingleStep (individual_components.py:179). The code in convert-tdt-coreml.py:226-230 warns about num_extra == 0 but continues the export, producing a model that silently emits incorrect duration predictions.
Prompt for agents
Fix the -0: slicing bug in both JointDecisionWrapper.forward() (individual_components.py:146) and JointDecisionSingleStep.forward() (individual_components.py:179). When self.num_extra is 0, logits[..., -0:] returns all logits instead of an empty slice. Either:
1. Guard the duration slice: use logits[..., self.vocab_with_blank :] instead of logits[..., -self.num_extra :], which correctly returns an empty tensor when vocab_with_blank equals the total logit dimension. Or:
2. Add a conditional: if self.num_extra > 0, compute duration_logits normally; otherwise return a zeros tensor of the appropriate shape for duration.
Additionally, in convert-tdt-coreml.py:226-230, consider raising an error or skipping the JointDecision export entirely when num_extra == 0 rather than continuing with a broken duration head.
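The `-0:` pitfall and the first suggested guard can be reproduced in isolation (the tensor shape here is illustrative, not the model's real joint dimension):

```python
import torch

# Joint output whose last dim is vocab_with_blank + num_extra
logits = torch.randn(2, 3, 12)
vocab_with_blank = 12
num_extra = 0  # plain RNNT model: no TDT duration bins

# Buggy slice: -0: degenerates to 0:, so the "duration" slice is the whole tensor
buggy_duration = logits[..., -num_extra:]
print(buggy_duration.shape[-1])  # 12, not 0

# Guarded slice: start at vocab_with_blank; empty when there are no extra bins
fixed_duration = logits[..., vocab_with_blank:]
print(fixed_duration.shape[-1])  # 0
```

With `num_extra > 0` the two slices agree, which is why the bug only surfaces on non-TDT checkpoints.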
```python
    model: ct.models.MLModel, path: Path, description: str
) -> None:
    try:
        model.minimum_deployment_target = ct.target.iOS17
```
🔴 _save_mlpackage overwrites iOS18 deployment target with iOS17, creating invalid metadata
The _save_mlpackage function at convert-tdt-coreml.py:58 unconditionally sets model.minimum_deployment_target = ct.target.iOS17, but the TDT export converts all models with deployment_target=ct.target.iOS18 (convert-tdt-coreml.py:188). The README explicitly states "iOS 18 deployment target: Required for int ops in the encoder's positional encoding."
If coremltools allows this downgrade (the try/except may not catch it since it's just a metadata property), the saved .mlpackage will claim iOS 17 compatibility while containing iOS 18-specific operations. This would cause iOS 17 devices to attempt loading the model and fail at runtime with confusing errors, rather than getting a clear "requires iOS 18" rejection. This function was copied from convert-coreml.py:71 where iOS17 was the correct target.
```diff
- model.minimum_deployment_target = ct.target.iOS17
+ model.minimum_deployment_target = ct.target.iOS18
```
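A more general hardening than patching the constant would be to thread the deployment target through the save helper instead of hardcoding either value. The sketch below is hypothetical (a stub stands in for `ct.models.MLModel` so the pattern runs without coremltools):

```python
from pathlib import Path


class _StubModel:
    """Stand-in for ct.models.MLModel; only the attributes the helper touches."""

    def __init__(self):
        self.minimum_deployment_target = None
        self.short_description = ""

    def save(self, path: str) -> None:
        pass  # a real MLModel would write the .mlpackage here


def save_mlpackage(model, path: Path, description: str, target) -> None:
    # Preserve whatever target the caller converted with (iOS17 or iOS18)
    # rather than overwriting it with a fixed value after conversion.
    model.minimum_deployment_target = target
    model.short_description = description
    model.save(str(path))


model = _StubModel()
save_mlpackage(model, Path("encoder.mlpackage"), "TDT mel+encoder", "iOS18")
print(model.minimum_deployment_target)  # iOS18
```

This keeps the metadata consistent with the ops actually emitted at `ct.convert(...)` time, for both the CTC (iOS 17) and TDT (iOS 18) export paths.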
Was this helpful? React with 👍 or 👎 to provide feedback.
companion PR: FluidInference/FluidAudio#383
AI Disclosure
Claude Opus did most of the work