feat: support parakeet-tdt-ctc-110m hybrid model #383
JarbasAl wants to merge 1 commit into FluidInference:main
Conversation
Add AsrModelVersion.tdtCtc110m for the 110M parameter hybrid TDT-CTC model.

Key differences from the 0.6B models:
- Fused preprocessor+encoder (no separate Encoder.mlmodelc)
- Smaller dimensions: encoderHidden=512, vocabSize=1024, 1 LSTM layer
- Array-format vocabulary (vocab.json) instead of dict format
- blankId=1024 (same as v2)

Changes:
- AsrModels: optional encoder, fused frontend loading, array vocab support
- AsrManager: version-aware decoder state shapes, fused frontend availability
- AsrTranscription: skip encoder step when preprocessor output is fused
- TdtDecoderState: parameterized LSTM layer count
- TdtDecoderV3: use config.encoderHiddenSize instead of auto-detection
- EncoderFrameView: accept explicit hidden size parameter
- TranscribeCommand: --model-version tdt-ctc-110m, --model-dir flags
- ModelNames: parakeetTdtCtc110m repo, fused model requirements
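The per-version differences listed above could be expressed as computed properties on the version enum. This is a hedged sketch only; the names mirror the PR description, but the real `AsrModelVersion` API in FluidAudio may differ:

```swift
// Illustrative sketch, not the PR's actual code: model-specific constants
// hang off the version enum so callers never hard-code dimensions.
enum AsrModelVersion {
    case v2, v3, tdtCtc110m

    /// Encoder output hidden size; 512 for the 110M model, 1024 for the 0.6B models.
    var encoderHiddenSize: Int {
        self == .tdtCtc110m ? 512 : 1024
    }

    /// Number of LSTM layers in the decoder; the 110M model uses a single layer.
    var lstmLayerCount: Int {
        self == .tdtCtc110m ? 1 : 2  // 2 assumed for the 0.6B models
    }

    /// Whether preprocessor and encoder ship as one fused .mlmodelc
    /// (the 110M package has no separate Encoder.mlmodelc).
    var hasFusedFrontend: Bool {
        self == .tdtCtc110m
    }
}
```

With constants centralized like this, loading code can branch on `models.version.hasFusedFrontend` instead of probing the filesystem for a missing Encoder.mlmodelc.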
```diff
 switch models.version {
-case .v2:
+case .v2, .tdtCtc110m:
     let decoder = TdtDecoderV2(config: config)
```
🔴 Missing encoderHiddenSize adaptation causes runtime crash for tdtCtc110m with default config
When AsrManager is created with ASRConfig.default (or any config that doesn't explicitly set encoderHiddenSize), it defaults to ASRConstants.encoderHiddenSize (1024). If the manager is then initialized with tdtCtc110m models (whose encoder output has hidden size 512), transcription will fail at runtime with "Encoder hidden size mismatch" in EncoderFrameView (Sources/FluidAudio/ASR/TDT/EncoderFrameView.swift:32-33).
The blankId mismatch is handled by TdtDecoderV2.adaptConfigForV2 (Sources/FluidAudio/ASR/TDT/TdtDecoderV2.swift:55-74), but encoderHiddenSize is never adapted. AsrManager.initialize(models:) has the model version info (models.version.encoderHiddenSize returns 512 for the 110m model) but neither validates nor adapts the config. Since ASRConfig is stored as a let on AsrManager, it cannot be corrected after init. This means the natural usage pattern AsrManager() → initialize(models: tdtCtc110m) silently accepts the mismatch and crashes only during transcription.
Example that triggers the crash:

```swift
let models = try await AsrModels.downloadAndLoad(version: .tdtCtc110m)
let manager = AsrManager()  // encoderHiddenSize defaults to 1024
try await manager.initialize(models: models)
let result = try await manager.transcribe(url)  // CRASH in EncoderFrameView
```
Prompt for agents
In Sources/FluidAudio/ASR/AsrManager.swift, in the tdtDecodeWithTimings method (around line 306-348), the config is passed directly to TdtDecoderV2/V3 without adapting encoderHiddenSize based on the model version. Since models.version is already available at this point (line 317-319), create an adapted config that uses models.version.encoderHiddenSize before passing it to the decoder.
Specifically, around line 320-322, where `let decoder = TdtDecoderV2(config: config)` is called, replace `config` with a version that has the correct encoderHiddenSize from models.version.encoderHiddenSize. Same for the v3 case at line 335.
Alternatively, add validation in initialize(models:) at line 105-120 that throws an error if config.encoderHiddenSize != models.version.encoderHiddenSize, giving the user a clear error at initialization time rather than a cryptic error during transcription.
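Both options suggested above can be sketched briefly. This is a hypothetical illustration, not the repository's actual code: it assumes `ASRConfig` has a mutable (or copyable) `encoderHiddenSize` field and invents an `ASRError.configMismatch` case for the validation path.

```swift
// Option 1 (sketch): adapt the config to the loaded model version before
// constructing the decoder, mirroring what adaptConfigForV2 does for blankId.
func adaptedConfig(_ config: ASRConfig, for version: AsrModelVersion) -> ASRConfig {
    guard config.encoderHiddenSize != version.encoderHiddenSize else { return config }
    var adapted = config  // assumes ASRConfig is a value type with a settable field
    adapted.encoderHiddenSize = version.encoderHiddenSize  // e.g. 512 for .tdtCtc110m
    return adapted
}

// Option 2 (sketch): fail fast in initialize(models:) with a clear error
// instead of crashing later inside EncoderFrameView during transcription.
func validate(_ config: ASRConfig, against version: AsrModelVersion) throws {
    guard config.encoderHiddenSize == version.encoderHiddenSize else {
        // ASRError.configMismatch is a hypothetical error case for illustration.
        throw ASRError.configMismatch(
            "config.encoderHiddenSize (\(config.encoderHiddenSize)) does not match "
            + "the model's encoder hidden size (\(version.encoderHiddenSize))")
    }
}
```

Option 1 keeps the "it just works" ergonomics of the default config; Option 2 surfaces the mismatch at init time, which is preferable if silently overriding a user-supplied config value is considered surprising.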
@JarbasAl did you test this on iOS? We originally had a fused preprocessor+encoder and it had incompatibility issues on iOS. Also, what about the benchmarks?
Companion PR: FluidInference/mobius#25
Why is this change needed?
Better support for https://huggingface.co/nvidia/parakeet-tdt_ctc-110m
AI Disclosure
I have never worked with Swift before; Claude Opus did most of the work.