
feat: support parakeet-tdt-ctc-110m hybrid model #383

Open
JarbasAl wants to merge 1 commit into FluidInference:main from TigreGotico:feat/tdt-ctc-110m-support


JarbasAl commented Mar 16, 2026

Add AsrModelVersion.tdtCtc110m for the 110M parameter hybrid TDT-CTC model. Key differences from the 0.6B models:

  • Fused preprocessor+encoder (no separate Encoder.mlmodelc)
  • Smaller dimensions: encoderHidden=512, vocabSize=1024, 1 LSTM layer
  • Array-format vocabulary (vocab.json) instead of dict format
  • blankId=1024 (same as v2)
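
As an illustration of the vocabulary difference, here is a minimal sketch of a loader that accepts both layouts. It is not the code from this PR: `loadVocabulary` and `VocabError` are hypothetical names, and the dict layout is assumed to map a stringified token id to its token.

```swift
import Foundation

// Hypothetical error type for this sketch only.
enum VocabError: Error {
    case unsupportedFormat
}

/// Hypothetical helper (not FluidAudio's loader): returns a token-id -> token map
/// from either an array-format or a dict-format vocab.json payload.
func loadVocabulary(from data: Data) throws -> [Int: String] {
    let decoder = JSONDecoder()

    // Array format: the position in the array is the token id.
    if let tokens = try? decoder.decode([String].self, from: data) {
        var vocab: [Int: String] = [:]
        for (id, token) in tokens.enumerated() {
            vocab[id] = token
        }
        return vocab
    }

    // Dict format, ASSUMED here to look like {"0": "<unk>", "1": "a", ...}.
    if let table = try? decoder.decode([String: String].self, from: data) {
        var vocab: [Int: String] = [:]
        for (key, token) in table {
            guard let id = Int(key) else { throw VocabError.unsupportedFormat }
            vocab[id] = token
        }
        return vocab
    }

    throw VocabError.unsupportedFormat
}
```

Trying the array layout first keeps the existing dict path unchanged for the 0.6B models.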

Changes:

  • AsrModels: optional encoder, fused frontend loading, array vocab support
  • AsrManager: version-aware decoder state shapes, fused frontend availability
  • AsrTranscription: skip encoder step when preprocessor output is fused
  • TdtDecoderState: parameterized LSTM layer count
  • TdtDecoderV3: use config.encoderHiddenSize instead of auto-detection
  • EncoderFrameView: accept explicit hidden size parameter
  • TranscribeCommand: --model-version tdt-ctc-110m, --model-dir flags
  • ModelNames: parakeetTdtCtc110m repo, fused model requirements
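
To make "parameterized LSTM layer count" concrete, the sketch below shows one way a decoder state could size itself from a layer count instead of a hard-coded constant. `LstmStateSketch` is a hypothetical stand-in that uses plain Swift arrays; it is not the PR's `TdtDecoderState`, and the hidden size of 640 is a placeholder rather than a value taken from this PR.

```swift
/// Hypothetical illustration only: an LSTM state whose shape depends on
/// the number of layers passed in by the caller.
struct LstmStateSketch {
    let shape: [Int]      // [layers, batch, hiddenSize]
    var hidden: [Float]   // flattened hidden state, zero-initialized
    var cell: [Float]     // flattened cell state, zero-initialized

    init(layers: Int, hiddenSize: Int, batch: Int = 1) {
        precondition(layers > 0 && hiddenSize > 0 && batch > 0, "dimensions must be positive")
        shape = [layers, batch, hiddenSize]
        let count = layers * batch * hiddenSize
        hidden = [Float](repeating: 0, count: count)
        cell = [Float](repeating: 0, count: count)
    }
}

// The 110M model uses 1 LSTM layer per the description above;
// hiddenSize: 640 is a placeholder, not a confirmed dimension.
let smallState = LstmStateSketch(layers: 1, hiddenSize: 640)
```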

Companion PR: FluidInference/mobius#25

Why is this change needed?

better support for https://huggingface.co/nvidia/parakeet-tdt_ctc-110m

AI Disclosure

I had never worked with Swift before; Claude Opus did most of the work.



@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 potential issue.


Comment on lines 320 to 322
  switch models.version {
- case .v2:
+ case .v2, .tdtCtc110m:
      let decoder = TdtDecoderV2(config: config)

🔴 Missing encoderHiddenSize adaptation causes runtime crash for tdtCtc110m with default config

When AsrManager is created with ASRConfig.default (or any config that doesn't explicitly set encoderHiddenSize), it defaults to ASRConstants.encoderHiddenSize (1024). If then initialized with tdtCtc110m models (which produce encoder output with hidden size 512), transcription will fail at runtime with "Encoder hidden size mismatch" in EncoderFrameView (Sources/FluidAudio/ASR/TDT/EncoderFrameView.swift:32-33).

The blankId mismatch is handled by TdtDecoderV2.adaptConfigForV2 (Sources/FluidAudio/ASR/TDT/TdtDecoderV2.swift:55-74), but encoderHiddenSize is never adapted. AsrManager.initialize(models:) has the model version info (models.version.encoderHiddenSize returns 512 for 110m) but neither validates nor adapts the config. Since ASRConfig is stored as let on AsrManager, it cannot be corrected after init. This means the natural usage pattern AsrManager() → initialize(models: tdtCtc110m) silently accepts the mismatch and crashes only during transcription.

Example that triggers the crash

let models = try await AsrModels.downloadAndLoad(version: .tdtCtc110m)
let manager = AsrManager() // encoderHiddenSize defaults to 1024
try await manager.initialize(models: models)
let result = try await manager.transcribe(url) // CRASH in EncoderFrameView

Prompt for agents
In Sources/FluidAudio/ASR/AsrManager.swift, in the tdtDecodeWithTimings method (around line 306-348), the config is passed directly to TdtDecoderV2/V3 without adapting encoderHiddenSize based on the model version. Since models.version is already available at this point (line 317-319), create an adapted config that uses models.version.encoderHiddenSize before passing it to the decoder.

Specifically, around line 320-322, where `let decoder = TdtDecoderV2(config: config)` is called, replace `config` with a version that has the correct encoderHiddenSize from models.version.encoderHiddenSize. Same for the v3 case at line 335.

Alternatively, add validation in initialize(models:) at line 105-120 that throws an error if config.encoderHiddenSize != models.version.encoderHiddenSize, giving the user a clear error at initialization time rather than a cryptic error during transcription.
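
The validation alternative could look roughly like the sketch below. Every type here is a hypothetical stand-in (`ModelVersionSketch`, `ConfigSketch`, `HiddenSizeMismatch`), not FluidAudio's API; the sizes 512 and 1024 come from the description above, and treating v2 and v3 as 1024 is an assumption based on the default.

```swift
// Hypothetical stand-in for AsrModelVersion.
enum ModelVersionSketch {
    case v2, v3, tdtCtc110m

    var encoderHiddenSize: Int {
        switch self {
        case .tdtCtc110m: return 512
        case .v2, .v3: return 1024  // ASSUMED to match the default constant
        }
    }
}

// Hypothetical stand-in for ASRConfig; 1024 mirrors the default named in the review.
struct ConfigSketch {
    var encoderHiddenSize = 1024
}

// Error carrying both values so the message can be specific.
struct HiddenSizeMismatch: Error, Equatable {
    let configured: Int
    let expected: Int
}

/// Fails fast at initialization time instead of during transcription.
func validate(_ config: ConfigSketch, against version: ModelVersionSketch) throws {
    let expected = version.encoderHiddenSize
    guard config.encoderHiddenSize == expected else {
        throw HiddenSizeMismatch(configured: config.encoderHiddenSize, expected: expected)
    }
}
```

Calling a check like this from `initialize(models:)` would surface the mismatch before any audio is processed; adapting the config instead would make the default usage work without caller changes.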

Alex-Wengg commented Mar 16, 2026

@JarbasAl did you test this on iOS? We originally had a fused preprocessor+encoder, and it had incompatibility issues on iOS.

Also, what about the benchmarks?
