
PRD: Complete live modern dictation engine evaluation #3

@future3OOO

Problem Statement

The dictation performance PRD is not complete until the modern Whisper-family engines are installed and live-tested on the target Windows/CUDA runtime. The current branch implements the measurement path and architecture seams, but the active environment still runs faster-whisper 1.1.1 while the latest upstream release is 1.2.1. As a result, large-v3-turbo and distil-large-v3 have not yet been validated in the actual runtime.

From the user's perspective, the tool should not claim a modern model-stack upgrade until the app has proven which engine is fastest and accurate enough on the machine it actually runs on.

Solution

Complete a live modern-engine evaluation pass. Upgrade the active virtual environment to the latest supported faster-whisper runtime, verify the CUDA/CTranslate2 library stack, run the benchmark matrix against the current and candidate models, then update the recommended default/profile only from measured evidence.
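
As a starting point for the upgrade step, a minimal sketch of a staleness check for the active virtual environment is shown below. The target versions are the ones quoted in this issue's notes; the helper names (`parse_version`, `installed_version`, `stale_packages`) are hypothetical and not part of the app. The version parsing is deliberately simplified and only handles plain dotted release numbers.

```python
import re
from importlib import metadata

# Target versions as quoted in this issue; adjust as upstream moves.
TARGET_VERSIONS = {"faster-whisper": "1.2.1", "ctranslate2": "4.7.1"}


def parse_version(text):
    """Turn '1.2.1' into (1, 2, 1), ignoring a local suffix such as '+cu121'."""
    core = text.split("+", 1)[0]
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def stale_packages(targets=TARGET_VERSIONS, lookup=installed_version):
    """Return {package: (installed, target)} for missing or older packages."""
    stale = {}
    for package, target in targets.items():
        found = lookup(package)
        if found is None or parse_version(found) < parse_version(target):
            stale[package] = (found, target)
    return stale


if __name__ == "__main__":
    for package, (found, target) in stale_packages().items():
        print(f"{package}: installed={found} target={target}")
```

Run inside the active venv, this prints one line per package that is missing or behind the target. The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.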

The evaluation should compare the current accuracy baseline, the current practical fast baseline, the modern balanced candidate, and the modern English-speed candidate. The result should be a recommendation, not an assumption.

User Stories

  1. As a dictation user, I want the default model choice to be based on live measurements, so that speed improvements do not reduce real accuracy.
  2. As a dictation user, I want large-v3-turbo tested on the local GPU, so that I know whether it is actually better for short dictation.
  3. As a dictation user, I want distil-large-v3 tested on English dictation, so that I can choose a lower-latency English profile when appropriate.
  4. As a dictation user, I want the current large-v3 baseline preserved during evaluation, so that the upgrade has a fair accuracy comparison.
  5. As a dictation user, I want medium.en kept in the comparison, so that practical speed is compared against a known fast baseline.
  6. As a developer, I want the active environment upgraded to the latest supported faster-whisper release, so that model support is not blocked by stale dependencies.
  7. As a developer, I want CTranslate2 CUDA compatibility verified, so that model failures are identified as runtime-stack issues rather than app bugs.
  8. As a developer, I want benchmark output captured as JSONL, so that results can be compared and retained.
  9. As a developer, I want model load time measured separately from inference time, so that cold-start and release-to-paste behavior are not confused.
  10. As a developer, I want transcript quality measured on real speech, so that synthetic benchmark audio does not hide accuracy regressions.
  11. As a developer, I want VAD policy measured with modern engines, so that WebRTC VAD and faster-whisper VAD do not double-filter speech unnecessarily.
  12. As a maintainer, I want README recommendations updated only after measured results, so that users are not guided by stale claims.
  13. As a maintainer, I want dependency constraints aligned with the evaluated runtime, so that installs reproduce the tested stack.
  14. As a maintainer, I want any modern-engine blocker recorded explicitly, so that incomplete evaluation is not mistaken for completion.
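
Stories 8 and 9 can be sketched together: time model load and inference separately, then append one JSON object per run to a JSONL file. This is an illustration under assumptions, not the branch's actual measurement path. `load_model` and `transcribe` are hypothetical callables supplied by the caller, and the record field names are invented for this example.

```python
import json
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def benchmark_once(model_name, load_model, transcribe, audio):
    """Measure load and inference as two separate timings for one model."""
    model, load_seconds = timed(load_model, model_name)
    transcript, inference_seconds = timed(transcribe, model, audio)
    return {
        "model": model_name,
        "load_seconds": round(load_seconds, 4),
        "inference_seconds": round(inference_seconds, 4),
        "transcript": transcript,
    }


def append_jsonl(path, record):
    """Append one record as a single JSON line so runs can be retained."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

One point to watch when wiring in the real backend: faster-whisper's `transcribe` returns its segments as a generator, so decoding only happens while the segments are iterated. The `transcribe` callable passed in here should consume all segments before returning, otherwise `inference_seconds` will understate the real cost.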

Implementation Decisions

  • Treat live model evaluation as a required completion gate for the dictation performance PRD.
  • Upgrade the active runtime from faster-whisper 1.1.1 to the latest supported faster-whisper release before testing modern candidates.
  • Verify the CTranslate2, CUDA, cuBLAS, and cuDNN runtime assumptions before interpreting benchmark failures.
  • Compare the current accuracy baseline, current fast baseline, large-v3-turbo candidate, and distil-large-v3 candidate.
  • Keep the transcription backend seam, but treat it as hypothetical while it has only one adapter; a second adapter or a model-specific policy is needed to prove it.
  • Keep the default model unchanged until live benchmark and accuracy evidence supports changing it.
  • Treat VAD ownership as unresolved until measured against modern engines.
  • Treat unused CLI/config tuning knobs as architectural debt unless they are either wired into runtime behavior or removed from the interface.
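
The "keep the default unchanged until evidence supports changing it" decision could be expressed as an explicit gate. The sketch below is one possible shape. The thresholds (`max_wer_increase`, `min_speedup`) and the result fields (`wer`, `latency_s`) are assumptions made for illustration, not values or names this PRD has agreed on.

```python
def recommend_default(current, results, max_wer_increase=0.0, min_speedup=1.1):
    """Return the model to recommend, falling back to `current` without evidence.

    results maps model name -> {"wer": float, "latency_s": float}.
    A candidate qualifies only if its word error rate is no worse than the
    current default's (plus the allowed increase) and it is faster by at
    least `min_speedup`. Among qualifying candidates the fastest wins.
    """
    baseline = results.get(current)
    if baseline is None:
        # No measurement for the current default means no fair comparison.
        return current

    best_name = current
    best_latency = baseline["latency_s"]
    for name, measured in results.items():
        if name == current:
            continue
        accurate_enough = measured["wer"] <= baseline["wer"] + max_wer_increase
        fast_enough = baseline["latency_s"] / measured["latency_s"] >= min_speedup
        if accurate_enough and fast_enough and measured["latency_s"] < best_latency:
            best_name = name
            best_latency = measured["latency_s"]
    return best_name
```

The design choice worth noting is the fallback: missing or partial measurements always resolve to the existing default, which keeps an incomplete evaluation from silently changing behavior.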

Testing Decisions

  • Live model testing must run with the real faster-whisper backend, not only mocks.
  • GPU smoke testing should continue to be separate from fast CPU unit tests.
  • The model matrix should include latency, model load time, real-time factor, memory pressure where practical, and transcript quality.
  • Synthetic sine benchmark data is acceptable for plumbing but not sufficient for accuracy or model recommendation.
  • At least one real-speech fixture or live dictation sample should be used for each candidate model.
  • VAD testing should cover three configurations where practical: WebRTC VAD combined with faster-whisper VAD, WebRTC VAD only, and faster-whisper VAD only.
  • Existing CPU gates remain required: unit tests, formatting, linting, and type checking.
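
To make the matrix and its metrics concrete, here is a rough sketch that enumerates model and VAD combinations and computes real-time factor and a word-level error rate. The VAD policy labels and function names are hypothetical. The error rate uses plain whitespace tokenization with lowercasing, which is cruder than a proper normalizer and should be read as a placeholder.

```python
from itertools import product

MODELS = ["large-v3", "medium.en", "large-v3-turbo", "distil-large-v3"]
# Hypothetical labels for the three VAD configurations under test.
VAD_POLICIES = ["webrtc+model", "webrtc-only", "model-only"]


def evaluation_matrix(models=MODELS, vad_policies=VAD_POLICIES):
    """Return one run description per (model, VAD policy) pair."""
    return [{"model": m, "vad": v} for m, v in product(models, vad_policies)]


def real_time_factor(inference_seconds, audio_seconds):
    """Inference time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Row-by-row Levenshtein distance over words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(ref)
```

With four models and three VAD policies the matrix has twelve runs, so a single real-speech fixture per model is the minimum that keeps the pass affordable while still covering every candidate.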

Out of Scope

  • Training or fine-tuning a custom model.
  • Replacing faster-whisper with a cloud transcription provider.
  • Building a GUI.
  • Changing the default model before measured results exist.
  • Making large model downloads mandatory for ordinary CPU CI.

Further Notes

Current investigation found the active venv is behind upstream: faster-whisper 1.1.1 installed versus 1.2.1 latest, and CTranslate2 4.6.0 installed versus 4.7.1 latest. The local GPU is available as an NVIDIA GeForce RTX 3080 with driver 591.74, which reports CUDA 13.1 driver capability, while the installed PyTorch is 2.5.1+cu121 (a CUDA 12.1 build).

Upstream faster-whisper documents CUDA 12 plus cuDNN 9 requirements for latest CTranslate2 releases. Windows users may need compatible cuBLAS/cuDNN libraries on PATH before CTranslate2 CUDA inference works. If that stack fails, record it as a runtime dependency blocker rather than changing app defaults.
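
Before attributing a failure to the app, a quick PATH scan can show whether the CUDA libraries are discoverable at all. The sketch below assumes the Windows DLL names `cublas64_12.dll` and `cudnn64_9.dll` for CUDA 12 cuBLAS and cuDNN 9; treat those names as an assumption and confirm them against whatever library the CTranslate2 error message actually names. The helper names are hypothetical.

```python
import os
from pathlib import Path

# Assumed DLL names for CUDA 12 cuBLAS and cuDNN 9 on Windows.
REQUIRED_DLLS = ["cublas64_12.dll", "cudnn64_9.dll"]


def find_on_path(filename, path_value=None):
    """Return the first PATH directory entry containing filename, or None."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None


def missing_dlls(required=REQUIRED_DLLS, path_value=None):
    """List required libraries that are not discoverable through PATH."""
    return [name for name in required if find_on_path(name, path_value) is None]


if __name__ == "__main__":
    missing = missing_dlls()
    if missing:
        print("Runtime dependency blocker, not found on PATH:", ", ".join(missing))
    else:
        print("All assumed CUDA libraries were found on PATH.")
```

A clean result here only shows the files are reachable, not that their versions match. As a follow-up, `ctranslate2.get_cuda_device_count()` can confirm whether CTranslate2 itself sees the GPU.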
