Problem Statement
The dictation tool works, but its architecture and verification surface have drifted from its production goals. Users need lower latency, better accuracy, and more efficient GPU usage, but the current implementation does not provide a trustworthy way to compare model upgrades, runtime dependency changes, VAD policies, or batching settings.
From the user's perspective, the tool should feel immediate: hold to speak, release, and get a single accurate paste with minimal delay. It should also run a cutting-edge Whisper-family model stack that balances speed and accuracy on modern Windows/CUDA machines, while remaining usable on other systems where possible.
The current system has several blockers to safe improvement: benchmarking is stale, some CLI/config surfaces are not wired or tested accurately, model selection is tightly coupled to the main engine, VAD ownership is unclear, and post-processing accuracy rules are embedded in the transcription path. This makes targeted performance work risky because there is no reliable baseline for latency, throughput, or transcript quality.
Solution
Build a production-grade performance and accuracy improvement program for the dictation tool. The work should first restore reliable measurement, then repair configuration and test drift, then introduce deeper modules for transcription, segmentation, and post-processing so model and runtime changes can be evaluated safely.
The intended outcome is a tool that can run a current Whisper-family model such as large-v3-turbo or distil-large-v3 where appropriate, while keeping the existing practical modes available for comparison. The final model choice should be evidence-based, using repeatable benchmarks and accuracy fixtures rather than assumptions.
The implementation should improve speed, efficiency, and latency without sacrificing reliability. It should also make future model upgrades easier by isolating the transcription backend behind a small interface and adding a repeatable benchmark suite.
User Stories
- As a dictation user, I want the tool to paste once per release, so that I do not get duplicated text.
- As a dictation user, I want release-to-paste latency to be consistently low, so that the tool feels responsive during normal writing.
- As a dictation user, I want high transcription accuracy for email dictation, so that I spend less time correcting names, punctuation, and formatting.
- As a dictation user, I want a fast default model profile, so that I can dictate short messages without waiting for a large model unnecessarily.
- As a dictation user, I want a maximum-accuracy model profile, so that I can choose quality over latency for important dictation.
- As a dictation user, I want a balanced model profile using current Whisper-family options, so that I get strong quality and speed without manual tuning.
- As a dictation user, I want explicit microphone selection to stay on the selected device, so that the app does not silently record from the wrong mic.
- As a dictation user, I want default microphone mode to work across common Windows devices, so that setup is easier on different systems.
- As a dictation user, I want clear guidance for recommended model choices, so that I know when to use medium.en, large-v3, large-v3-turbo, or distil-large-v3.
- As a dictation user, I want model startup time to be measured, so that I understand cold-start cost separately from dictation latency.
- As a dictation user, I want inference latency to be measured, so that model choices are based on real performance.
- As a dictation user, I want end-to-end hold-release latency to be measured, so that the benchmark reflects the experience I actually feel.
- As a dictation user, I want transcript quality to be measured on realistic samples, so that speed improvements do not quietly reduce accuracy.
- As a developer, I want a benchmark module with stable fixtures, so that every model or dependency change can be compared against a baseline.
- As a developer, I want benchmark output to include latency percentiles, so that one-off fast runs do not hide slow tail behavior.
- As a developer, I want benchmark output to include real-time factor, so that long and short audio workloads can be compared consistently.
- As a developer, I want benchmark output to include model load time, so that runtime startup regressions are visible.
- As a developer, I want benchmark output to include memory usage where practical, so that model choices account for user hardware limits.
- As a developer, I want an accuracy fixture set with reference transcripts, so that model upgrades can be checked for word and formatting regressions.
- As a developer, I want post-processing golden tests, so that email formatting and command cue behavior are protected.
- As a developer, I want the CLI contract to match configuration behavior, so that flags users pass are actually applied.
- As a developer, I want stale tests corrected or removed, so that the suite reflects the current implementation instead of old assumptions.
- As a developer, I want the transcription backend isolated behind a small interface, so that faster-whisper upgrades and model swaps are localized.
- As a developer, I want the VAD policy isolated behind a segmentation module, so that WebRTC VAD and Whisper VAD can be tested independently.
- As a developer, I want post-processing isolated behind a text cleanup module, so that formatting improvements do not require editing the engine.
- As a developer, I want profile output available through the CLI, so that production runs can be measured without writing custom code.
- As a developer, I want a model matrix benchmark, so that the current large-v3 and medium.en options can be compared fairly with large-v3-turbo and distil-large-v3.
- As a developer, I want compute mode comparisons, so that float16, int8, and int8_float16 trade-offs are visible.
- As a developer, I want dependency versions pinned coherently, so that installs from package metadata and requirements files do not diverge.
- As a developer, I want the CUDA and CTranslate2 compatibility assumptions documented, so that other users can install the correct runtime stack.
- As a developer, I want GPU smoke tests to remain separate from CPU unit tests, so that normal CI stays fast and GPU validation remains available.
- As a developer, I want runtime process cleanup guidance, so that orphaned dictation processes do not cause duplicate paste behavior.
- As a developer, I want queue back-pressure and audio drop behavior to be observable, so that missed audio can be diagnosed.
- As a developer, I want batching behavior to be benchmarked, so that latency and throughput trade-offs are explicit.
- As a developer, I want VAD silence thresholds to be benchmarked, so that clipped syllables and slow release behavior can be balanced.
- As a developer, I want model prompt behavior to be tested, so that domain terms and email formatting remain accurate across model changes.
- As a maintainer, I want a staged rollout plan, so that measurement improvements land before risky model changes.
- As a maintainer, I want each performance change to have before-and-after evidence, so that regressions are caught before release.
- As a maintainer, I want README recommendations to reflect measured results, so that users are not guided by stale latency claims.
- As a maintainer, I want changelog entries for model and runtime changes, so that users understand compatibility and behavior changes.
- As a user on a non-identical system, I want conservative fallback behavior, so that the app either works correctly or fails clearly.
- As a user with limited VRAM, I want a lower-memory model option, so that I can still use dictation without exhausting system resources.
- As a user with a high-end GPU, I want the tool to use the hardware efficiently, so that latency is as low as practical.
- As a user writing professional emails, I want post-processing to preserve paragraph and sign-off formatting, so that pasted output is ready to send.
- As a user dictating technical terms, I want prompts and vocabulary hints to remain effective, so that proper nouns and code terms are transcribed correctly.
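The measurement stories above (latency percentiles, real-time factor, and separate end-to-end timing) can be illustrated with a minimal sketch. The `RunSample` fields, the `summarize` function, and the output keys are hypothetical names chosen for illustration; they are not the tool's existing profiling schema. Real-time factor is taken here as inference time divided by audio duration, so values below 1.0 mean faster than real time.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSample:
    """One benchmark run (hypothetical schema, assumed for this sketch)."""
    audio_seconds: float             # duration of the recorded audio
    inference_seconds: float         # time spent inside the model call
    release_to_paste_seconds: float  # end-to-end latency the user feels


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples: list[RunSample]) -> dict[str, float]:
    """Collapse raw runs into the figures a benchmark report would print."""
    latencies = [s.release_to_paste_seconds for s in samples]
    total_audio = sum(s.audio_seconds for s in samples)
    total_inference = sum(s.inference_seconds for s in samples)
    return {
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "max_s": max(latencies),
        # Real-time factor over the whole workload, so long and short
        # clips are weighted by how much audio they contain.
        "rtf": total_inference / total_audio,
    }
```

Reporting p95 and max next to p50 is what keeps a single fast run from hiding slow tail behavior. Model load time and memory usage would be recorded as separate fields rather than folded into these latencies.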
Implementation Decisions
- Build or restore a dedicated performance measurement module before changing the default model stack.
- Define benchmark scenarios for cold start, warm inference, short utterances, email-length utterances, and longer hold-to-record dictation.
- Add an accuracy fixture set with reference transcripts covering general dictation, professional email, names, URLs, punctuation commands, paragraph commands, and technical vocabulary.
- Introduce a transcription backend module with a small interface for loading a model and transcribing audio.
- Keep faster-whisper as the first backend to evaluate, but make the interface deep enough to support comparing newer faster-whisper models and runtime settings without editing the engine orchestration.
- Compare at least the current default model, current fast practical model, large-v3-turbo, and distil-large-v3.
- Treat large-v3-turbo as the leading balanced candidate because it keeps the large-v3 encoder but cuts the decoder from 32 layers to 4, which targets much faster inference while retaining multilingual support.
- Treat distil-large-v3 as a high-priority English-only candidate because it can offer strong speed with competitive accuracy for English dictation.
- Do not replace the default model until benchmark and accuracy evidence supports the change.
- Introduce a segmentation policy module to own whether audio is segmented by WebRTC VAD, faster-whisper VAD, both, or neither.
- Benchmark duplicate VAD policies because double filtering may add latency or trim speech unnecessarily.
- Keep explicit microphone selection fail-closed while preserving default-device fallback behavior for portability.
- Introduce a post-processing module for command cues, email formatting, URL cleanup, punctuation normalization, and line capitalization.
- Keep post-processing deterministic and covered by golden tests.
- Repair CLI and configuration drift so every documented performance flag maps to runtime behavior.
- Restore the documented benchmark command or remove the advertised interface if a replacement command is chosen.
- Add a CLI-accessible profiling option so production runs can emit timing data without custom code.
- Align dependency declarations so the package metadata and requirements install the same supported runtime stack.
- Document CUDA, CTranslate2, PyTorch, and faster-whisper compatibility constraints for Windows users.
- Keep GPU-specific validation separate from fast unit tests.
- Add process cleanup guidance or tooling if orphaned dictation processes remain a recurring operational issue.
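As a rough sketch of the "small interface for loading a model and transcribing audio" decision above, the backend boundary could look like the following. Every name here (`BackendConfig`, `Transcript`, `TranscriptionBackend`, `FakeBackend`, `run_once`) is hypothetical, and audio is typed as a plain sequence of floats to keep the example dependency-free; the real tool would more likely pass a NumPy array.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Sequence


@dataclass(frozen=True)
class BackendConfig:
    """Settings a benchmark would vary between runs (assumed field names)."""
    model_name: str               # e.g. "large-v3-turbo" or "distil-large-v3"
    device: str = "cuda"
    compute_type: str = "float16"


@dataclass(frozen=True)
class Transcript:
    text: str
    inference_seconds: float


class TranscriptionBackend(Protocol):
    """The whole surface the engine is allowed to depend on."""

    def load(self, config: BackendConfig) -> None: ...

    def transcribe(self, audio: Sequence[float], sample_rate: int) -> Transcript: ...


class FakeBackend:
    """Deterministic stand-in for unit tests; never touches a GPU."""

    def __init__(self, canned_text: str) -> None:
        self._canned_text = canned_text
        self.loaded_config: Optional[BackendConfig] = None

    def load(self, config: BackendConfig) -> None:
        self.loaded_config = config

    def transcribe(self, audio: Sequence[float], sample_rate: int) -> Transcript:
        if self.loaded_config is None:
            raise RuntimeError("load() must be called before transcribe()")
        return Transcript(text=self._canned_text, inference_seconds=0.0)


def run_once(backend: TranscriptionBackend, config: BackendConfig,
             audio: Sequence[float]) -> str:
    """Engine-side usage: load, transcribe, return text."""
    backend.load(config)
    return backend.transcribe(audio, sample_rate=16_000).text
```

A faster-whisper adapter would implement the same two methods, so a model or compute-type swap becomes a `BackendConfig` change rather than an edit to engine orchestration.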
Testing Decisions
- Good tests should validate externally observable behavior: transcript text, formatting output, latency measurements, model selection, CLI contract, and process behavior.
- Avoid tests that only assert internal helper calls unless the helper is itself a stable module interface.
- Test the benchmark module with small deterministic fixtures and schema assertions for timing output.
- Test the transcription backend module with fake adapters for unit tests and real faster-whisper adapters in smoke or integration tests.
- Test the segmentation module with controlled audio arrays that prove start, stop, silence, and no-speech behavior.
- Test the post-processing module with golden input-output examples for email formatting, command cues, URLs, email addresses, sign-offs, bullets, and capitalization.
- Test CLI and configuration as a contract: every documented flag must map to the expected runtime setting.
- Add accuracy regression tests that compare fixture transcripts against expected output using exact matches for post-processing and a tolerant WER-style metric for model output.
- Add latency benchmark tests that are not normal unit tests but can be run locally or in a GPU-enabled environment.
- Preserve fast CPU unit tests for developer feedback.
- Preserve GPU smoke tests for model-load and inference sanity.
- Add a model matrix benchmark that can be run manually before changing the default model recommendation.
- Use existing mocked engine tests as prior art for control-flow tests, but update or remove stale expectations that no longer match the current implementation.
- Use existing audio I/O tests as prior art for microphone fallback and resampling behavior.
- Use the existing JSONL profiling approach as prior art for timing output, but make it accessible and documented through the benchmark interface.
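The "tolerant WER-style metric" mentioned above could be sketched as a word-level edit distance divided by the reference length. This is one common definition of word error rate, not necessarily the metric the project will adopt. The function names are hypothetical, and the normalization (lowercasing and whitespace splitting only) is an assumption; punctuation handling would need its own decision.

```python
def _words(text: str) -> list[str]:
    # Assumed normalization: case-insensitive, whitespace-delimited tokens.
    return text.lower().split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = _words(reference), _words(hypothesis)
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only one previous row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def within_tolerance(reference: str, hypothesis: str, max_wer: float) -> bool:
    """Regression check for model output; post-processing still uses exact matches."""
    return word_error_rate(reference, hypothesis) <= max_wer
```

For example, a fixture could call `within_tolerance(reference, model_output, max_wer=0.1)` for raw model output, while the post-processing golden tests keep strict string equality.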
Out of Scope
- Building a full GUI.
- Replacing the application with a cloud transcription service.
- Supporting every non-Windows platform as a first-class target in this PRD.
- Guaranteeing one model is best before benchmark evidence exists.
- Training or fine-tuning a custom speech model.
- Rewriting the entire dictation engine in another language.
- Adding enterprise deployment packaging unless benchmark evidence shows packaging is a bottleneck.
- Changing the clipboard and paste UX beyond what is required for latency, duplication prevention, and process reliability.
Further Notes
This work should be delivered in stages. The first stage should restore measurement and fix verification drift. The second stage should introduce deep modules around transcription, segmentation, and post-processing. The third stage should run model and runtime comparisons and update defaults only after evidence is available.
The core architectural direction is to increase locality and leverage: the engine should orchestrate the dictation flow, while specialized modules own model inference, segmentation policy, benchmarking, profiling, and post-processing. That will make the tool easier to tune as Whisper-family models and faster-whisper runtimes continue to improve.
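The division of responsibility described above might be sketched as an engine that only wires injected collaborators together. `DictationEngine`, its field names, and `on_release` are hypothetical illustrations under that assumption, not the current engine's API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Audio = Sequence[float]


@dataclass
class DictationEngine:
    """Orchestrates one hold-release cycle; owns no model or formatting logic."""
    segment: Callable[[Audio], list[Audio]]  # segmentation policy module
    transcribe: Callable[[Audio], str]       # transcription backend module
    postprocess: Callable[[str], str]        # text cleanup module
    paste: Callable[[str], None]             # clipboard/paste side effect

    def on_release(self, recording: Audio) -> str:
        pieces = [self.transcribe(chunk) for chunk in self.segment(recording)]
        text = self.postprocess(" ".join(p for p in pieces if p))
        if text:
            # At most one paste per release, regardless of segment count.
            self.paste(text)
        return text
```

Because each collaborator is a plain callable, unit tests can substitute fakes for all four and assert on the pasted text without loading a model.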