fix(engine): stabilize dictation capture and release paste #1

Open
future3OOO wants to merge 1 commit into master from fix/dictation-release-paste

future3OOO (Owner) commented May 2, 2026

Summary

  • Prevent mouse-hold dictation from transcribing overlapping VAD batches before release.
  • Harden Windows microphone fallback while keeping explicit input devices fail-closed.
  • Add regression coverage for release paste duplication and resampled audio callback behavior.

Test plan

  • .venv310\Scripts\python.exe -m pytest tests/test_engine_advanced.py::TestAdvancedDictationEngine::test_flush_hold_does_not_duplicate_vad_tail tests/test_engine_advanced.py::TestAdvancedDictationEngine::test_hold_mode_does_not_transcribe_vad_batches_before_release -q
  • .venv310\Scripts\python.exe -m pytest tests/test_io.py -q
  • .venv310\Scripts\python.exe -m black --check dictation_tool/engine.py dictation_tool/io.py tests/test_engine_advanced.py tests/test_io.py
  • git diff --check

Summary by CodeRabbit

  • Documentation

    • Updated latency specifications for the medium.en model (now ~5–200 ms instead of ~5–20 ms).
  • Improvements

    • Enhanced audio device compatibility with automatic fallback when the requested sample rate is unavailable.
    • Improved audio resampling for better cross-platform device support.
    • Expanded Windows device enumeration options.

Prevent mouse-hold dictation from transcribing overlapping VAD batches before release, and harden Windows microphone fallback so default devices are portable while explicit devices fail closed.

Co-authored-by: Cursor <cursoragent@cursor.com>

coderabbitai Bot commented May 2, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 67dc6d77-bc63-4bdb-9865-835e0881173b

📥 Commits

Reviewing files that changed from the base of the PR and between 37b576c and 988adcb.

📒 Files selected for processing (5)
  • README.md
  • dictation_tool/engine.py
  • dictation_tool/io.py
  • tests/test_engine_advanced.py
  • tests/test_io.py

Walkthrough

This PR refactors the audio input pipeline by improving VADGate segment creation with dedicated helper methods, enhancing AudioStream with device fallback selection and automatic resampling in the callback, and adjusting engine VAD flush behavior to avoid duplicating tail samples. Tests validate device selection logic and VAD buffer behavior. Documentation updates latency expectations for the medium.en model.

Changes

Audio I/O & VAD Pipeline Refactoring

| Layer / File(s) | Summary |
|---|---|
| Interface Updates (dictation_tool/io.py) | AudioStream.__init__ now accepts input_device as str, int, or None (previously str or None), enabling numeric device IDs alongside names. |
| VADGate Segment Creation (dictation_tool/io.py) | VADGate._process_frame refactored to delegate segment construction to new _create_segment() and _reset_for_next_utterance() helper methods, centralizing buffer-to-array concatenation logic. |
| AudioStream Device & Resampling (dictation_tool/io.py) | _open_stream introduced with two-pass device opening (target rate first, fallback to native rate with callback resampling); _all_input_devices added for Windows WASAPI-first ordering; callback now resamples native-rate chunks to target rate via np.interp before queueing. |
| Engine VAD Flush Coordination (dictation_tool/engine.py) | _flush_hold() VAD path now calls force_flush() but discards its tail when _raw_shadow is non-empty; only appends tail when shadow is empty, preventing duplicate VAD samples. |
| Test Validation (tests/test_engine_advanced.py, tests/test_io.py) | New tests verify VAD tail deduplication, hold-mode transcription blocking, device fallback fail-closure, Windows default-device fallback, and callback resampling to target rate; VADGate parameter tests refactored to check pre_buffer_chunks/post_buffer_chunks instead of padding_ms. |
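
The callback resampling described in the table can be illustrated with a minimal sketch. It assumes mono float chunks and uses a hypothetical helper name, resample_linear, that does not come from the PR. np.interp performs plain linear interpolation with no anti-aliasing filter, so this shows the technique rather than the PR's exact implementation.

```python
import numpy as np


def resample_linear(chunk: np.ndarray, native_rate: int, target_rate: int) -> np.ndarray:
    """Linearly resample a mono chunk from native_rate to target_rate.

    Hypothetical helper for illustration; not taken from dictation_tool/io.py.
    """
    if native_rate == target_rate or chunk.size == 0:
        # Nothing to do: only normalize the dtype.
        return chunk.astype(np.float32, copy=False)

    # Number of output samples that covers the same duration at the target rate.
    n_out = max(1, int(round(chunk.size * target_rate / native_rate)))

    # Positions of input and output samples, both in input-sample units.
    x_old = np.arange(chunk.size, dtype=np.float64)
    x_new = np.linspace(0.0, chunk.size - 1, num=n_out)

    return np.interp(x_new, x_old, chunk).astype(np.float32)
```

A 10 ms chunk captured at 48 kHz (480 samples) comes out as 160 samples at 16 kHz, which is the kind of shape and dtype check the new callback test is described as making.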

Documentation Update

| Layer / File(s) | Summary |
|---|---|
| README Latency Expectations (README.md) | "Maximum speed (medium.en + prompt tricks)" section updated from ~5–20 ms to ~5–200 ms interface latency. |

Sequence Diagram

sequenceDiagram
    participant Client
    participant AudioStream
    participant Device Selection
    participant Native Device
    participant Resampler
    participant VADGate
    participant Engine

    Client->>AudioStream: open_stream(target_rate)
    AudioStream->>Device Selection: attempt target_rate device
    alt Device supports target rate
        Device Selection->>Native Device: open at target_rate
    else Fallback to native
        Device Selection->>Native Device: open at native_rate
    end
    
    Native Device->>Resampler: callback(native_rate_chunk)
    Resampler->>Resampler: np.interp if native ≠ target
    Resampler->>VADGate: _process_frame(target_rate_chunk)
    
    alt Speech detected
        VADGate->>VADGate: _create_segment()
        VADGate->>Engine: segment (no duplicate tail)
    else Silence buffered
        VADGate->>VADGate: accumulate in pre_buffer
    end
    
    Engine->>Engine: _flush_hold() with dedup logic
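
The two-pass opening shown in the diagram can be sketched without any audio library by injecting the function that opens a stream. Everything here is an assumption for illustration: open_with_rate_fallback and open_fn are hypothetical names, and the real _open_stream may differ in how it detects an unsupported rate.

```python
from typing import Callable, Optional, Tuple, TypeVar

S = TypeVar("S")


def open_with_rate_fallback(
    open_fn: Callable[[int], S],
    target_rate: int,
    native_rate: Optional[int],
) -> Tuple[S, int]:
    """Try target_rate first, then native_rate; return (stream, actual_rate).

    open_fn is a hypothetical callable that opens a stream at the given
    sample rate and raises if the device rejects that rate.
    """
    try:
        return open_fn(target_rate), target_rate
    except Exception as first_error:
        # No distinct native rate to fall back to: surface the original error.
        if not native_rate or native_rate == target_rate:
            raise
        try:
            # Second pass: open at the device's own rate; the caller is then
            # responsible for resampling each chunk to target_rate.
            return open_fn(native_rate), native_rate
        except Exception as second_error:
            raise second_error from first_error
```

Returning the actual rate alongside the stream lets the callback decide whether resampling is needed at all.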

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A clearer path through the digital streams
Device fallbacks and resampling dreams,
VADGate speaks true without echoed tail,
Audio flow now won't ever fail! 🎙️✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.04% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the primary fix: stabilizing dictation capture and release in paste operations, which aligns directly with the main objectives of preventing VAD batch duplication and hardening microphone fallback. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



greptile-apps Bot commented May 2, 2026


Greptile Summary

This PR fixes two hold-mode correctness bugs: overlapping VAD tail duplication on release (by restructuring _flush_hold to discard the VAD tail when raw shadow data is present, while still calling force_flush() for state reset), and premature VAD batch transcription during hold (by continue-ing past batch.append so the post-loop flush never fires on stale audio). It also adds a Windows microphone fallback that tries the device's native sample rate + np.interp resampling when the target rate is unsupported, keeping explicit device selection fail-closed (no silent recording from a different mic). Regression tests cover both engine fixes and all three AudioStream fallback paths.
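
The release-path deduplication can be sketched as a pure function. This is an illustration under stated assumptions, not the PR's code: flush_hold_samples is a hypothetical name, and force_flush is assumed to return a list of NumPy arrays holding the VAD tail.

```python
from typing import Callable, List

import numpy as np


def flush_hold_samples(
    raw_shadow: List[np.ndarray],
    force_flush: Callable[[], List[np.ndarray]],
) -> np.ndarray:
    """Return the audio to transcribe on release without duplicating the VAD tail."""
    # Always flush so the VAD state resets for the next utterance.
    vad_tail = force_flush()

    if raw_shadow:
        # The shadow already holds every raw chunk captured during the hold,
        # so appending the VAD tail would repeat the same samples.
        chunks = list(raw_shadow)
        raw_shadow.clear()
    else:
        # No shadow data: the VAD tail is the only audio available.
        chunks = list(vad_tail)

    if not chunks:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(chunks).astype(np.float32, copy=False)
```

Calling force_flush() unconditionally and then choosing which buffer to keep separates the state reset from the choice of audio, which is the distinction the summary draws.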

Confidence Score: 4/5

Safe to merge — no correctness regressions found; one benign P2 in the WASAPI sort key.

All P2s only. Core hold-mode and resampling logic is correct, _flush_hold state-machine transition is sound, and both new test suites provide meaningful regression coverage. The only finding is a wrong dict key (d.get("index", 0)) in the WASAPI sort secondary key, which is benign due to Python's stable sort.

dictation_tool/io.py: the _all_input_devices sort key; otherwise no files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| dictation_tool/engine.py | Core fix: _flush_hold now discards the VAD tail (but still calls force_flush() for state reset) when _raw_shadow has data, preventing duplication; _run adds a continue guard during hold mode to keep batch empty so the post-loop flush never fires stale audio. |
| dictation_tool/io.py | New resampling fallback in AudioStream: _open_stream tries target rate first, then native rate + np.interp; _build_device_candidates is fail-closed for explicit devices; sort secondary key in _all_input_devices always resolves to 0 (wrong dict key). |
| tests/test_engine_advanced.py | Two new tests: test_flush_hold_does_not_duplicate_vad_tail (directly calls _flush_hold, asserts only shadow samples passed to transcribe) and test_hold_mode_does_not_transcribe_vad_batches_before_release (verifies post-loop batch stays empty during hold); both are valid regression tests. |
| tests/test_io.py | Three new AudioStream tests covering: fail-closed explicit device, Windows default fallback candidate list, and resampled callback shape/dtype/raw-chunk equality. |
| README.md | Latency bound updated (5–200 ms) and "good prompt" replaced with "preset"; documentation-only, consistent with the resampling/preset work. |
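
The fail-closed rule for explicit devices can be illustrated with a small sketch. The function below is hypothetical and is not the PR's _build_device_candidates; it assumes that None stands for the system default device and that fallback_inputs is a list of device indices supplied by the caller.

```python
from typing import List, Union

DeviceSpec = Union[str, int, None]


def build_device_candidates(
    requested: DeviceSpec,
    fallback_inputs: List[int],
) -> List[DeviceSpec]:
    """Return the devices to try, in order (illustrative sketch only)."""
    if requested is not None:
        # Explicit choice: never substitute a different microphone.
        # If this device cannot be opened, the caller should fail.
        return [requested]

    # Default device first, then the fallback inputs without duplicates,
    # keeping the order in which they were supplied.
    candidates: List[DeviceSpec] = [None]
    for index in fallback_inputs:
        if index not in candidates:
            candidates.append(index)
    return candidates
```

With an explicit name or index the list has exactly one entry, so a failure to open it surfaces as an error instead of silently recording from another microphone.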

Sequence Diagram

sequenceDiagram
    participant Mouse as Mouse Thread
    participant Run as _run() loop
    participant Shadow as _raw_shadow
    participant VAD as VADGate
    participant Flush as _flush_hold()
    participant Clip as _clip_worker

    Mouse->>Shadow: .clear() on press
    Mouse->>Run: _holding = True

    loop audio chunks during hold
        Run->>VAD: chunk arrives (via stream.chunks())
        Note over Run: mouse_hold_to_record & _holding → continue
        Note over Run: batch stays empty
        Run-->>Shadow: on_raw_chunk → shadow.append(chunk)
    end

    Mouse->>Flush: mouse release triggers _flush_hold()
    Flush->>Shadow: concatenate(_raw_shadow) → segs
    Flush->>Shadow: .clear()
    Flush->>VAD: force_flush() — reset state, discard tail (would duplicate shadow)
    Flush->>Clip: _transcribe(segs) → paste result
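
The hold-mode guard in the loop above can be sketched in isolation. The names collect_batch and holding are hypothetical; the sketch only assumes that chunks arriving while the button is held must not reach the batch, because the release path transcribes them from the raw shadow instead.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def collect_batch(
    chunks: Iterable[T],
    holding: Callable[[], bool],
    mouse_hold_to_record: bool,
) -> List[T]:
    """Collect chunks for batch transcription, skipping any that arrive during a hold."""
    batch: List[T] = []
    for chunk in chunks:
        if mouse_hold_to_record and holding():
            # The release path owns this audio; keep the batch empty so a
            # later flush of the batch cannot transcribe it a second time.
            continue
        batch.append(chunk)
    return batch
```

When mouse_hold_to_record is False the guard never fires and every chunk is batched as before.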

Comments Outside Diff (1)

  1. dictation_tool/io.py, lines 654-658

    P2 Secondary sort key silently wrong — d.get("index", 0) always returns 0

    sounddevice device dicts do not have an "index" key; the device index is the loop variable i. As a result d.get("index", 0) is always 0, making the secondary sort key a no-op (all devices within a host-API priority group are treated as equally ranked). Python's stable sort keeps them in enumeration order, so the behavior is deterministic but not what the intent implies. The fix is to pass the outer pair[0] index instead.
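
    One way to make the secondary key meaningful is to carry the enumeration index through the sort, so the ordering does not depend on the device dict having an "index" entry. This is a sketch under assumptions: order_input_devices and hostapi_priority are hypothetical names, and the dicts only need the "hostapi" and "max_input_channels" keys.

```python
from typing import Dict, List, Sequence, Tuple


def order_input_devices(
    devices: Sequence[dict],
    hostapi_priority: Dict[int, int],
) -> List[Tuple[int, dict]]:
    """Return (index, device) pairs for input devices, preferred host APIs first."""
    # Keep only devices that can record, paired with their enumeration index.
    inputs = [
        (i, d) for i, d in enumerate(devices) if d.get("max_input_channels", 0) > 0
    ]
    # Host APIs missing from the priority map sort after all listed ones.
    lowest = len(hostapi_priority)
    return sorted(
        inputs,
        key=lambda pair: (hostapi_priority.get(pair[1].get("hostapi"), lowest), pair[0]),
    )
```

    Using pair[0] as the tiebreaker makes the within-group order explicit instead of relying on sort stability.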


Reviews (1): Last reviewed commit: "fix(engine): Stabilize dictation capture..."
