feat: expose native Whisper decoding options #3717

Open
sriharan0804 wants to merge 3 commits into docling-project:main from sriharan0804:feat-native-whisper-options

Conversation

@sriharan0804

Summary

This PR exposes additional decoding options for the native Whisper ASR backend and forwards them to whisper.transcribe().

Changes

  • Added beam_size to InlineAsrNativeWhisperOptions.
  • Added condition_on_previous_text to InlineAsrNativeWhisperOptions.
  • Forwarded both options to the native Whisper transcribe() call in _NativeWhisperModel.

This allows users to configure Whisper's decoding behavior through Docling instead of relying on Whisper's internal defaults.
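
The forwarding described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual diff: `NativeWhisperOptionsSketch`, `build_transcribe_kwargs`, and `fake_transcribe` are hypothetical names, and `fake_transcribe` only echoes its keyword arguments so the example runs without Whisper installed.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class NativeWhisperOptionsSketch:
    """Hypothetical stand-in for the Docling options class named in the PR."""

    language: str
    temperature: float
    beam_size: Optional[int]
    condition_on_previous_text: bool


def build_transcribe_kwargs(opts: NativeWhisperOptionsSketch) -> Dict[str, Any]:
    """Collect the four options that the PR forwards to transcribe()."""
    return {
        "language": opts.language,
        "temperature": opts.temperature,
        "beam_size": opts.beam_size,
        "condition_on_previous_text": opts.condition_on_previous_text,
    }


def fake_transcribe(audio_path: str, **kwargs: Any) -> Dict[str, Any]:
    """Stand-in for the real transcribe call; it only echoes its inputs."""
    return {"audio": audio_path, "options": kwargs}


# Illustrative values only; they are not the defaults chosen in the PR.
opts = NativeWhisperOptionsSketch(
    language="en", temperature=0.0, beam_size=5, condition_on_previous_text=False
)
result = fake_transcribe("meeting.wav", **build_transcribe_kwargs(opts))
print(result["options"])
```

Building the keyword dictionary in one place keeps the set of forwarded options easy to review when more of them are exposed later.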

Closes #3703.

@github-actions

Contributor

DCO Check Passed

Thanks @sriharan0804, all your commits are properly signed off. 🎉

@mergify

mergify Bot commented Jun 27, 2026

Contributor

Merge Protections

🟢 Merge protection satisfied — ready to merge.

🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
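
As a rough check of what that rule accepts, the pattern can be tried with Python's `re` module. This sketch assumes the `~=` condition behaves like a regular-expression match against the PR title; `follows_conventional_commits` is a hypothetical helper, and the second and third titles are made-up examples.

```python
import re

# Pattern copied from the merge-protection rule above.
TITLE_PATTERN = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def follows_conventional_commits(title: str) -> bool:
    """Return True when the title starts with a recognised type prefix."""
    return TITLE_PATTERN.match(title) is not None


# This PR's title, then two hypothetical titles for comparison.
print(follows_conventional_commits("feat: expose native Whisper decoding options"))
print(follows_conventional_commits("feat(asr)!: change decoding defaults"))
print(follows_conventional_commits("expose native Whisper decoding options"))
```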

Signed-off-by: sriharan2005@Tamil-- <sriharan0804@users.noreply.github.com>
@sriharan0804 sriharan0804 force-pushed the feat-native-whisper-options branch from ced7da1 to 16898c5 Compare June 27, 2026 06:36
Signed-off-by: sriharan2005@Tamil-- <sriharan0804@users.noreply.github.com>
PeterStaar-IBM previously approved these changes Jun 27, 2026

@PeterStaar-IBM PeterStaar-IBM left a comment

Member

lgtm!

@codecov

codecov Bot commented Jun 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

)
),
] = None
condition_on_previous_text: Annotated[

condition_on_previous_text: Annotated[bool, …] = None — a type/default mismatch (annotated bool, defaulted None). Because it's always forwarded, whisper receives None, which is falsy, so conditioning is effectively OFF by default. The field doc says "When unset, Whisper uses its default" — that's incorrect: passing None is not the same as omitting the argument; whisper's real default is True.
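
The difference between passing `None` and omitting the argument can be shown with a small stand-in. `transcribe_stub` below is hypothetical: it only copies the default (`True`) and the truthiness check described in the comment above, not Whisper's real code. `forward_set_options` is one possible pattern for preserving a library default, shown for illustration rather than as something this PR adopts.

```python
from typing import Any, Dict, Optional


def transcribe_stub(condition_on_previous_text: Optional[bool] = True) -> str:
    """Hypothetical stand-in: default True, with a plain truthiness check."""
    return "conditioning on" if condition_on_previous_text else "conditioning off"


def forward_set_options(**options: Any) -> Dict[str, Any]:
    """Drop options left as None so the callee's own defaults apply."""
    return {name: value for name, value in options.items() if value is not None}


omitted = transcribe_stub()
passed_none = transcribe_stub(condition_on_previous_text=None)
filtered = transcribe_stub(**forward_set_options(condition_on_previous_text=None))

print(omitted)      # conditioning on
print(passed_none)  # conditioning off
print(filtered)     # conditioning on
```

Filtering out unset values keeps "unset" meaning "use the library default", whereas always forwarding the field makes the wrapper's default the effective one.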

Comment on lines +221 to +224
self.language = asr_options.language
self.beam_size = asr_options.beam_size
self.condition_on_previous_text = asr_options.condition_on_previous_text
self.temperature = asr_options.temperature

Love forwarding more than just the two that I did...right choice IMHO.

@sriharan0804 sriharan0804 force-pushed the feat-native-whisper-options branch from 2d614d9 to 24e478f Compare June 27, 2026 13:35
@sriharan0804

Author

@BBC-Esq
Thanks for the review and the feedback!

I'm glad you liked forwarding language and temperature as well. I also agree with your point about condition_on_previous_text: using None while always forwarding the argument doesn't preserve Whisper's default behavior.

I've updated the default to True to match Whisper's default while still allowing users to explicitly set it to False.

@BBC-Esq

BBC-Esq commented Jun 27, 2026

If this PR is meant to resolve #3703, the beam_size default should be 1, not None. beam_size=None selects whisper's GreedyDecoder (decoding.py:551) — the greedy decode path prone to the long-form repetition loop the issue describes — while any non-None value routes through BeamSearchDecoder instead (decoding.py:546-547). As written, the default ships the same greedy behavior the issue reports...
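
The branch described in that comment can be paraphrased as below. This is a simplified sketch based only on the description above: `select_decoder` and the strings it returns are hypothetical, and the real selection lives in Whisper's `decoding.py`.

```python
from typing import Optional


def select_decoder(beam_size: Optional[int]) -> str:
    """Mirror the described rule: None means greedy, anything else beam search."""
    if beam_size is not None:
        return "BeamSearchDecoder"
    return "GreedyDecoder"


for value in (None, 1, 5):
    print(value, "->", select_decoder(value))
```

Under this rule, even `beam_size=1` takes the beam-search path, which is why the comment treats `1` and `None` as different defaults.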

@BBC-Esq

BBC-Esq commented Jun 27, 2026

> @BBC-Esq Thanks for the review and the feedback!
>
> I'm glad you liked forwarding language and temperature as well. I also agree with your point about condition_on_previous_text: using None while always forwarding the argument doesn't preserve Whisper's default behavior.
>
> I've updated the default to True to match Whisper's default while still allowing users to explicitly set it to False.

IMHO, the default should be False, not the openai-whisper library's True, because of the spiraling issue. Users who really want conditioning can still set it to True.

@sriharan0804

Author

@BBC-Esq Thanks for the additional feedback. That makes sense.

My initial goal was to expose the native Whisper decoding options while keeping the defaults backward-compatible with the current behavior. I agree that using beam_size=1 and condition_on_previous_text=False could provide a better out-of-the-box experience for the long-form repetition issue described in #3703.

I'm happy to update the defaults if the maintainers would prefer the PR to change the default behavior rather than simply expose the options.

Signed-off-by: sriharan2005@Tamil-- <sriharan0804@users.noreply.github.com>
@sriharan0804

Author

@BBC-Esq
Thanks for the detailed review!

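I've addressed the requested changes:

  • Updated condition_on_previous_text to avoid the None default.
  • Changed the defaults to beam_size=1 and condition_on_previous_text=False to better address the long-form decoding issue discussed in #3703.
  • Updated the option descriptions to reflect the new defaults while continuing to forward language, beam_size, condition_on_previous_text, and temperature to the native Whisper backend.

Please let me know if you'd like any further changes.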

@BBC-Esq

BBC-Esq commented Jun 28, 2026

> @BBC-Esq Thanks for the detailed review!
>
> I've addressed the requested changes:
>
> * Updated `condition_on_previous_text` to avoid the `None` default.
> * Changed the defaults to `beam_size=1` and `condition_on_previous_text=False` to better address the long-form decoding issue discussed in [#3703](https://github.com/docling-project/docling/issues/3703).
> * Updated the option descriptions to reflect the new defaults while continuing to forward `language`, `beam_size`, `condition_on_previous_text`, and `temperature` to the native Whisper backend.
>
> Please let me know if you'd like any further changes.

LGTM!

Development

Successfully merging this pull request may close these issues.

[feat] Native Whisper: greedy decode default hangs on long-form audio - expose beam_size / condition_on_previous_text / temperature

3 participants