feat(audio): unify transcription providers and add local Whisper support#1
Open
hussein1362 wants to merge 244 commits into
Open
feat(audio): unify transcription providers and add local Whisper support#1hussein1362 wants to merge 244 commits into
hussein1362 wants to merge 244 commits into
Conversation
Prevent proactive compaction from archiving sessions that have an in-flight agent task, avoiding mid-turn context truncation when a task runs longer than the idle TTL.
…empty request
When a subagent result is injected with current_role="assistant",
_enforce_role_alternation drops the trailing assistant message, leaving
only the system prompt. Providers like Zhipu/GLM reject such requests
with error 1214 ("messages parameter invalid"). Now the last popped
assistant message is recovered as a user message when no user/tool
messages remain.
Remove two debug log lines that fire on every idle channel check: - "scheduling archival" (logged before knowing if there's work) - "skipping, no un-consolidated messages" (the common no-op path) The meaningful "archived" info log (only on real work) is preserved.
Three improvements to Dream's memory consolidation: 1. Per-line git-blame age annotations: MEMORY.md lines get `← Nd` suffixes (N>14) from dulwich annotate. SOUL.md/USER.md excluded as permanent. LLM uses content judgment, not just age, to decide what to prune. 2. Dedup-aware Phase 1 prompt: reframed as dual-task (extract facts + deduplicate existing files) with explicit redundancy patterns to scan for. Validated through 20 experiments (exp-002 prompt + max_iter=15 was best, averaging -1643 chars/5.4% compression per run). 3. Phase 1 analysis as commit body: dream git commits now include the full Phase 1 analysis for transparency via /dream-log. 4. max_iterations raised from 10 to 15: 30% improvement over 10 with no risk; 20 showed diminishing returns (exp-020: -701 vs exp-017: -1643).
Follow-up to HKUDS#3212, fully backward compatible: - Extract the 14-day staleness threshold as `_STALE_THRESHOLD_DAYS` module constant and pass it into the Phase 1 prompt template as `{{ stale_threshold_days }}`. The number lived in three places before (code threshold, prompt instruction, docstring); now there is one. - Add `DreamConfig.annotate_line_ages` (default True = current behavior) and propagate it through `Dream.__init__` and the gateway wiring in cli/commands.py. Gives users a knob to disable the feature without a code patch if an LLM reacts poorly to the `← Nd` suffix. - Harden `_annotate_with_ages` against dirty working trees: when HEAD blob line count disagrees with the working-tree content length, skip annotation entirely instead of assigning ages to the wrong lines. The previous `i >= len(ages)` guard only handled one direction of the mismatch. - Inline-comment the `max_iterations` 10→15 bump with a pointer to exp002 so future blame has context. - Add 4 regression tests: end-to-end `← 30d` reaches prompt, 14/15 threshold boundary, `annotate_line_ages=False` bypasses git entirely (verified via `assert_not_called`), length-mismatch defense, and template-var rendering. Made-with: Cursor
Complete the symmetry left by HKUDS#3214: ChannelManager._resolve_transcription_base already resolves providers.openai.api_base, but BaseChannel.transcribe_audio instantiated OpenAITranscriptionProvider without forwarding it, and the provider __init__ did not accept the parameter. Self-hosted OpenAI-compatible Whisper endpoints (LiteLLM, vLLM, etc.) configured via config.json were therefore ignored for the OpenAI backend. - OpenAITranscriptionProvider.__init__ now accepts api_base with env fallback (OPENAI_TRANSCRIPTION_BASE_URL) matching the Groq pattern. - BaseChannel.transcribe_audio forwards self.transcription_api_base to OpenAI. - Tests mirror the existing Groq coverage: manager propagation for provider "openai", BaseChannel-to-provider argument passing, and provider default vs override for api_url. Fully backward-compatible: when api_base is None and the env var is unset, the default https://api.openai.com/v1/audio/transcriptions is used. Refs HKUDS#3213, follow-up to HKUDS#3214.
Follow-ups from review of HKUDS#3194: - ci.yml: drop unconditional --ignore=tests/channels/test_matrix_channel.py. That test file already calls pytest.importorskip("nio") at module top, so it self-skips on Windows (where nio isn't installed) without also hiding 62 tests from Linux CI. - filesystem.py: hoist `import os` to the module top and drop the duplicate inline import in ReadFileTool.execute. Document the CRLF->LF normalization as intentional (primarily a Windows UX fix so downstream StrReplace/Grep match consistently regardless of where the file was written). - test_read_enhancements.py: lock down two new behaviors * TestFileStateHashFallback: check_read warns when content changes but mtime is unchanged (coarse-mtime filesystems on Windows). * TestReadFileLineEndingNormalization: ReadFileTool strips CRLF and preserves LF-only files untouched. - test_tool_validation.py: restore list2cmdline/shlex.quote in test_exec_head_tail_truncation. The temp_path-based form was correct, but dropping the quoting broke on any Windows path containing spaces (e.g. C:\Users\John Doe\...). CI runners happen not to have spaces so this slipped through. Tests: 1993 passed locally. Made-with: Cursor
HKUDS#3194 adds `; sys_platform != 'win32'` markers to `matrix-nio[e2e]` so `pip install nanobot-ai[matrix]` no longer fails on Windows — but it also no longer installs matrix-nio there. Without this note, Windows users get a silent half-install and discover the limitation only when the channel crashes at startup. Made-with: Cursor
…or 1214
When _snip_history truncates the message history and the only user message
ends up outside the kept window, providers like GLM reject the resulting
system→assistant sequence with error 1214 ("messages 参数非法").
Two-layer fix:
1. _snip_history now walks backwards through non_system messages to recover
the nearest user message when none exists in the kept window.
2. _enforce_role_alternation inserts a synthetic user message
"(conversation continued)" when the first non-system message is a bare
assistant (no tool_calls), serving as a safety net for any edge cases
that slip through.
Co-authored-by: darlingbud <darlingbud@users.noreply.github.com>
…case test - Extract synthetic user message string to module-level constant - Tighten comments in _snip_history recovery branch - Strengthen no-user edge case test to verify safety net interaction
Skip inbound emails that come from the bot's own configured addresses so a mailbox wired to the same SMTP/IMAP account does not trigger infinite reply loops.
…elf-address match The original regression only exercised a from_address match with all three identity fields set to the same value, so it couldn't distinguish whether _self_addresses actually picks up smtp_username and imap_username or just collapses on from_address. Add a parametrized test covering: - smtp_username-only match (from_address empty, imap_username different) — simulates SMTP relays that rewrite outbound From to the login identity. - imap_username-only match — simulates mailbox-identity setups. - Case-insensitive match — inbound From arriving upper-cased must still hit. No production code changes. Made-with: Cursor
The previous setuptools.backends._legacy:_Backend has been removed in Python 3.14 and newer setuptools, causing 'Cannot import setuptools.backends.legacy' error. Using hatchling (same as main project) ensures compatibility across Python versions. Closes HKUDS#3188
The PyPI package `nanobot` is a different project ("Minimalist robot
navigation framework"), not this one. This project publishes as
`nanobot-ai` (see pyproject.toml). Following the guide as-written would
pull down the wrong package — flagged by vansatchen in HKUDS#3188.
Same toml block as the build-backend fix, one-word change.
Made-with: Cursor
When chat_with_retry returns an error response (finish_reason='error') instead of raising an exception, archive() previously treated the error message as a valid summary and wrote it to history.jsonl, while the original session data was already cleared by /new — causing irreversible data loss. Fix: check finish_reason after the LLM call and raise RuntimeError on error responses, which naturally falls through to the existing raw_archive fallback. This preserves the original messages in history.jsonl instead of losing them. Fixes HKUDS#3244
Two small follow-ups to the guard:
1. Fix the should_execute_tools docstring so it matches the actual code.
The previous version said "Only execute when finish_reason explicitly
signals tool intent" but the code also accepts finish_reason == "stop".
Explain why (some compliant providers emit "stop" with legitimate tool
calls — openai_compat_provider.py already mirrors this at lines ~633 /
~678 where ("tool_calls", "stop") are both treated as the terminal
tool-call state). Without this, a strict "tool_calls"-only guard would
regress 15 existing runner tests that construct LLMResponse with
tool_calls but no explicit finish_reason (default = "stop").
2. Add tests/providers/test_llm_response.py. This locks the three cases:
- no tool calls -> never executes
- tool calls + "tool_calls"/stop -> executes
- tool calls + refusal / content_filter / error / length / ... -> blocked
These are exactly the boundary cases the HKUDS#3220 fix is about; without a
test here a future refactor could silently revert the guard.
Body + tests only, no behavior change beyond the existing PR's intent.
Made-with: Cursor
Made-with: Cursor
The earlier commits picked up a large amount of Black-style reformatting (multi-line frozenset / keyword-arg wrapping / docstring blanks / removed parens) on top of the actual guard fix. @chengyongru flagged it; the first pass reverted some but not all. This restores nanobot/providers/base.py, runner.py, heartbeat/service.py, and utils/evaluator.py to origin/main and reapplies only the guard logic: - base.py: add should_execute_tools property - runner.py / heartbeat/service.py / utils/evaluator.py: route through it + log a warning when has_tool_calls but finish_reason is anomalous Net diff vs main is now +87/-4 (was +211/-102) — roughly 30 lines of real logic, which is what the PR is actually about. Behavior unchanged from previous HEAD; full suite still 2014 passed. Made-with: Cursor
Previously the JSON schema only required "action" but the runtime rejected empty messages, causing LLM retry loops. Making "message" required in the schema prevents the mismatch, and the improved error message guides the LLM to retry with the correct parameters. Fixes HKUDS#3113 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…emove callable The previous patch promoted `message` into top-level `required`, which solved the `add` loop but broke `list` and `remove`: `ToolRegistry.prepare_call` enforces `required` via `validate_params`, so `cron(action="list")` and `cron(action="remove", job_id=...)` — both documented in `SKILL.md` — started failing schema validation with the same "missing required message" shape that HKUDS#3113 describes for `add`. Instead: - Keep `required=["action"]` so `list`/`remove` stay callable. - Prefix `message`'s description with `REQUIRED when action='add'.` and `job_id`'s with `REQUIRED when action='remove'.` so LLMs see the real per-action contract up front. - Keep the improved runtime error message from the previous commit for the case an LLM still omits `message` on `add`. Also add `tests/cron/test_cron_tool_schema_contract.py` to lock in: - `list` and `remove` pass schema validation with no `message` - `add` with `message` passes - `add` without `message` surfaces the actionable runtime error - field descriptions carry the REQUIRED hints - top-level `required` stays `["action"]` Existing `tests/cron/test_cron_tool_list.py` cases bypass schema validation by calling `_list_jobs()` / `_remove_job()` directly, which is why CI didn't catch the regression; the new test goes through `ToolRegistry.prepare_call`.
The streaming API currently logs backend exceptions but still emits the same `finish_reason: "stop"` + `[DONE]` terminator used for successful responses. That makes a failed streamed request look successful to OpenAI-compatible clients. This keeps the fix narrow: track whether the stream backend failed and suppress the success terminator in that case. A regression test locks in the expected behavior. Constraint: Keep the non-streaming response path untouched Constraint: Follow up on the known limitation called out during PR HKUDS#3222 review without redesigning the SSE protocol Rejected: Introduce a custom SSE error event shape in the same patch | expands API surface and review scope Confidence: high Scope-risk: narrow Reversibility: clean Directive: If explicit streamed error events are added later, keep them distinct from the success stop+[DONE] terminator to preserve client retry semantics Tested: PYTHONPATH=$PWD pytest -q tests/test_api_stream.py /Users/jh0927/Workspace/nanobot-validation-artifacts-2026-04-18/test_api_stream_error_regression.py Not-tested: Full repository test suite Related: HKUDS#3260 Related: HKUDS#3222
… buried _with_thread_context prepends conversation history to the message content. This turned "/restart" into "Slack thread context...\n\n Current message:\n/restart", which the command router could not match as a priority command. Skip the context enrichment when the stripped text starts with "/". Made-with: Cursor
Keep the root README focused on the main setup path and leave Slack-specific upload permissions in the chat apps guide. Made-with: Cursor
When deployed with Docker and workspace mounted as a volume, sending media files failed because relative paths (e.g. output/image.png) were not resolved against the workspace directory. The process CWD differs from the workspace in containerized environments, causing os.path.isfile checks to fail in channel handlers. Normalize relative media paths at the MessageTool entry point using get_workspace_path().
Made-with: Cursor
…stem-channel branch Builds on PR HKUDS#3463 (commit 038a140), which introduced metadata and session_key parameters through _LoopHook and _set_tool_context for the cron and message tools. Three downstream gaps remained: 1. _set_tool_context's body still computes effective_key from channel:chat_id and passes that to spawn, even when the caller provides a thread-scoped session_key. The new parameter is wired in for cron/message but spawn dispatch ignores it. Result: subagent announces from threaded callers carry a channel-only session_key_override, dropping thread_ts. 2. _process_message's system-channel branch loads the session via key = f"{channel}:{chat_id}", ignoring msg.session_key_override. So even when the announce InboundMessage carries the right override (after fix 1), the consumer side discards it and routes to the channel-level session. 3. The OutboundMessage returned from the system-channel branch has no metadata, so slack's outbound dispatcher has no thread_ts to use and posts the LLM's reply to the channel top-level rather than the originating thread. This change closes all three gaps with three small edits in loop.py. Behavior change: - Slack channels with reply_in_thread: true: subagent announces and follow-up replies now arrive in the originating thread session instead of leaking into the channel-level session. - Other channels constructing thread-scoped session keys (matrix threads, telegram thread mode, etc.): the session-loading and effective-key fixes apply identically since they're platform-agnostic. The outbound thread_ts reconstruction is slack-specific by virtue of the session-key format slack uses; other channels would benefit from the same pattern but are out of scope for this PR. - Unified session mode: no change. Falls back to UNIFIED_SESSION_KEY when session_key is not provided. - CLI / non-channel callers: no change. They don't pass session_key and the fallback to f"{channel}:{chat_id}" matches prior behavior. Reproducer (slack with reply_in_thread: true): 1. From a slack thread, send a message that triggers a subagent spawn. 2. Before fix: announce lands in slack:<channel>.jsonl session, parent agent in the thread never sees the completion event, eventual reply (if any) posts to the channel top-level, not the thread. 3. After fix: announce lands in slack:<channel>:<thread_ts>.jsonl, parent agent in the thread responds within seconds, reply posts in the thread.
Made-with: Cursor
Slack inbound events with subtype=file_share were silently dropped, so nanobot never saw messages that included attachments. Allow file_share through, download Slack-private files using the bot token into the local media dir, and pass them to the agent as media paths plus a "[file: name]" / "[image: name]" placeholder in the content. Reject responses that look like Slack's login HTML so an auth page is never saved as if it were the user's file. Document the required files:read scope alongside files:write so installs that read attachments are not quietly missing the permission.
Past assistant turns in history were prefixed with "[Message Time: ...]" just like user turns. The model treated these as in-context demos and started prefixing its own replies with the same marker, leaking metadata to the user. Prompt-level warnings could not beat dozens of prior assistant samples. Annotate only user turns and proactive deliveries (_channel_delivery=True, i.e. cron / heartbeat pushes whose timing is the whole point and which are too infrequent to act as demos). Adjacent user-side timestamps still pin every normal assistant reply for relative-time reasoning. The now-redundant identity.md warning is removed along with the demonstration source.
Resolve the MSTeams stale-reference cleanup conflict by keeping the PR's locked, atomic sidecar-meta implementation and aligning the merged test expectation locally. Made-with: Cursor
The PR stores ref freshness in the metadata sidecar, so the merged main test should assert updated_at there instead of in the refs payload. Made-with: Cursor
Preserve main's timestamp/tool-context replay semantics while keeping the PR's session history and file-cap budgets. Made-with: Cursor
…ed MSTeams session fix: Automatically clean up unsupported or expired MSTeams session
Move sessionHistoryMaxMessages, sessionHistoryMaxTokens, and sessionFileMaxMessages out of user-facing config into internal constants (HISTORY_MAX_MESSAGES=120, FILE_MAX_MESSAGES=2000). - Remove 3 fields from AgentDefaults and config pipeline - Sink enforce_file_cap into Session (was AgentLoop) - Auto-derive token budget from context window (was configurable) - Net -113 lines across 7 files; 723 tests green Made-with: Cursor
…s for history lifecycle feat(session): enforce replay/file-cap invariants for history lifecycle
Made-with: Cursor
…solation and allowlist enforcement fix(discord): full thread support with session isolation and allowlist enforcement
…lback in delivery
Three failure modes addressed:
1. Model reflects HEARTBEAT.md instructions back as output instead of
executing them ("HEARTBEAT.md has active tasks listed...")
2. Model narrates decision logic ("Best judgment call: stay quiet")
3. Model produces empty output for silence, runner treats it as failure,
finalization retry generates "couldn't produce a final answer" which
gets delivered to the user
Changes:
- Add _is_deliverable() pre-filter in HeartbeatService._tick() that catches
finalization fallback messages and leaked reasoning patterns before they
reach the evaluator
- Wrap Phase 2 task input with a delivery-awareness preamble telling the
model its output goes directly to the user's messaging app
- Add meta-reasoning suppression criterion to evaluator template
No changes to agent/loop.py, runner.py, providers, or config schema.
Adds /history [n] to display the last N user/assistant messages from the current session (default 10, max 50). - Tool and system messages are filtered out for readability - Long messages are truncated to 200 characters with an ellipsis - Multimodal content (image blocks) is collapsed to its text parts - Invalid count argument returns a usage hint - /history n uses prefix routing; /history uses exact routing Also registers /history in build_help_text().
Made-with: Cursor
Made-with: Cursor
Use a provider capability name that describes user-visible progress delta support instead of the runner implementation detail. Made-with: Cursor
Merge GroqTranscriptionProvider and OpenAITranscriptionProvider into a single WhisperTranscriptionProvider that works with any OpenAI-compatible /v1/audio/transcriptions endpoint. Key changes: - **Unified provider**: WhisperTranscriptionProvider handles groq, openai, and local (whisper.cpp, faster-whisper, LocalAI) with provider-specific defaults for API base URL and model name. - **Local Whisper support**: provider='local' with configurable api_base pointing to any local Whisper server. No API key required. - **TranscriptionConfig**: Proper Pydantic config model in schema.py with validation (language pattern, max_duration_seconds bounds). - **Graceful failure**: Channels now show user-facing messages when transcription is unavailable instead of silently passing raw audio paths. Added transcription_available property to BaseChannel. - **Startup warnings**: ChannelManager logs a warning when voice-capable channels are enabled but transcription is not properly configured. - **Config resolution**: Manager builds TranscriptionConfig by merging the new typed block with legacy flat fields and provider-section API keys. - **Backward compatible**: Old GroqTranscriptionProvider and OpenAITranscriptionProvider still importable as thin subclass aliases. Legacy flat transcription_provider/transcription_language fields still work. Existing configs require zero changes. - **Duration guard**: Rejects audio files exceeding max_duration_seconds (estimated by file size heuristic) before making API calls. - **Tests**: 35 new tests covering provider defaults, availability checks, transcription flow, backward-compat aliases, config validation, channel integration, and error paths.
…path so mocks work The unified transcribe_audio() refactor replaced the provider-specific instantiation (GroqTranscriptionProvider / OpenAITranscriptionProvider) with a single WhisperTranscriptionProvider call. This broke tests that patch the specific classes, because the patches were never reached. For the legacy flat-attribute path, restore dispatch to the concrete provider class via the module object so unittest.mock.patch.object stays effective. The structured _transcription_config path is unaffected and retains its is_available guard. Fixes CI failures: tests/channels/test_channel_plugins.py::test_base_channel_passes_api_base_to_openai_transcription_provider tests/channels/test_channel_plugins.py::test_base_channel_passes_language_to_groq_transcription_provider
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Merge the two nearly-identical transcription providers into a single
WhisperTranscriptionProviderand add first-class support for local Whisper servers (whisper.cpp, faster-whisper, LocalAI, Ollama).Problem
Changes
nanobot/providers/transcription.pyWhisperTranscriptionProvider— single unified class handlinggroq,openai, andlocalproviders_PROVIDER_DEFAULTSdictis_available/unavailable_reasonproperties for clean availability checksapi_basepointing to your Whisper serverGroqTranscriptionProvider/OpenAITranscriptionProviderkept as deprecated subclass aliasesnanobot/config/schema.pyTranscriptionConfig— proper Pydantic model with validationChannelsConfigastranscriptionfieldnanobot/channels/base.pytranscription_availableproperty — checks if transcription is readytranscribe_audio()uses unified provider via typed config (falls back to legacy)nanobot/channels/manager.py_build_transcription_config()— merges typed block + legacy fields + provider-section keys_warn_transcription_unconfigured()— startup warning for voice channelsnanobot/channels/telegram.py/whatsapp.pyTests (35 new)
Local Whisper Setup
{ "channels": { "transcription": { "provider": "local", "api_base": "http://localhost:8080/v1/audio/transcriptions", "model": "large-v3" } } }Backward Compatibility