Skip to content

feat(audio): unify transcription providers and add local Whisper support#1

Open
hussein1362 wants to merge 244 commits into
mainfrom
feat/voice-transcription-first-class
Open

feat(audio): unify transcription providers and add local Whisper support#1
hussein1362 wants to merge 244 commits into
mainfrom
feat/voice-transcription-first-class

Conversation

@hussein1362
Copy link
Copy Markdown
Owner

Summary

Merge the two nearly-identical transcription providers into a single WhisperTranscriptionProvider and add first-class support for local Whisper servers (whisper.cpp, faster-whisper, LocalAI, Ollama).

Problem

  1. Silent failure — when API key is missing, users get raw audio paths with no explanation
  2. Only two hardcoded providers (Groq, OpenAI) — no way to use local Whisper
  3. No config validation or startup warning for missing transcription config
  4. Near-identical code — both providers hit the same OpenAI-compatible endpoint
  5. Loose config — flat attributes on BaseChannel instead of proper typed config

Changes

nanobot/providers/transcription.py

  • WhisperTranscriptionProvider — single unified class handling groq, openai, and local providers
  • Provider-specific defaults (API base URL, model name) via _PROVIDER_DEFAULTS dict
  • is_available / unavailable_reason properties for clean availability checks
  • Duration guard — rejects oversized audio files before making API calls
  • Local provider: no API key required, just api_base pointing to your Whisper server
  • GroqTranscriptionProvider / OpenAITranscriptionProvider kept as deprecated subclass aliases

nanobot/config/schema.py

  • TranscriptionConfig — proper Pydantic model with validation
  • Added to ChannelsConfig as transcription field
  • Legacy flat fields preserved for backward compat

nanobot/channels/base.py

  • transcription_available property — checks if transcription is ready
  • transcribe_audio() uses unified provider via typed config (falls back to legacy)

nanobot/channels/manager.py

  • _build_transcription_config() — merges typed block + legacy fields + provider-section keys
  • _warn_transcription_unconfigured() — startup warning for voice channels

nanobot/channels/telegram.py / whatsapp.py

  • User-facing messages when transcription unavailable

Tests (35 new)

  • Provider defaults, availability, mocked HTTP, language hints, local no-auth, errors, duration guard, aliases
  • Channel availability, transcribe flow, disabled/unavailable config, graceful failure, config validation

Local Whisper Setup

{
  "channels": {
    "transcription": {
      "provider": "local",
      "api_base": "http://localhost:8080/v1/audio/transcriptions",
      "model": "large-v3"
    }
  }
}

Backward Compatibility

  • Existing flat-field configs keep working
  • Old provider class imports still work (deprecated aliases)
  • Zero-change upgrade for existing users
  • All existing tests pass

chengyongru and others added 30 commits April 13, 2026 12:01
Prevent proactive compaction from archiving sessions that have an
in-flight agent task, avoiding mid-turn context truncation when a
task runs longer than the idle TTL.
…empty request

When a subagent result is injected with current_role="assistant",
_enforce_role_alternation drops the trailing assistant message, leaving
only the system prompt. Providers like Zhipu/GLM reject such requests
with error 1214 ("messages parameter invalid"). Now the last popped
assistant message is recovered as a user message when no user/tool
messages remain.
Remove two debug log lines that fire on every idle channel check:
- "scheduling archival" (logged before knowing if there's work)
- "skipping, no un-consolidated messages" (the common no-op path)

The meaningful "archived" info log (only on real work) is preserved.
Three improvements to Dream's memory consolidation:

1. Per-line git-blame age annotations: MEMORY.md lines get `← Nd` suffixes
   (N>14) from dulwich annotate. SOUL.md/USER.md excluded as permanent.
   LLM uses content judgment, not just age, to decide what to prune.

2. Dedup-aware Phase 1 prompt: reframed as dual-task (extract facts +
   deduplicate existing files) with explicit redundancy patterns to scan for.
   Validated through 20 experiments (exp-002 prompt + max_iter=15 was best,
   averaging -1643 chars/5.4% compression per run).

3. Phase 1 analysis as commit body: dream git commits now include the full
   Phase 1 analysis for transparency via /dream-log.

4. max_iterations raised from 10 to 15: 30% improvement over 10 with no
   risk; 20 showed diminishing returns (exp-020: -701 vs exp-017: -1643).
Follow-up to HKUDS#3212, fully backward compatible:

- Extract the 14-day staleness threshold as `_STALE_THRESHOLD_DAYS` module
  constant and pass it into the Phase 1 prompt template as
  `{{ stale_threshold_days }}`. The number lived in three places before
  (code threshold, prompt instruction, docstring); now there is one.
- Add `DreamConfig.annotate_line_ages` (default True = current behavior)
  and propagate it through `Dream.__init__` and the gateway wiring in
  cli/commands.py. Gives users a knob to disable the feature without a
  code patch if an LLM reacts poorly to the `← Nd` suffix.
- Harden `_annotate_with_ages` against dirty working trees: when HEAD
  blob line count disagrees with the working-tree content length, skip
  annotation entirely instead of assigning ages to the wrong lines. The
  previous `i >= len(ages)` guard only handled one direction of the
  mismatch.
- Inline-comment the `max_iterations` 10→15 bump with a pointer to
  exp002 so future blame has context.
- Add 4 regression tests: end-to-end `← 30d` reaches prompt, 14/15
  threshold boundary, `annotate_line_ages=False` bypasses git entirely
  (verified via `assert_not_called`), length-mismatch defense, and
  template-var rendering.

Made-with: Cursor
Complete the symmetry left by HKUDS#3214: ChannelManager._resolve_transcription_base
already resolves providers.openai.api_base, but BaseChannel.transcribe_audio
instantiated OpenAITranscriptionProvider without forwarding it, and the provider
__init__ did not accept the parameter. Self-hosted OpenAI-compatible Whisper
endpoints (LiteLLM, vLLM, etc.) configured via config.json were therefore
ignored for the OpenAI backend.

- OpenAITranscriptionProvider.__init__ now accepts api_base with env fallback
  (OPENAI_TRANSCRIPTION_BASE_URL) matching the Groq pattern.
- BaseChannel.transcribe_audio forwards self.transcription_api_base to OpenAI.
- Tests mirror the existing Groq coverage: manager propagation for provider
  "openai", BaseChannel-to-provider argument passing, and provider default vs
  override for api_url.

Fully backward-compatible: when api_base is None and the env var is unset,
the default https://api.openai.com/v1/audio/transcriptions is used.

Refs HKUDS#3213, follow-up to HKUDS#3214.
Follow-ups from review of HKUDS#3194:

- ci.yml: drop unconditional --ignore=tests/channels/test_matrix_channel.py.
  That test file already calls pytest.importorskip("nio") at module top, so
  it self-skips on Windows (where nio isn't installed) without also hiding
  62 tests from Linux CI.

- filesystem.py: hoist `import os` to the module top and drop the duplicate
  inline import in ReadFileTool.execute. Document the CRLF->LF normalization
  as intentional (primarily a Windows UX fix so downstream StrReplace/Grep
  match consistently regardless of where the file was written).

- test_read_enhancements.py: lock down two new behaviors
  * TestFileStateHashFallback: check_read warns when content changes but
    mtime is unchanged (coarse-mtime filesystems on Windows).
  * TestReadFileLineEndingNormalization: ReadFileTool strips CRLF and
    preserves LF-only files untouched.

- test_tool_validation.py: restore list2cmdline/shlex.quote in
  test_exec_head_tail_truncation. The temp_path-based form was correct,
  but dropping the quoting broke on any Windows path containing spaces
  (e.g. C:\Users\John Doe\...). CI runners happen not to have spaces so
  this slipped through.

Tests: 1993 passed locally.
Made-with: Cursor
HKUDS#3194 adds `; sys_platform != 'win32'` markers to `matrix-nio[e2e]` so
`pip install nanobot-ai[matrix]` no longer fails on Windows — but it also
no longer installs matrix-nio there. Without this note, Windows users get
a silent half-install and discover the limitation only when the channel
crashes at startup.

Made-with: Cursor
…or 1214

When _snip_history truncates the message history and the only user message
ends up outside the kept window, providers like GLM reject the resulting
system→assistant sequence with error 1214 ("messages 参数非法").

Two-layer fix:
1. _snip_history now walks backwards through non_system messages to recover
   the nearest user message when none exists in the kept window.
2. _enforce_role_alternation inserts a synthetic user message
   "(conversation continued)" when the first non-system message is a bare
   assistant (no tool_calls), serving as a safety net for any edge cases
   that slip through.

Co-authored-by: darlingbud <darlingbud@users.noreply.github.com>
…case test

- Extract synthetic user message string to module-level constant
- Tighten comments in _snip_history recovery branch
- Strengthen no-user edge case test to verify safety net interaction
Skip inbound emails that come from the bot's own configured addresses so a mailbox wired to the same SMTP/IMAP account does not trigger infinite reply loops.
…elf-address match

The original regression only exercised a from_address match with all three
identity fields set to the same value, so it couldn't distinguish whether
_self_addresses actually picks up smtp_username and imap_username or just
collapses on from_address. Add a parametrized test covering:

- smtp_username-only match (from_address empty, imap_username different) —
  simulates SMTP relays that rewrite outbound From to the login identity.
- imap_username-only match — simulates mailbox-identity setups.
- Case-insensitive match — inbound From arriving upper-cased must still hit.

No production code changes.

Made-with: Cursor
The previous setuptools.backends._legacy:_Backend has been removed in
Python 3.14 and newer setuptools, causing 'Cannot import setuptools.backends.legacy' error.

Using hatchling (same as main project) ensures compatibility across Python versions.

Closes HKUDS#3188
The PyPI package `nanobot` is a different project ("Minimalist robot
navigation framework"), not this one. This project publishes as
`nanobot-ai` (see pyproject.toml). Following the guide as-written would
pull down the wrong package — flagged by vansatchen in HKUDS#3188.

Same toml block as the build-backend fix, one-word change.

Made-with: Cursor
When chat_with_retry returns an error response (finish_reason='error')
instead of raising an exception, archive() previously treated the error
message as a valid summary and wrote it to history.jsonl, while the
original session data was already cleared by /new — causing irreversible
data loss.

Fix: check finish_reason after the LLM call and raise RuntimeError on
error responses, which naturally falls through to the existing raw_archive
fallback. This preserves the original messages in history.jsonl instead
of losing them.

Fixes HKUDS#3244
Two small follow-ups to the guard:

1. Fix the should_execute_tools docstring so it matches the actual code.
   The previous version said "Only execute when finish_reason explicitly
   signals tool intent" but the code also accepts finish_reason == "stop".
   Explain why (some compliant providers emit "stop" with legitimate tool
   calls — openai_compat_provider.py already mirrors this at lines ~633 /
   ~678 where ("tool_calls", "stop") are both treated as the terminal
   tool-call state). Without this, a strict "tool_calls"-only guard would
   regress 15 existing runner tests that construct LLMResponse with
   tool_calls but no explicit finish_reason (default = "stop").

2. Add tests/providers/test_llm_response.py. This locks the three cases:
   - no tool calls                  -> never executes
   - tool calls + "tool_calls"/stop -> executes
   - tool calls + refusal / content_filter / error / length / ... -> blocked

   These are exactly the boundary cases the HKUDS#3220 fix is about; without a
   test here a future refactor could silently revert the guard.

Body + tests only, no behavior change beyond the existing PR's intent.

Made-with: Cursor
The earlier commits picked up a large amount of Black-style reformatting
(multi-line frozenset / keyword-arg wrapping / docstring blanks / removed
parens) on top of the actual guard fix. @chengyongru flagged it; the
first pass reverted some but not all.

This restores nanobot/providers/base.py, runner.py, heartbeat/service.py,
and utils/evaluator.py to origin/main and reapplies only the guard logic:

  - base.py: add should_execute_tools property
  - runner.py / heartbeat/service.py / utils/evaluator.py: route through it
    + log a warning when has_tool_calls but finish_reason is anomalous

Net diff vs main is now +87/-4 (was +211/-102) — roughly 30 lines of real
logic, which is what the PR is actually about.

Behavior unchanged from previous HEAD; full suite still 2014 passed.

Made-with: Cursor
Previously the JSON schema only required "action" but the runtime
rejected empty messages, causing LLM retry loops. Making "message"
required in the schema prevents the mismatch, and the improved error
message guides the LLM to retry with the correct parameters.

Fixes HKUDS#3113

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…emove callable

The previous patch promoted `message` into top-level `required`, which solved
the `add` loop but broke `list` and `remove`: `ToolRegistry.prepare_call`
enforces `required` via `validate_params`, so `cron(action="list")` and
`cron(action="remove", job_id=...)` — both documented in `SKILL.md` — started
failing schema validation with the same "missing required message" shape that
HKUDS#3113 describes for `add`.

Instead:
- Keep `required=["action"]` so `list`/`remove` stay callable.
- Prefix `message`'s description with `REQUIRED when action='add'.` and
  `job_id`'s with `REQUIRED when action='remove'.` so LLMs see the real
  per-action contract up front.
- Keep the improved runtime error message from the previous commit for the
  case an LLM still omits `message` on `add`.

Also add `tests/cron/test_cron_tool_schema_contract.py` to lock in:
  - `list` and `remove` pass schema validation with no `message`
  - `add` with `message` passes
  - `add` without `message` surfaces the actionable runtime error
  - field descriptions carry the REQUIRED hints
  - top-level `required` stays `["action"]`

Existing `tests/cron/test_cron_tool_list.py` cases bypass schema validation by
calling `_list_jobs()` / `_remove_job()` directly, which is why CI didn't catch
the regression; the new test goes through `ToolRegistry.prepare_call`.
The streaming API currently logs backend exceptions but still emits the
same `finish_reason: "stop"` + `[DONE]` terminator used for successful
responses. That makes a failed streamed request look successful to
OpenAI-compatible clients.

This keeps the fix narrow: track whether the stream backend failed and
suppress the success terminator in that case. A regression test locks in
the expected behavior.

Constraint: Keep the non-streaming response path untouched
Constraint: Follow up on the known limitation called out during PR HKUDS#3222 review without redesigning the SSE protocol
Rejected: Introduce a custom SSE error event shape in the same patch | expands API surface and review scope
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: If explicit streamed error events are added later, keep them distinct from the success stop+[DONE] terminator to preserve client retry semantics
Tested: PYTHONPATH=$PWD pytest -q tests/test_api_stream.py /Users/jh0927/Workspace/nanobot-validation-artifacts-2026-04-18/test_api_stream_error_regression.py
Not-tested: Full repository test suite
Related: HKUDS#3260
Related: HKUDS#3222
Re-bin and others added 30 commits April 27, 2026 12:45
… buried

_with_thread_context prepends conversation history to the message
content.  This turned "/restart" into "Slack thread context...\n\n
Current message:\n/restart", which the command router could not match
as a priority command.  Skip the context enrichment when the stripped
text starts with "/".

Made-with: Cursor
Keep the root README focused on the main setup path and leave Slack-specific upload permissions in the chat apps guide.

Made-with: Cursor
When deployed with Docker and workspace mounted as a volume, sending
media files failed because relative paths (e.g. output/image.png) were
not resolved against the workspace directory. The process CWD differs
from the workspace in containerized environments, causing os.path.isfile
checks to fail in channel handlers. Normalize relative media paths at
the MessageTool entry point using get_workspace_path().
…stem-channel branch

Builds on PR HKUDS#3463 (commit 038a140), which introduced metadata and
session_key parameters through _LoopHook and _set_tool_context for the
cron and message tools. Three downstream gaps remained:

1. _set_tool_context's body still computes effective_key from
   channel:chat_id and passes that to spawn, even when the caller
   provides a thread-scoped session_key. The new parameter is wired in
   for cron/message but spawn dispatch ignores it. Result: subagent
   announces from threaded callers carry a channel-only
   session_key_override, dropping thread_ts.

2. _process_message's system-channel branch loads the session via
   key = f"{channel}:{chat_id}", ignoring msg.session_key_override.
   So even when the announce InboundMessage carries the right override
   (after fix 1), the consumer side discards it and routes to the
   channel-level session.

3. The OutboundMessage returned from the system-channel branch has no
   metadata, so slack's outbound dispatcher has no thread_ts to use and
   posts the LLM's reply to the channel top-level rather than the
   originating thread.

This change closes all three gaps with three small edits in loop.py.

Behavior change:
- Slack channels with reply_in_thread: true: subagent announces and
  follow-up replies now arrive in the originating thread session
  instead of leaking into the channel-level session.
- Other channels constructing thread-scoped session keys (matrix
  threads, telegram thread mode, etc.): the session-loading and
  effective-key fixes apply identically since they're platform-agnostic.
  The outbound thread_ts reconstruction is slack-specific by virtue of
  the session-key format slack uses; other channels would benefit from
  the same pattern but are out of scope for this PR.
- Unified session mode: no change. Falls back to UNIFIED_SESSION_KEY
  when session_key is not provided.
- CLI / non-channel callers: no change. They don't pass session_key
  and the fallback to f"{channel}:{chat_id}" matches prior behavior.

Reproducer (slack with reply_in_thread: true):
1. From a slack thread, send a message that triggers a subagent spawn.
2. Before fix: announce lands in slack:<channel>.jsonl session,
   parent agent in the thread never sees the completion event,
   eventual reply (if any) posts to the channel top-level, not the
   thread.
3. After fix: announce lands in slack:<channel>:<thread_ts>.jsonl,
   parent agent in the thread responds within seconds, reply posts in
   the thread.
Slack inbound events with subtype=file_share were silently dropped, so
nanobot never saw messages that included attachments. Allow file_share
through, download Slack-private files using the bot token into the
local media dir, and pass them to the agent as media paths plus a
"[file: name]" / "[image: name]" placeholder in the content. Reject
responses that look like Slack's login HTML so an auth page is never
saved as if it were the user's file. Document the required files:read
scope alongside files:write so installs that read attachments are not
quietly missing the permission.
Past assistant turns in history were prefixed with "[Message Time: ...]"
just like user turns. The model treated these as in-context demos and
started prefixing its own replies with the same marker, leaking
metadata to the user. Prompt-level warnings could not beat dozens of
prior assistant samples.

Annotate only user turns and proactive deliveries
(_channel_delivery=True, i.e. cron / heartbeat pushes whose timing is
the whole point and which are too infrequent to act as demos). Adjacent
user-side timestamps still pin every normal assistant reply for
relative-time reasoning. The now-redundant identity.md warning is
removed along with the demonstration source.
Resolve the MSTeams stale-reference cleanup conflict by keeping the PR's locked, atomic sidecar-meta implementation and aligning the merged test expectation locally.

Made-with: Cursor
The PR stores ref freshness in the metadata sidecar, so the merged main test should assert updated_at there instead of in the refs payload.

Made-with: Cursor
Preserve main's timestamp/tool-context replay semantics while keeping the PR's session history and file-cap budgets.

Made-with: Cursor
…ed MSTeams session

fix: Automatically clean up unsupported or expired MSTeams session
Move sessionHistoryMaxMessages, sessionHistoryMaxTokens, and
sessionFileMaxMessages out of user-facing config into internal
constants (HISTORY_MAX_MESSAGES=120, FILE_MAX_MESSAGES=2000).

- Remove 3 fields from AgentDefaults and config pipeline
- Sink enforce_file_cap into Session (was AgentLoop)
- Auto-derive token budget from context window (was configurable)
- Net -113 lines across 7 files; 723 tests green

Made-with: Cursor
…s for history lifecycle

feat(session): enforce replay/file-cap invariants for history lifecycle
…solation and allowlist enforcement

fix(discord): full thread support with session isolation and allowlist enforcement
…lback in delivery

Three failure modes addressed:

1. Model reflects HEARTBEAT.md instructions back as output instead of
   executing them ("HEARTBEAT.md has active tasks listed...")
2. Model narrates decision logic ("Best judgment call: stay quiet")
3. Model produces empty output for silence, runner treats it as failure,
   finalization retry generates "couldn't produce a final answer" which
   gets delivered to the user

Changes:
- Add _is_deliverable() pre-filter in HeartbeatService._tick() that catches
  finalization fallback messages and leaked reasoning patterns before they
  reach the evaluator
- Wrap Phase 2 task input with a delivery-awareness preamble telling the
  model its output goes directly to the user's messaging app
- Add meta-reasoning suppression criterion to evaluator template

No changes to agent/loop.py, runner.py, providers, or config schema.
Adds /history [n] to display the last N user/assistant messages from
the current session (default 10, max 50).

- Tool and system messages are filtered out for readability
- Long messages are truncated to 200 characters with an ellipsis
- Multimodal content (image blocks) is collapsed to its text parts
- Invalid count argument returns a usage hint
- /history n uses prefix routing; /history uses exact routing

Also registers /history in build_help_text().
Use a provider capability name that describes user-visible progress delta support instead of the runner implementation detail.

Made-with: Cursor
Merge GroqTranscriptionProvider and OpenAITranscriptionProvider into a
single WhisperTranscriptionProvider that works with any OpenAI-compatible
/v1/audio/transcriptions endpoint.

Key changes:

- **Unified provider**: WhisperTranscriptionProvider handles groq, openai,
  and local (whisper.cpp, faster-whisper, LocalAI) with provider-specific
  defaults for API base URL and model name.

- **Local Whisper support**: provider='local' with configurable api_base
  pointing to any local Whisper server. No API key required.

- **TranscriptionConfig**: Proper Pydantic config model in schema.py with
  validation (language pattern, max_duration_seconds bounds).

- **Graceful failure**: Channels now show user-facing messages when
  transcription is unavailable instead of silently passing raw audio paths.
  Added transcription_available property to BaseChannel.

- **Startup warnings**: ChannelManager logs a warning when voice-capable
  channels are enabled but transcription is not properly configured.

- **Config resolution**: Manager builds TranscriptionConfig by merging the
  new typed block with legacy flat fields and provider-section API keys.

- **Backward compatible**: Old GroqTranscriptionProvider and
  OpenAITranscriptionProvider still importable as thin subclass aliases.
  Legacy flat transcription_provider/transcription_language fields still
  work. Existing configs require zero changes.

- **Duration guard**: Rejects audio files exceeding max_duration_seconds
  (estimated by file size heuristic) before making API calls.

- **Tests**: 35 new tests covering provider defaults, availability checks,
  transcription flow, backward-compat aliases, config validation, channel
  integration, and error paths.
…path so mocks work

The unified transcribe_audio() refactor replaced the provider-specific
instantiation (GroqTranscriptionProvider / OpenAITranscriptionProvider)
with a single WhisperTranscriptionProvider call. This broke tests that
patch the specific classes, because the patches were never reached.

For the legacy flat-attribute path, restore dispatch to the concrete
provider class via the module object so unittest.mock.patch.object stays
effective. The structured _transcription_config path is unaffected and
retains its is_available guard.

Fixes CI failures:
  tests/channels/test_channel_plugins.py::test_base_channel_passes_api_base_to_openai_transcription_provider
  tests/channels/test_channel_plugins.py::test_base_channel_passes_language_to_groq_transcription_provider
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.