
AtomS3R long-reply TTS: server chunking + PTT barge-in (+ face tuning, codex PTT preserved)#52

Closed
amariichi wants to merge 9 commits into main from atoms3r-face-audio-tuning

Conversation

@amariichi
Owner

Summary

Fixes the AtomS3R "long agent reply = mouth-only / loud static" problem and tunes the face, on top of the preserved codex first-pass PTT firmware.

  • face_renderer: flatter closed-eye arc, whole-face recentred.
  • TTS audio HTTP reference store (tts_audio_store) for non-browser sinks.
  • Step 1 (server chunking): segmentTtsText splits long utterances into ordered, sentence-bounded chunks played through a real FIFO; text at or under the limit is unchanged. The limit is MH_TTS_CHUNK_MAX_CHARS: the code default is 120, and the operator stack sets 24 so each WAV stays under the Atom HTTP ingress cap (~250 KB); larger WAVs are rejected with HTTP 413.
  • Step 2 (PTT barge-in): an operator ASR upload flushes the chunk queue, interrupts the active chunk, bumps the generation, and clears the audio store.
  • codex PTT firmware preserved (2579dcc, hardware-validated Milestone 6) as its own commit.
  • Step 3 (Atom-side FIFO) reverted (5b0b3ef): it corrupted playback (loud static); restored to the validated single-play path. Server chunking + small budget keep audio clean on hardware (user-confirmed).

Status

  • face-app: full node --test green (338).
  • Firmware: builds, flashed, user-confirmed clean audio on the real AtomS3R.
  • Known follow-ups (non-blocking, tracked in PLANS_48): proper Atom playback queue rework with on-device serial validation; same-agent sequential append vs replace for rapid multi-sentence face_say.

Coordination

codex is paused (~3h) and its in-flight firmware was committed here unchanged; on resume it should pull/rebase since the base advances.

🤖 Generated with Claude Code

amariichi and others added 9 commits May 16, 2026 23:13
Hardware-validated AtomS3R frontend work:

- firmware: AtomS3R has no internal speaker; use Atomic Echo Base
  (cfg.internal_spk=false, external_speaker.atomic_echo=true),
  set speaker volume 130
- face_renderer: success face uses normal eyes + brown raised "\/"
  brows + closed-mouth smile arc (mouth-flap animation, 口パク,
  preserved while speaking); blink is a dark downward-convex eyelid
  arc with randomized 3/4/5 s scheduling (1-in-10 quick 0.3 s
  re-blink); Thinking sweeps the pupils left/right
- headroom_transport: Failed background persists until the next state
  event (like Permission) instead of auto-reverting after 8 s
- scripts: default restart-operator-stack-in-place.sh to
  FACE_AUDIO_TARGET=both so the PC->Atom bridge receives tts_audio;
  add atoms3r-http-bridge.mjs and the stackchan-minimal sidecar

Local commit only; not pushed. ExecPlan PLANS_48 updated on disk
(.agent/ is gitignored by repo design, so it is not in this commit).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Closed "∪" eyelid: larger radius (15->19) with a narrower sweep
  (25..155 -> 45..135) for a gentler, flatter curve, dropped a touch lower.
- Shift the whole composed face down by kFaceOffsetY (4px) so the head
  looks vertically centered despite the visual weight of the hair.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add tts_audio_store: TTL-bounded server-side store of generated TTS
  audio, exposed over HTTP with a lightweight WS reference payload and
  WAV-duration parsing.
- tts_controller: always stash audio and broadcast a reference; only
  broadcast the base64 body when browser audio is enabled (previously
  audio was dropped entirely when browser audio was off), so the
  AtomS3R/Echo Base bridge and Stack-chan sidecar can fetch it by URL.
- index.js: wire the store (MH_TTS_AUDIO_REF_TTL_MS, default 60s) into
  the HTTP router and controller; clear it on worker stop.
- package.json: add stackchan:run and atoms3r:bridge scripts.
- .gitignore: ignore .venv-platformio/ and esp-web-tools-logs.txt.
- Tests for the store (reference metadata, TTL expiry, WAV duration)
  and the controller reference path.
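
A minimal sketch of the store idea described above, not the PR's actual `tts_audio_store` code. The names `createAudioStore`, `put`, `get`, and `wavDurationMs` are hypothetical, and the duration parser assumes a canonical 44-byte PCM WAV header; a real parser would walk the RIFF chunks.

```javascript
// Hypothetical TTL-bounded store; names and shapes are illustrative only.
function createAudioStore({ ttlMs = 60000, now = () => Date.now() } = {}) {
  const entries = new Map(); // id -> { wav, expiresAt }
  let nextId = 1;

  // Drop every entry whose TTL has elapsed.
  function sweep() {
    const t = now();
    for (const [id, entry] of entries) {
      if (entry.expiresAt <= t) entries.delete(id);
    }
  }

  return {
    // Stash the WAV bytes and return a small reference suitable for a WS payload.
    put(wav) {
      sweep();
      const id = String(nextId++);
      entries.set(id, { wav, expiresAt: now() + ttlMs });
      return { id, bytes: wav.length, durationMs: wavDurationMs(wav) };
    },
    // Return the bytes, or null once the entry has expired or been cleared.
    get(id) {
      sweep();
      const entry = entries.get(id);
      return entry ? entry.wav : null;
    },
    clear() {
      entries.clear();
    },
  };
}

// Duration from a canonical 44-byte PCM header (assumption); null otherwise.
function wavDurationMs(wav) {
  if (wav.length < 44) return null;
  const view = new DataView(wav.buffer, wav.byteOffset, wav.byteLength);
  const tag = (off) =>
    String.fromCharCode(wav[off], wav[off + 1], wav[off + 2], wav[off + 3]);
  if (tag(0) !== "RIFF" || tag(8) !== "WAVE" || tag(36) !== "data") return null;
  const byteRate = view.getUint32(28, true); // bytes per second
  const dataBytes = view.getUint32(40, true); // size of the sample data
  return byteRate > 0 ? Math.round((dataBytes / byteRate) * 1000) : null;
}
```

Injecting `now` keeps expiry testable without real timers; the sink only ever sees the small reference object and fetches the bytes by id over HTTP.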

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…fix)

Long agent replies exceeded the AtomS3R firmware's per-utterance
base64/HTTP WAV cap, so the audio was dropped while the mouth kept
animating from the independent tts_mouth stream ("mouth-only, no
sound" on long local-LLM output).

- Add segmentTtsText: JA/EN hard sentence boundaries, late comma soft
  split, greedy packing, default 120 chars (MH_TTS_CHUNK_MAX_CHARS).
  Text <= limit is returned verbatim (unchanged single-chunk path).
- Replace the single `pending` slot with a real ordered FIFO queue;
  one logical utterance occupies it, a newer say flushes the
  remainder, interrupt/auto-interrupt/stop clear the whole queue.
- Each chunk is its own worker `speak` with the parent generation and
  a #k/N utterance/message suffix, dispatched sequentially on
  play_stop, keeping every WAV under the Atom size cap.
- Tests: segmentTtsText units + sequential-dispatch + interrupt-flush;
  full node --test suite green (333).

Step 1 of 3 (Step 2: PTT-clear wiring; Step 3: firmware playback queue).
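
The splitting strategy listed above (hard sentence boundaries, late-comma soft split, greedy packing) can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's `segmentTtsText`: the function name `segmentTtsTextSketch`, the punctuation sets, and counting length in UTF-16 code units are all choices made here.

```javascript
// Assumed punctuation sets; the real implementation may differ.
const HARD = /[^。！？!?.\n]*[。！？!?.\n]+\s*|[^。！？!?.\n]+$/g;
const SOFT = /[、，,]/;

// Break one over-long sentence, preferring the latest comma inside the window.
function splitOversized(piece, maxChars) {
  const out = [];
  let rest = piece;
  while (rest.length > maxChars) {
    let cut = maxChars; // fallback: hard cut at the limit
    for (let i = maxChars - 1; i > 0; i--) {
      if (SOFT.test(rest[i])) {
        cut = i + 1; // keep the comma with the left-hand piece
        break;
      }
    }
    out.push(rest.slice(0, cut));
    rest = rest.slice(cut);
  }
  if (rest) out.push(rest);
  return out;
}

function segmentTtsTextSketch(text, maxChars = 120) {
  // Short text takes the unchanged single-chunk path.
  if (text.length <= maxChars) return [text];

  const sentences = text.match(HARD) ?? [text];
  const pieces = sentences.flatMap((s) => splitOversized(s, maxChars));

  // Greedy packing: keep appending pieces while the chunk stays within budget.
  const chunks = [];
  let current = "";
  const flush = () => {
    const trimmed = current.trim();
    if (trimmed) chunks.push(trimmed);
    current = "";
  };
  for (const piece of pieces) {
    if (current && current.length + piece.length > maxChars) flush();
    current += piece;
  }
  flush();
  return chunks;
}
```

For example, `segmentTtsTextSketch("short")` returns the input verbatim, while a multi-sentence Japanese reply with `maxChars = 10` comes back as several ordered chunks, none longer than 10 characters.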

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the operator takes the turn, queued/active agent speech should
stop instead of talking over the just-spoken input.

- tts_controller: add flushForBargeIn(reason) — clear the chunk FIFO,
  interrupt + release the active chunk, emit play_stop, advance the
  generation so late worker audio/mouth for the old utterance is
  ignored, and clear the audio store so a memory-constrained sink
  cannot pull a stale chunk.
- operator_asr_proxy: new onBargeIn option, invoked as soon as a
  POST /api/operator/asr audio upload arrives — the earliest
  cross-transport "user took the turn" signal (the Atom posts here
  too; it has no usable WebSocket client). Handler errors are caught
  so ASR still proceeds.
- index.js: wire onBargeIn -> ttsController.flushForBargeIn.
- Tests: controller flush behavior + audio-store clear; proxy
  onBargeIn invocation, non-ASR negative, and throw-safety. Full
  node --test green (338).

Step 2 of 3 (Step 3: AtomS3R firmware playback queue + stop-on-PTT).
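
A rough sketch of how the chunk FIFO, the generation counter, and the barge-in flush could fit together. It is not the PR's `tts_controller`: `createChunkController`, `notifyBargeIn`, and the injected `speak` / `interrupt` / `emit` / `audioStore` collaborators are hypothetical stand-ins, and emitting `play_stop` unconditionally on flush is an assumption.

```javascript
// Hypothetical controller; collaborators are injected so the sketch is testable.
function createChunkController({ speak, interrupt, emit, audioStore }) {
  let queue = []; // pending chunks, in order
  let active = null; // chunk currently handed to the worker
  let generation = 0;

  function dispatchNext() {
    active = queue.shift() ?? null;
    if (active) {
      speak(active.text, { generation: active.generation, label: active.label });
    }
  }

  return {
    // A newer say replaces whatever remainder is still queued.
    say(chunks) {
      generation += 1;
      queue = chunks.map((text, i) => ({
        text,
        generation,
        label: `#${i + 1}/${chunks.length}`,
      }));
      if (!active) dispatchNext();
    },
    // Chunks are dispatched one at a time, on play_stop of the active chunk.
    onPlayStop(chunkGeneration) {
      if (!active || chunkGeneration !== active.generation) return; // late event
      dispatchNext();
    },
    // Operator took the turn: drop everything and invalidate late worker output.
    flushForBargeIn(reason) {
      queue = [];
      if (active) {
        interrupt(reason);
        active = null;
      }
      emit("play_stop", { reason });
      generation += 1;
      audioStore.clear();
    },
    get generation() {
      return generation;
    },
  };
}

// Throw-safe hook call, so a failing handler cannot block the ASR request.
function notifyBargeIn(onBargeIn, reason) {
  try {
    onBargeIn?.(reason);
    return true;
  } catch {
    return false;
  }
}
```

Wiring would then look like `notifyBargeIn((r) => controller.flushForBargeIn(r), "operator_asr")` at the point where the ASR upload arrives; after the flush, a late `play_stop` for the old utterance finds no active chunk and is ignored.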

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Preservation commit of codex's in-flight, build-passing,
hardware-validated PTT firmware (Milestone 6 in PLANS_48): button
hold-to-record, M5.Mic capture via the Atomic Echo Base, 16 kHz mono
WAV wrapping, POST to /api/operator/asr?lang=<asrLanguage>, and
operator_response submission over the authenticated HTTP fallback.

Includes persisted asrLanguage (ja/en) with setup-portal selection,
HeadroomTransport::sendOperatorText(), the HeadroomPtt
record/process/submit module, longer ASR HTTP read timeout, and
ingress/settings/config wiring. `pio run` succeeds (RAM 15.6%,
Flash 36.2%); transcript verified on real hardware per PLANS_48.

Committed by Claude to preserve the work while codex is paused; no
behavioral changes were made to codex's firmware in this commit.

Co-Authored-By: Codex (OpenAI) <noreply@openai.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Server-side sentence chunking delivers an ordered burst of small WAV
refs. The firmware played each one via playOwnedWav, which calls
M5.Speaker.stop() first, so a newly arrived chunk truncated the one
still playing (choppy / only the last chunk audible).

- headroom_audio: add a bounded FIFO (kMaxQueued=8). playBase64Wav /
  playHttpWavRef / playWavBytes route through playOrEnqueue: play
  immediately when idle, else enqueue the owned buffer. loop() starts
  the next queued chunk when the speaker goes idle.
- stop()/stopForRecording() (the PTT-press path) clear the queue and
  free buffers, so buffered agent speech does not resume after the
  user takes the turn.
- busy() stays true while chunks are queued so the face holds the
  Speaking expression across inter-chunk gaps instead of flickering.
- headroom_transport: zero mouthOpen when a chunk's audio is dropped,
  killing the phantom mouth flap (口パク) driven by the independent
  tts_mouth stream.

Built (RAM 15.7%, Flash 36.3%) and flashed to the real AtomS3R.

Step 3 of 3. Completes the long-TTS chunking + PTT-clear feature
(Step 1 2d26c1c, Step 2 882889e).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Field bug after the chunking work: long replies were still mouth-only
with a burst of static at the end. The atoms3r bridge log showed the
Atom rejecting ~800 KB WAVs with HTTP 413 payload_too_large; only the
occasional short sentence got through, and a near-cap one played
truncated (the static). The 120-char chunk default was never
calibrated to the Atom HTTP ingress cap (~250 KB accepted in
practice), so any Hermes sentence of 120 chars or fewer was passed
through unsplit.

restart-operator-stack-in-place.sh now exports MH_TTS_CHUNK_MAX_CHARS
(default 24 chars, roughly 3 s of speech or a ~150 KB WAV) into the
stack so the Atom-facing pipeline chunks small enough to be accepted.
The global code default stays 120 for browser/PC. Verified live: a
long utterance now splits into ~110-165 KB WAV chunks, all forwarded
with no 413.
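
The sizing argument above can be checked with back-of-the-envelope arithmetic. The sketch below uses only the rough ratios quoted in this commit message (24 chars ≈ 150 KB, cap ≈ 250 KB); the helper names and the 60% headroom figure are assumptions made for illustration, not values taken from the script.

```javascript
// Rough ratio from the commit message: ~24 chars of text ≈ ~150 KB of WAV.
const BYTES_PER_CHAR = 150000 / 24; // 6250, an approximation only
const ATOM_INGRESS_CAP_BYTES = 250000; // "~250 KB accepted in practice"

// Estimated WAV size for a chunk of the given length.
function estimateWavBytes(charCount) {
  return Math.round(charCount * BYTES_PER_CHAR);
}

// Largest chunk length whose estimated WAV fits within a share of the cap.
// headroomPercent is a hypothetical safety margin, not a documented setting.
function maxCharsForCap(capBytes, headroomPercent = 60) {
  return Math.floor((capBytes * headroomPercent) / 100 / BYTES_PER_CHAR);
}

console.log(estimateWavBytes(120)); // 750000, well over the cap
console.log(estimateWavBytes(24)); // 150000, under the cap
console.log(maxCharsForCap(ATOM_INGRESS_CAP_BYTES)); // 24
```

Under these ratios a 120-char chunk lands near the ~800 KB WAVs seen in the bridge log, while 24 chars leaves a margin below the cap, which is consistent with the ~110-165 KB chunks observed live.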

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Once server-side chunking made WAVs small enough to be accepted (no
more 413), every chunk played as loud static with faint voice. The
regression was introduced solely by the Step 3 FIFO; the validated
Milestone 5 single-play path was clean. Serial showed no firmware
rejection, so the WAV was accepted and "played" but corrupted -
consistent with a buffer-lifetime/scheduling bug around the async
M5.Speaker.playWav in the queue.

Restore headroom_audio.{cpp,h} and headroom_transport.cpp to the
validated 2579dcc state, rebuilt and reflashed. Server-side chunking
(Steps 1/2) and the small MH_TTS_CHUNK_MAX_CHARS budget are kept, so
the Atom receives small WAVs played one-at-a-time by the known-good
path. A correct Atom playback queue is deferred to an isolated rework
with on-device serial validation (see PLANS_48).

This reverts the firmware portion of ed6f4bf only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@amariichi amariichi closed this May 17, 2026
@amariichi amariichi deleted the atoms3r-face-audio-tuning branch May 17, 2026 02:01