Laughter/reaction detection channel (YAMNet ONNX, no-torch)#43
Laughter/reaction detection channel (YAMNet ONNX, no-torch)#43nmbrthirteen wants to merge 1 commit into
Conversation
Detect laughter, cheering, applause and screaming as a language-independent reaction signal for clip selection. YAMNet runs as a self-contained ONNX graph via onnxruntime (raw 16kHz waveform in, log-mel baked into the model), so it adds no torch/TF dependency and stays on the hermetic native runtime path. - audio_events.py: reaction channel mirroring audio_analyzer's shape, degrades gracefully when the model or onnxruntime is absent - profiles.py: ContentProfile scaffold (podcast/party/action) — detection-time channel weights + candidate source, orthogonal to FormatSpec - reactions flow where energy already does: the packed transcript view gains a 'Laughter & reactions' section (visible to the selector), and the CLI heuristic gains a laughter dimension - ~380x realtime; a 3h video is ~28s of reaction analysis Podcast selection is unchanged by default: reactions are additive and never rewrite existing candidates.
📝 WalkthroughWalkthroughAdds a new audio event (laughter/reaction) detection service using YAMNet ONNX, along with a content-profile module for weighting detection signals. Integrates reaction scoring into CLI clip suggestion, transcript packing, and main.py handlers, adds onnxruntime dependencies, tests, and a design plan document. ChangesAudio Event Detection Feature
Estimated code review effort: 3 (Moderate) | ~30 minutes Sequence Diagram(s)sequenceDiagram
participant CLI as backend/cli.py
participant Profile as get_event_profile
participant Events as extract_audio_events
participant ONNX as ONNXRuntime YAMNet
participant Suggest as _suggest_clips
CLI->>Profile: get_event_profile(video, segments)
Profile->>Events: extract_audio_events(video)
Events->>ONNX: run inference on waveform
ONNX-->>Events: per-frame reaction scores
Events-->>Profile: events_data
Profile-->>CLI: reaction_scores
CLI->>Suggest: _suggest_clips(..., reaction_scores)
Suggest-->>CLI: scored clips with "laughter" reason tags
sequenceDiagram
participant Client
participant Main as backend/main.py
participant Events as extract_audio_events
participant Packer as write_packed
Client->>Main: handle_pack_transcript(params)
alt events_data not provided
Main->>Events: extract_audio_events(file_path)
Events-->>Main: events_data
end
Main->>Packer: write_packed(energy_data, events_data)
Packer-->>Client: packed markdown with reactions section
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (6)
plans/moment-detection.md (1)
49-55: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDrop the OpenCV-DNN fallback from the plan.
This section still treats
cv2.dnnas a live zero-dependency path, but the later dependency note and the implementedbackend/services/audio_events.pypath already commit toonnxruntimeafter that spike failed. Keeping both reads as contradictory guidance.♻️ Proposed edit
- - Runtime: `onnxruntime`. **First test whether `cv2.dnn` (already loads YuNet `.onnx`) can run yamnet.onnx** — if yes, zero new deps. + - Runtime: `onnxruntime`; OpenCV-DNN was spiked and rejected for YAMNet.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plans/moment-detection.md` around lines 49 - 55, Remove the OpenCV-DNN fallback from this plan section and make the runtime path consistently use onnxruntime for backend/services/audio_events.py. Update the bullets that mention “first test whether cv2.dnn can run yamnet.onnx” so they no longer imply a zero-dependency alternative, and keep the guidance aligned with the committed YAMNet ONNX implementation and backend/models/yamnet.onnx shipping plan.backend/cli.py (2)
1820-1828: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winConfirm reaction threshold calibration vs. energy threshold.
The "laughter" reason fires at
max_r > 3while the sibling "high_energy" reason (line 1817) fires atmax_e > 7, on what appears to be a similarly 0-10-scaled signal (pertest_score_is_clamped_to_tenin tests/test_audio_events.py). This means laughter tags will surface roughly twice as easily as energy tags for comparable signal strength. If intentional (e.g., reactions are rarer/more decisive signals), fine — otherwise worth aligning the thresholds.Separately, since this is a new scoring path, consider adding a unit test for
_suggest_clipsexercisingreaction_scores(e.g., asserting the "laughter" reason and score bump appear for a window with a high reaction score), since tests/test_audio_events.py only coverscompute_event_scores, not this CLI integration.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/cli.py` around lines 1820 - 1828, The reaction scoring path in _suggest_clips is using a much lower threshold for the "laughter" reason than the sibling "high_energy" path, so confirm the intended calibration and either align the `max_r > 3` check with the `max_e > 7` style threshold or document that reactions are intentionally more sensitive. If this behavior stays, add a unit test around `_suggest_clips` covering `reaction_scores` to verify the "laughter" reason and score bump for a high-reaction window, since existing coverage in `tests/test_audio_events.py` only exercises compute_event_scores.
662-680: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winReaction detection is silently disabled by
--no-energy/fast mode.Reaction/laughter detection (670-678) is nested inside
if config.get("energy_boost", True):, so setting--no-energy(line 479) or--fast(line 495, which also setsenergy_boost = False) will skip laughter detection too, even though the two signals are conceptually independent ("audio energy" vs "reaction/laughter"). This may be intentional for fast-mode performance, but the flag name and messaging ("Analyzing audio energy...", "Audio analysis skipped (--no-energy)") don't make this coupling obvious to a user who only wants to skip energy analysis.Consider decoupling reaction detection from
energy_boost, or at least updating the skip message/flag docs to clarify that--no-energydisables all audio-based signals (energy + reactions).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/cli.py` around lines 662 - 680, Reaction/laughter detection is currently tied to the energy analysis gate in the CLI flow, so `--no-energy` and `--fast` also disable reactions unexpectedly. Update the logic in `backend/cli.py` around the audio analysis block so `audio_events_available()` / `get_event_profile()` are controlled independently from `config.get("energy_boost", True)`, or else make the `energy_boost` and `--no-energy` messaging/docs explicitly state that they disable both energy and reaction signals. Keep the existing `get_energy_profile` path focused on energy only and use the `audio_events_available` / `reaction_scores` handling to preserve reaction detection when energy analysis is skipped.backend/services/transcript_packer.py (2)
337-352: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDuplicate “near” snippet logic between energy peaks and reactions sections.
Lines 346-350 and 371-375 are identical. Extract a shared helper to avoid drift between the two sections.
♻️ Proposed refactor
+def _near_snippet(phrases: list[dict], t: float, max_len: int = 70) -> str: + near = _phrase_for_time(phrases, t) + if not near: + return "" + txt = near["text"] + return f' near: "{txt[:max_len]}{"…" if len(txt) > max_len else ""}"' + ... for e in peaks: t = float(e.get("time", 0)) rms = float(e.get("rms_db", 0)) - near = _phrase_for_time(phrases, t) - snippet = "" - if near: - txt = near["text"] - snippet = f' near: "{txt[:70]}{"…" if len(txt) > 70 else ""}"' + snippet = _near_snippet(phrases, t) lines.append(f"[{_fmt_ts(t)}] {rms:.1f}dB{snippet}") ... for e in reactions: t = float(e.get("time", 0)) kind = max(("laughter", "cheering", "screaming"), key=lambda k: e.get(k, 0)) - near = _phrase_for_time(phrases, t) - snippet = "" - if near: - txt = near["text"] - snippet = f' near: "{txt[:70]}{"…" if len(txt) > 70 else ""}"' + snippet = _near_snippet(phrases, t) lines.append(f"[{_fmt_ts(t)}] {kind} {e.get(kind, 0):.2f}{snippet}")Also applies to: 354-378
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/services/transcript_packer.py` around lines 337 - 352, The “near” snippet formatting is duplicated between the energy peaks and reactions blocks in transcript_packer.py, so the two sections can drift apart. Extract the shared snippet-building logic into a helper near the existing transcript packing helpers (for example alongside _phrase_for_time and _fmt_ts), then update the energy peaks loop and the reactions loop to call that helper instead of repeating the inline text truncation and quoting logic.
354-378: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winMissing test coverage for the new “Laughter & reactions” section.
The added test suite covers
audio_events.py/profiles.pybut notpack_transcript's new rendering path (threshold filter, truncation,kindselection). Consider adding a focused unit test.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/services/transcript_packer.py` around lines 354 - 378, Add focused unit coverage for pack_transcript’s new “Laughter & reactions” rendering path in transcript_packer.py. Create a test around pack_transcript that feeds events_data with mixed laughter/cheering/screaming values so you verify the threshold filter, sorting/truncation to REACTIONS_TO_REPORT, and that the displayed kind comes from the max-valued reaction field. Also assert the rendered section text includes the expected timestamped lines and optional phrase snippet behavior.backend/main.py (1)
90-95: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winLog the swallowed exception for the new reaction-extraction path.
except Exception: pass(flagged by Ruff S110/BLE001) hides failures inextract_audio_events— e.g. missingonnxruntime, corrupt model — indistinguishably from “no reactions found.” Since this is a brand-new signal, at least logging at debug level would help diagnose silent breakage without affecting the graceful-degradation behavior.🩹 Proposed fix (apply to both occurrences)
+import logging + +logger = logging.getLogger(__name__) + events_data = None try: from services.audio_events import extract_audio_events events_data = extract_audio_events(file_path) - except Exception: - pass # reactions are a nice-to-have + except Exception as e: + logger.debug("audio event extraction skipped: %s", e)Also applies to: 264-277
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/main.py` around lines 90 - 95, The new reaction-extraction paths around extract_audio_events are swallowing all exceptions with a bare except, which hides real failures as if no reactions were found. Update both occurrences in backend/main.py to catch the exception as a variable and emit a debug-level log with the exception details and file context, while preserving the existing graceful fallback behavior by leaving events_data unset/None. Use the extract_audio_events call sites to locate the two blocks that need the same change.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/services/audio_events.py`:
- Around line 113-127: `extract_audio_events` in `audio_events.py` runs
`session.run(...)` on the full waveform without any timeout, so long inputs can
block indefinitely. Update the ONNX inference path in `extract_audio_events`
(and related waveform-to-scores flow) to process audio in bounded chunks or
execute inference with a timeout/worker guard, similar to the existing ffmpeg
timeout handling in `_read_waveform_16k_mono`/`proc_run`. Keep the
`session.get_inputs()[0].name` and `session.run` usage, but make the inference
step non-blocking for long-form audio.
---
Nitpick comments:
In `@backend/cli.py`:
- Around line 1820-1828: The reaction scoring path in _suggest_clips is using a
much lower threshold for the "laughter" reason than the sibling "high_energy"
path, so confirm the intended calibration and either align the `max_r > 3` check
with the `max_e > 7` style threshold or document that reactions are
intentionally more sensitive. If this behavior stays, add a unit test around
`_suggest_clips` covering `reaction_scores` to verify the "laughter" reason and
score bump for a high-reaction window, since existing coverage in
`tests/test_audio_events.py` only exercises compute_event_scores.
- Around line 662-680: Reaction/laughter detection is currently tied to the
energy analysis gate in the CLI flow, so `--no-energy` and `--fast` also disable
reactions unexpectedly. Update the logic in `backend/cli.py` around the audio
analysis block so `audio_events_available()` / `get_event_profile()` are
controlled independently from `config.get("energy_boost", True)`, or else make
the `energy_boost` and `--no-energy` messaging/docs explicitly state that they
disable both energy and reaction signals. Keep the existing `get_energy_profile`
path focused on energy only and use the `audio_events_available` /
`reaction_scores` handling to preserve reaction detection when energy analysis
is skipped.
In `@backend/main.py`:
- Around line 90-95: The new reaction-extraction paths around
extract_audio_events are swallowing all exceptions with a bare except, which
hides real failures as if no reactions were found. Update both occurrences in
backend/main.py to catch the exception as a variable and emit a debug-level log
with the exception details and file context, while preserving the existing
graceful fallback behavior by leaving events_data unset/None. Use the
extract_audio_events call sites to locate the two blocks that need the same
change.
In `@backend/services/transcript_packer.py`:
- Around line 337-352: The “near” snippet formatting is duplicated between the
energy peaks and reactions blocks in transcript_packer.py, so the two sections
can drift apart. Extract the shared snippet-building logic into a helper near
the existing transcript packing helpers (for example alongside _phrase_for_time
and _fmt_ts), then update the energy peaks loop and the reactions loop to call
that helper instead of repeating the inline text truncation and quoting logic.
- Around line 354-378: Add focused unit coverage for pack_transcript’s new
“Laughter & reactions” rendering path in transcript_packer.py. Create a test
around pack_transcript that feeds events_data with mixed
laughter/cheering/screaming values so you verify the threshold filter,
sorting/truncation to REACTIONS_TO_REPORT, and that the displayed kind comes
from the max-valued reaction field. Also assert the rendered section text
includes the expected timestamped lines and optional phrase snippet behavior.
In `@plans/moment-detection.md`:
- Around line 49-55: Remove the OpenCV-DNN fallback from this plan section and
make the runtime path consistently use onnxruntime for
backend/services/audio_events.py. Update the bullets that mention “first test
whether cv2.dnn can run yamnet.onnx” so they no longer imply a zero-dependency
alternative, and keep the guidance aligned with the committed YAMNet ONNX
implementation and backend/models/yamnet.onnx shipping plan.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0b30b283-6f36-4dd3-865e-a909896a4942
⛔ Files ignored due to path filters (1)
backend/models/yamnet_class_map.csvis excluded by!**/*.csv
📒 Files selected for processing (10)
backend/cli.pybackend/main.pybackend/models/yamnet.onnxbackend/requirements-runtime.txtbackend/requirements.txtbackend/services/audio_events.pybackend/services/profiles.pybackend/services/transcript_packer.pyplans/moment-detection.mdtests/test_audio_events.py
| session = _get_session() | ||
| if session is None: | ||
| return [] | ||
|
|
||
| waveform = _read_waveform_16k_mono(video_path) | ||
| if waveform is None or waveform.size == 0: | ||
| return [] | ||
|
|
||
| input_name = session.get_inputs()[0].name | ||
| scores = session.run(None, {input_name: waveform})[0] # [frames, 521] | ||
| n_frames = scores.shape[0] | ||
| if n_frames == 0: | ||
| return [] | ||
|
|
||
| duration = waveform.size / 16000.0 |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
No timeout on the ONNX inference call for potentially long audio.
_read_waveform_16k_mono bounds ffmpeg with a 600s timeout, but the subsequent session.run(...) call over the full waveform (which can be hours long for long-form video) has no bound. On a long video this synchronously blocks whatever thread invoked extract_audio_events, unlike the extraction step. Consider chunking the waveform (YAMNet's hop is small, so windows can be processed independently) or running inference in an executor with a timeout, mirroring the safeguard already used for proc_run.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@backend/services/audio_events.py` around lines 113 - 127,
`extract_audio_events` in `audio_events.py` runs `session.run(...)` on the full
waveform without any timeout, so long inputs can block indefinitely. Update the
ONNX inference path in `extract_audio_events` (and related waveform-to-scores
flow) to process audio in bounded chunks or execute inference with a
timeout/worker guard, similar to the existing ffmpeg timeout handling in
`_read_waveform_16k_mono`/`proc_run`. Keep the `session.get_inputs()[0].name`
and `session.run` usage, but make the inference step non-blocking for long-form
audio.
What
Adds a language-independent reaction detection signal (laughter, cheering, applause, screaming) to clip selection, plus the profile scaffold for generalized moment detection. Full design in
plans/moment-detection.md.Why
Laughter is the strongest "something good happened" signal and works in any language. It anti-correlates with speech at the frame level, which makes it a reliable anchor. This improves podcast clip selection now and lays the foundation for auto-cutting highlights from any content (party videos, action) as separate profiles of one engine.
How
backend/services/audio_events.py— YAMNet (AudioSet) as a self-contained ONNX graph via onnxruntime. Raw 16kHz waveform in, log-mel baked into the model → no torch/TF, no librosa, stays on the hermetic native runtime path. Mirrorsaudio_analyzer.py'sextract_* / compute_*_scores / get_*_profileshape. Degrades gracefully if the model or onnxruntime is missing.backend/services/profiles.py—ContentProfilescaffold (podcast/party/action): detection-time channel weights + candidate source (llmvssaliency), orthogonal toFormatSpec(render-time aspect ratio).onnxruntimeadded to both requirements manifests. Modelyamnet.onnx(16MB) committed alongside the existing YuNet ONNX.Cost
~380x realtime — a 3h video is ~28s of reaction analysis.
Non-regression
Default profile is
podcast; reactions are additive signals that never rewrite existing candidates. Everything degrades gracefully when the model is absent.Tests
tests/test_audio_events.py(10 tests) covering event scoring + profile scaffold.test_ai_fallbackfailures are pre-existing and environment-dependent (confirmed identical with these changes stashed).Validated
Swept 32 real clips: correctly ranked the comedic show + 2 chuckle moments above dry technical clips (which scored 0.00). Laughter fires as a sharp 1-2 frame spike anti-correlated with speech, e.g.
[64:56] laughter 0.25 near: "and that's the show folks".Summary by CodeRabbit
New Features
Bug Fixes