Skip to content

Laughter/reaction detection channel (YAMNet ONNX, no-torch)#43

Open
nmbrthirteen wants to merge 1 commit into
mainfrom
feat/moment-detection-phase1
Open

Laughter/reaction detection channel (YAMNet ONNX, no-torch)#43
nmbrthirteen wants to merge 1 commit into
mainfrom
feat/moment-detection-phase1

Conversation

@nmbrthirteen

@nmbrthirteen nmbrthirteen commented Jul 5, 2026

Copy link
Copy Markdown
Owner

What

Adds a language-independent reaction detection signal (laughter, cheering, applause, screaming) to clip selection, plus the profile scaffold for generalized moment detection. Full design in plans/moment-detection.md.

Why

Laughter is the strongest "something good happened" signal and works in any language. It anti-correlates with speech at the frame level, which makes it a reliable anchor. This improves podcast clip selection now and lays the foundation for auto-cutting highlights from any content (party videos, action) as separate profiles of one engine.

How

  • backend/services/audio_events.py — YAMNet (AudioSet) as a self-contained ONNX graph via onnxruntime. Raw 16kHz waveform in, log-mel baked into the model → no torch/TF, no librosa, stays on the hermetic native runtime path. Mirrors audio_analyzer.py's extract_* / compute_*_scores / get_*_profile shape. Degrades gracefully if the model or onnxruntime is missing.
  • backend/services/profiles.pyContentProfile scaffold (podcast/party/action): detection-time channel weights + candidate source (llm vs saliency), orthogonal to FormatSpec (render-time aspect ratio).
  • Reactions flow where energy already does: the packed transcript view gains a Laughter & reactions section (visible to the selector), and the CLI heuristic gains a laughter dimension.
  • onnxruntime added to both requirements manifests. Model yamnet.onnx (16MB) committed alongside the existing YuNet ONNX.

Cost

~380x realtime — a 3h video is ~28s of reaction analysis.

Non-regression

Default profile is podcast; reactions are additive signals that never rewrite existing candidates. Everything degrades gracefully when the model is absent.

Tests

  • New tests/test_audio_events.py (10 tests) covering event scoring + profile scaffold.
  • Full suite: 351 passed. The 3 test_ai_fallback failures are pre-existing and environment-dependent (confirmed identical with these changes stashed).

Validated

Swept 32 real clips: correctly ranked the comedic show + 2 chuckle moments above dry technical clips (which scored 0.00). Laughter fires as a sharp 1-2 frame spike anti-correlated with speech, e.g. [64:56] laughter 0.25 near: "and that's the show folks".

Summary by CodeRabbit

  • New Features

    • Added richer moment detection that can factor in laughter, cheering, screaming, and energy signals.
    • Added content profiles to better tailor clip selection for different types of content.
    • Packed transcripts can now include a new “Laughter & reactions” section.
  • Bug Fixes

    • Improved clip suggestions so reaction spikes can boost scoring and surface more engaging moments.

Detect laughter, cheering, applause and screaming as a language-independent
reaction signal for clip selection. YAMNet runs as a self-contained ONNX graph
via onnxruntime (raw 16kHz waveform in, log-mel baked into the model), so it
adds no torch/TF dependency and stays on the hermetic native runtime path.

- audio_events.py: reaction channel mirroring audio_analyzer's shape, degrades
  gracefully when the model or onnxruntime is absent
- profiles.py: ContentProfile scaffold (podcast/party/action) — detection-time
  channel weights + candidate source, orthogonal to FormatSpec
- reactions flow where energy already does: the packed transcript view gains a
  'Laughter & reactions' section (visible to the selector), and the CLI
  heuristic gains a laughter dimension
- ~380x realtime; a 3h video is ~28s of reaction analysis

Podcast selection is unchanged by default: reactions are additive and never
rewrite existing candidates.
@coderabbitai

coderabbitai Bot commented Jul 5, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a new audio event (laughter/reaction) detection service using YAMNet ONNX, along with a content-profile module for weighting detection signals. Integrates reaction scoring into CLI clip suggestion, transcript packing, and main.py handlers, adds onnxruntime dependencies, tests, and a design plan document.

Changes

Audio Event Detection Feature

Layer / File(s) Summary
Audio events detection service
backend/services/audio_events.py
New module extracting audio via ffmpeg, running YAMNet ONNX inference, and computing per-segment reaction scores (laughter, cheering, screaming, speech).
Content profiles module
backend/services/profiles.py
New ContentProfile dataclass, PROFILES mapping (podcast/party/action), DEFAULT_PROFILE, and get_profile() resolver.
Runtime dependency additions
backend/requirements.txt, backend/requirements-runtime.txt
Adds onnxruntime>=1.16.0 with descriptive comments for the audio-event detection path.
Transcript packer laughter/reactions section
backend/services/transcript_packer.py
Adds reaction-reporting constants and extends pack_transcript/write_packed to accept events_data and render a "Laughter & reactions" section.
main.py wiring
backend/main.py
handle_transcribe and handle_pack_transcript extract events_data via extract_audio_events and forward it to write_packed.
CLI clip scoring with reaction signals
backend/cli.py
Computes reaction_scores in Step 2, passes them into _suggest_clips, and adds a laughter/reactions scoring component and reason tag.
Tests
tests/test_audio_events.py
New tests for compute_event_scores and get_profile behavior.
Design plan document
plans/moment-detection.md
New plan describing generalized moment-detection engine, channel weights, and rollout guardrails.

Estimated code review effort: 3 (Moderate) | ~30 minutes

Sequence Diagram(s)

sequenceDiagram
  participant CLI as backend/cli.py
  participant Profile as get_event_profile
  participant Events as extract_audio_events
  participant ONNX as ONNXRuntime YAMNet
  participant Suggest as _suggest_clips

  CLI->>Profile: get_event_profile(video, segments)
  Profile->>Events: extract_audio_events(video)
  Events->>ONNX: run inference on waveform
  ONNX-->>Events: per-frame reaction scores
  Events-->>Profile: events_data
  Profile-->>CLI: reaction_scores
  CLI->>Suggest: _suggest_clips(..., reaction_scores)
  Suggest-->>CLI: scored clips with "laughter" reason tags
Loading
sequenceDiagram
  participant Client
  participant Main as backend/main.py
  participant Events as extract_audio_events
  participant Packer as write_packed

  Client->>Main: handle_pack_transcript(params)
  alt events_data not provided
    Main->>Events: extract_audio_events(file_path)
    Events-->>Main: events_data
  end
  Main->>Packer: write_packed(energy_data, events_data)
  Packer-->>Client: packed markdown with reactions section
Loading
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the new YAMNet ONNX-based laughter/reaction detection channel and the no-torch runtime approach.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/moment-detection-phase1

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@socket-security

Copy link
Copy Markdown

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Addedpypi/​onnxruntime@​1.27.074100100100100

View full report

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (6)
plans/moment-detection.md (1)

49-55: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Drop the OpenCV-DNN fallback from the plan.

This section still treats cv2.dnn as a live zero-dependency path, but the later dependency note and the implemented backend/services/audio_events.py path already commit to onnxruntime after that spike failed. Keeping both reads as contradictory guidance.

♻️ Proposed edit
-  - Runtime: `onnxruntime`. **First test whether `cv2.dnn` (already loads YuNet `.onnx`) can run yamnet.onnx** — if yes, zero new deps.
+  - Runtime: `onnxruntime`; OpenCV-DNN was spiked and rejected for YAMNet.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plans/moment-detection.md` around lines 49 - 55, Remove the OpenCV-DNN
fallback from this plan section and make the runtime path consistently use
onnxruntime for backend/services/audio_events.py. Update the bullets that
mention “first test whether cv2.dnn can run yamnet.onnx” so they no longer imply
a zero-dependency alternative, and keep the guidance aligned with the committed
YAMNet ONNX implementation and backend/models/yamnet.onnx shipping plan.
backend/cli.py (2)

1820-1828: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Confirm reaction threshold calibration vs. energy threshold.

The "laughter" reason fires at max_r > 3 while the sibling "high_energy" reason (line 1817) fires at max_e > 7, on what appears to be a similarly 0-10-scaled signal (per test_score_is_clamped_to_ten in tests/test_audio_events.py). This means laughter tags will surface roughly twice as easily as energy tags for comparable signal strength. If intentional (e.g., reactions are rarer/more decisive signals), fine — otherwise worth aligning the thresholds.

Separately, since this is a new scoring path, consider adding a unit test for _suggest_clips exercising reaction_scores (e.g., asserting the "laughter" reason and score bump appear for a window with a high reaction score), since tests/test_audio_events.py only covers compute_event_scores, not this CLI integration.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/cli.py` around lines 1820 - 1828, The reaction scoring path in
_suggest_clips is using a much lower threshold for the "laughter" reason than
the sibling "high_energy" path, so confirm the intended calibration and either
align the `max_r > 3` check with the `max_e > 7` style threshold or document
that reactions are intentionally more sensitive. If this behavior stays, add a
unit test around `_suggest_clips` covering `reaction_scores` to verify the
"laughter" reason and score bump for a high-reaction window, since existing
coverage in `tests/test_audio_events.py` only exercises compute_event_scores.

662-680: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Reaction detection is silently disabled by --no-energy/fast mode.

Reaction/laughter detection (670-678) is nested inside if config.get("energy_boost", True):, so setting --no-energy (line 479) or --fast (line 495, which also sets energy_boost = False) will skip laughter detection too, even though the two signals are conceptually independent ("audio energy" vs "reaction/laughter"). This may be intentional for fast-mode performance, but the flag name and messaging ("Analyzing audio energy...", "Audio analysis skipped (--no-energy)") don't make this coupling obvious to a user who only wants to skip energy analysis.

Consider decoupling reaction detection from energy_boost, or at least updating the skip message/flag docs to clarify that --no-energy disables all audio-based signals (energy + reactions).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/cli.py` around lines 662 - 680, Reaction/laughter detection is
currently tied to the energy analysis gate in the CLI flow, so `--no-energy` and
`--fast` also disable reactions unexpectedly. Update the logic in
`backend/cli.py` around the audio analysis block so `audio_events_available()` /
`get_event_profile()` are controlled independently from
`config.get("energy_boost", True)`, or else make the `energy_boost` and
`--no-energy` messaging/docs explicitly state that they disable both energy and
reaction signals. Keep the existing `get_energy_profile` path focused on energy
only and use the `audio_events_available` / `reaction_scores` handling to
preserve reaction detection when energy analysis is skipped.
backend/services/transcript_packer.py (2)

337-352: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Duplicate “near” snippet logic between energy peaks and reactions sections.

Lines 346-350 and 371-375 are identical. Extract a shared helper to avoid drift between the two sections.

♻️ Proposed refactor
+def _near_snippet(phrases: list[dict], t: float, max_len: int = 70) -> str:
+    near = _phrase_for_time(phrases, t)
+    if not near:
+        return ""
+    txt = near["text"]
+    return f' near: "{txt[:max_len]}{"…" if len(txt) > max_len else ""}"'
+
 ...
         for e in peaks:
             t = float(e.get("time", 0))
             rms = float(e.get("rms_db", 0))
-            near = _phrase_for_time(phrases, t)
-            snippet = ""
-            if near:
-                txt = near["text"]
-                snippet = f' near: "{txt[:70]}{"…" if len(txt) > 70 else ""}"'
+            snippet = _near_snippet(phrases, t)
             lines.append(f"[{_fmt_ts(t)}] {rms:.1f}dB{snippet}")
 ...
             for e in reactions:
                 t = float(e.get("time", 0))
                 kind = max(("laughter", "cheering", "screaming"), key=lambda k: e.get(k, 0))
-                near = _phrase_for_time(phrases, t)
-                snippet = ""
-                if near:
-                    txt = near["text"]
-                    snippet = f' near: "{txt[:70]}{"…" if len(txt) > 70 else ""}"'
+                snippet = _near_snippet(phrases, t)
                 lines.append(f"[{_fmt_ts(t)}] {kind} {e.get(kind, 0):.2f}{snippet}")

Also applies to: 354-378

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/services/transcript_packer.py` around lines 337 - 352, The “near”
snippet formatting is duplicated between the energy peaks and reactions blocks
in transcript_packer.py, so the two sections can drift apart. Extract the shared
snippet-building logic into a helper near the existing transcript packing
helpers (for example alongside _phrase_for_time and _fmt_ts), then update the
energy peaks loop and the reactions loop to call that helper instead of
repeating the inline text truncation and quoting logic.

354-378: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Missing test coverage for the new “Laughter & reactions” section.

The added test suite covers audio_events.py/profiles.py but not pack_transcript's new rendering path (threshold filter, truncation, kind selection). Consider adding a focused unit test.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/services/transcript_packer.py` around lines 354 - 378, Add focused
unit coverage for pack_transcript’s new “Laughter & reactions” rendering path in
transcript_packer.py. Create a test around pack_transcript that feeds
events_data with mixed laughter/cheering/screaming values so you verify the
threshold filter, sorting/truncation to REACTIONS_TO_REPORT, and that the
displayed kind comes from the max-valued reaction field. Also assert the
rendered section text includes the expected timestamped lines and optional
phrase snippet behavior.
backend/main.py (1)

90-95: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Log the swallowed exception for the new reaction-extraction path.

except Exception: pass (flagged by Ruff S110/BLE001) hides failures in extract_audio_events — e.g. missing onnxruntime, corrupt model — indistinguishably from “no reactions found.” Since this is a brand-new signal, at least logging at debug level would help diagnose silent breakage without affecting the graceful-degradation behavior.

🩹 Proposed fix (apply to both occurrences)
+import logging
+
+logger = logging.getLogger(__name__)
+
         events_data = None
         try:
             from services.audio_events import extract_audio_events
             events_data = extract_audio_events(file_path)
-        except Exception:
-            pass  # reactions are a nice-to-have
+        except Exception as e:
+            logger.debug("audio event extraction skipped: %s", e)

Also applies to: 264-277

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/main.py` around lines 90 - 95, The new reaction-extraction paths
around extract_audio_events are swallowing all exceptions with a bare except,
which hides real failures as if no reactions were found. Update both occurrences
in backend/main.py to catch the exception as a variable and emit a debug-level
log with the exception details and file context, while preserving the existing
graceful fallback behavior by leaving events_data unset/None. Use the
extract_audio_events call sites to locate the two blocks that need the same
change.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/services/audio_events.py`:
- Around line 113-127: `extract_audio_events` in `audio_events.py` runs
`session.run(...)` on the full waveform without any timeout, so long inputs can
block indefinitely. Update the ONNX inference path in `extract_audio_events`
(and related waveform-to-scores flow) to process audio in bounded chunks or
execute inference with a timeout/worker guard, similar to the existing ffmpeg
timeout handling in `_read_waveform_16k_mono`/`proc_run`. Keep the
`session.get_inputs()[0].name` and `session.run` usage, but make the inference
step non-blocking for long-form audio.

---

Nitpick comments:
In `@backend/cli.py`:
- Around line 1820-1828: The reaction scoring path in _suggest_clips is using a
much lower threshold for the "laughter" reason than the sibling "high_energy"
path, so confirm the intended calibration and either align the `max_r > 3` check
with the `max_e > 7` style threshold or document that reactions are
intentionally more sensitive. If this behavior stays, add a unit test around
`_suggest_clips` covering `reaction_scores` to verify the "laughter" reason and
score bump for a high-reaction window, since existing coverage in
`tests/test_audio_events.py` only exercises compute_event_scores.
- Around line 662-680: Reaction/laughter detection is currently tied to the
energy analysis gate in the CLI flow, so `--no-energy` and `--fast` also disable
reactions unexpectedly. Update the logic in `backend/cli.py` around the audio
analysis block so `audio_events_available()` / `get_event_profile()` are
controlled independently from `config.get("energy_boost", True)`, or else make
the `energy_boost` and `--no-energy` messaging/docs explicitly state that they
disable both energy and reaction signals. Keep the existing `get_energy_profile`
path focused on energy only and use the `audio_events_available` /
`reaction_scores` handling to preserve reaction detection when energy analysis
is skipped.

In `@backend/main.py`:
- Around line 90-95: The new reaction-extraction paths around
extract_audio_events are swallowing all exceptions with a bare except, which
hides real failures as if no reactions were found. Update both occurrences in
backend/main.py to catch the exception as a variable and emit a debug-level log
with the exception details and file context, while preserving the existing
graceful fallback behavior by leaving events_data unset/None. Use the
extract_audio_events call sites to locate the two blocks that need the same
change.

In `@backend/services/transcript_packer.py`:
- Around line 337-352: The “near” snippet formatting is duplicated between the
energy peaks and reactions blocks in transcript_packer.py, so the two sections
can drift apart. Extract the shared snippet-building logic into a helper near
the existing transcript packing helpers (for example alongside _phrase_for_time
and _fmt_ts), then update the energy peaks loop and the reactions loop to call
that helper instead of repeating the inline text truncation and quoting logic.
- Around line 354-378: Add focused unit coverage for pack_transcript’s new
“Laughter & reactions” rendering path in transcript_packer.py. Create a test
around pack_transcript that feeds events_data with mixed
laughter/cheering/screaming values so you verify the threshold filter,
sorting/truncation to REACTIONS_TO_REPORT, and that the displayed kind comes
from the max-valued reaction field. Also assert the rendered section text
includes the expected timestamped lines and optional phrase snippet behavior.

In `@plans/moment-detection.md`:
- Around line 49-55: Remove the OpenCV-DNN fallback from this plan section and
make the runtime path consistently use onnxruntime for
backend/services/audio_events.py. Update the bullets that mention “first test
whether cv2.dnn can run yamnet.onnx” so they no longer imply a zero-dependency
alternative, and keep the guidance aligned with the committed YAMNet ONNX
implementation and backend/models/yamnet.onnx shipping plan.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0b30b283-6f36-4dd3-865e-a909896a4942

📥 Commits

Reviewing files that changed from the base of the PR and between f9c6f30 and 48e8298.

⛔ Files ignored due to path filters (1)
  • backend/models/yamnet_class_map.csv is excluded by !**/*.csv
📒 Files selected for processing (10)
  • backend/cli.py
  • backend/main.py
  • backend/models/yamnet.onnx
  • backend/requirements-runtime.txt
  • backend/requirements.txt
  • backend/services/audio_events.py
  • backend/services/profiles.py
  • backend/services/transcript_packer.py
  • plans/moment-detection.md
  • tests/test_audio_events.py

Comment on lines +113 to +127
session = _get_session()
if session is None:
return []

waveform = _read_waveform_16k_mono(video_path)
if waveform is None or waveform.size == 0:
return []

input_name = session.get_inputs()[0].name
scores = session.run(None, {input_name: waveform})[0] # [frames, 521]
n_frames = scores.shape[0]
if n_frames == 0:
return []

duration = waveform.size / 16000.0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

No timeout on the ONNX inference call for potentially long audio.

_read_waveform_16k_mono bounds ffmpeg with a 600s timeout, but the subsequent session.run(...) call over the full waveform (which can be hours long for long-form video) has no bound. On a long video this synchronously blocks whatever thread invoked extract_audio_events, unlike the extraction step. Consider chunking the waveform (YAMNet's hop is small, so windows can be processed independently) or running inference in an executor with a timeout, mirroring the safeguard already used for proc_run.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/services/audio_events.py` around lines 113 - 127,
`extract_audio_events` in `audio_events.py` runs `session.run(...)` on the full
waveform without any timeout, so long inputs can block indefinitely. Update the
ONNX inference path in `extract_audio_events` (and related waveform-to-scores
flow) to process audio in bounded chunks or execute inference with a
timeout/worker guard, similar to the existing ffmpeg timeout handling in
`_read_waveform_16k_mono`/`proc_run`. Keep the `session.get_inputs()[0].name`
and `session.run` usage, but make the inference step non-blocking for long-form
audio.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant