fix: three-layer protection against unbounded snapshot growth by bburda · Pull Request #324 · selfpatch/ros2_medkit

bburda · 2026-03-29T12:48:22Z

Summary

Three-layer protection against unbounded snapshot growth (issue #308), plus a fix for the persistent test_operation_handlers SIGABRT.

Snapshot storm fix:

FaultReporter PASSED debounce: should_forward_passed() applies the same threshold/window filtering as for FAILED events (previously, all PASSED events bypassed filtering)
Snapshot storage limit: snapshots.max_per_fault (default 10; reject-new strategy, which keeps the earliest snapshots)
Snapshot recapture cooldown: snapshots.recapture_cooldown_sec (default 60 s; the cooldown entry is evicted when the fault clears, including auto-cleared symptoms)

Action goal teardown fix:

Add OperationManager::shutdown(), called from the GatewayNode destructor, to clear subscriptions/clients while the executor is still running
Fix the test action server to call goal_handle->abort() on shutdown exit; destroying a ServerGoalHandle without finalizing it causes SIGABRT

Issue

closes [BUG] Fault detail endpoint returns unbounded number of snapshots #308

Type

Bug fix
New feature or tests
Breaking change
Documentation only

Testing

9 new tests: PASSED filtering (3), SQLite snapshot limit (3), InMemory snapshot limit (2), cooldown integration (1)
TDD: tests written with stubs first (RED), then implementation (GREEN)
All CI jobs pass including Rolling (previously persistent SIGABRT)

New parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `snapshots.max_per_fault` | 10 | Max snapshots per fault code (0 = unlimited, reject-new strategy) |
| `snapshots.recapture_cooldown_sec` | 60.0 | Min seconds between captures for same fault (0 = disabled) |

Checklist

Breaking changes are clearly described (and announced in docs / changelog if needed)
Tests were added or updated if needed
Docs were updated if behavior or public API changed

Copilot

Pull request overview

This PR aims to prevent unbounded fault snapshot growth by adding rate/volume limiting at multiple layers (fault reporter filtering, storage caps, and fault-manager recapture throttling) to address issue #308.

Changes:

Add PASSED-event tracking/filtering logic to LocalFilter and corresponding unit tests.
Add a per-fault snapshot cap to FaultStorage backends (SQLite + in-memory) and unit tests for SQLite.
Add a snapshot recapture cooldown to FaultManagerNode and new ROS parameters (snapshots.max_per_fault, snapshots.recapture_cooldown_sec).

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| src/ros2_medkit_fault_reporter/test/test_local_filter.cpp | Adds PASSED filtering unit tests for `LocalFilter`. |
| src/ros2_medkit_fault_reporter/src/local_filter.cpp | Implements PASSED filtering state + tracker reset/clear changes. |
| src/ros2_medkit_fault_reporter/include/ros2_medkit_fault_reporter/local_filter.hpp | Exposes `should_forward_passed()` and adds a PASSED tracker map. |
| src/ros2_medkit_fault_manager/test/test_sqlite_storage.cpp | Adds tests validating snapshot cap behavior for SQLite storage. |
| src/ros2_medkit_fault_manager/src/sqlite_fault_storage.cpp | Implements reject-new snapshot limiting for SQLite backend. |
| src/ros2_medkit_fault_manager/src/fault_storage.cpp | Implements reject-new snapshot limiting for in-memory backend. |
| src/ros2_medkit_fault_manager/src/fault_manager_node.cpp | Adds snapshot-related params, applies snapshot cap, adds recapture cooldown logic. |
| src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/sqlite_fault_storage.hpp | Adds snapshot cap setter + member state. |
| src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/fault_storage.hpp | Extends `FaultStorage` interface with snapshot cap setter; adds state to in-memory backend. |
| src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/fault_manager_node.hpp | Adds cooldown config/state (mutex + map) for per-fault capture throttling. |

Copilot · 2026-03-29T12:51:56Z

src/ros2_medkit_fault_reporter/include/ros2_medkit_fault_reporter/local_filter.hpp

+  /// Check if a PASSED event should be forwarded
+  bool should_forward_passed(const std::string & fault_code);
+


LocalFilter::should_forward_passed() is added and tested, but FaultReporter::report_passed() still bypasses local filtering and always forwards PASSED events (see src/ros2_medkit_fault_reporter/src/fault_reporter.cpp). As-is, the new PASSED filtering will never be applied in production; update the PASSED reporting path to call the new method (or reuse should_forward(...)) before sending the request.

Fixed in squash.
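For readers unfamiliar with threshold/window filtering, the idea behind the PASSED debounce discussed in this thread can be sketched as follows. This is a minimal illustration only: the class name `PassedDebounce`, its constructor arguments, the explicit `now` parameter, and the reset-after-forward behavior are assumptions made for the example and are not taken from `LocalFilter`.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_map>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for the PASSED side of a local filter; not the project's code.
class PassedDebounce {
public:
  PassedDebounce(std::size_t threshold, std::chrono::milliseconds window)
  : threshold_(threshold), window_(window) {}

  // Returns true once `threshold` PASSED events for `fault_code` have arrived
  // within `window`. The caller supplies `now`, which keeps the sketch testable.
  bool should_forward_passed(const std::string & fault_code, Clock::time_point now) {
    auto & times = seen_[fault_code];
    times.push_back(now);
    // Drop events that have fallen out of the sliding window.
    while (!times.empty() && now - times.front() > window_) {
      times.pop_front();
    }
    if (times.size() < threshold_) {
      return false;  // Not enough recent events yet: suppress.
    }
    times.clear();  // Assumed behavior: start counting afresh after forwarding.
    return true;
  }

private:
  std::size_t threshold_;
  std::chrono::milliseconds window_;
  std::unordered_map<std::string, std::deque<Clock::time_point>> seen_;
};
```

In a reporting path, the result of such a check would gate the service request, so a single stray OK diagnostic no longer flips the fault state.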

Copilot · 2026-03-29T12:51:57Z

src/ros2_medkit_fault_manager/src/fault_storage.cpp

 void InMemoryFaultStorage::store_snapshot(const SnapshotData & snapshot) {
  std::lock_guard<std::mutex> lock(mutex_);
+  if (max_snapshots_per_fault_ > 0) {
+    size_t count = 0;
+    for (const auto & s : snapshots_) {
+      if (s.fault_code == snapshot.fault_code) {
+        ++count;
+      }
+    }
+    if (count >= max_snapshots_per_fault_) {
+      return;  // Reject new - keep earliest snapshots
+    }
+  }
  snapshots_.push_back(snapshot);
 }


InMemoryFaultStorage::store_snapshot() now scans the entire snapshots_ vector to count per-fault snapshots on every insert. This is O(N) per snapshot and can become a bottleneck even with a small per-fault limit if many faults are present. Consider tracking per-fault snapshot counts in a map (and decrementing on clear/delete) to keep this check O(1).

Acceptable with max_per_fault limit (default 10). Total snapshots bounded by num_faults * 10.
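For reference, the O(1) alternative suggested in the review could look roughly like the sketch below. It was not adopted in this PR. `CountingSnapshotStore`, the trimmed-down `SnapshotData`, the `bool` return value, and `clear_snapshots()` are hypothetical names chosen for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-in; the real SnapshotData is assumed to carry more fields.
struct SnapshotData {
  std::string fault_code;
  std::string payload;
};

// Hypothetical store that tracks per-fault counts instead of scanning the vector.
class CountingSnapshotStore {
public:
  void set_max_snapshots_per_fault(std::size_t max_count) {
    std::lock_guard<std::mutex> lock(mutex_);
    max_snapshots_per_fault_ = max_count;
  }

  // Returns true if the snapshot was stored, false if rejected (reject-new).
  bool store_snapshot(const SnapshotData & snapshot) {
    std::lock_guard<std::mutex> lock(mutex_);
    std::size_t & count = counts_[snapshot.fault_code];
    if (max_snapshots_per_fault_ > 0 && count >= max_snapshots_per_fault_) {
      return false;  // Keep the earliest snapshots.
    }
    snapshots_.push_back(snapshot);
    ++count;
    return true;
  }

  // Removing snapshots must also reset the counter, or the cap would stick.
  void clear_snapshots(const std::string & fault_code) {
    std::lock_guard<std::mutex> lock(mutex_);
    snapshots_.erase(
      std::remove_if(snapshots_.begin(), snapshots_.end(),
        [&](const SnapshotData & s) { return s.fault_code == fault_code; }),
      snapshots_.end());
    counts_.erase(fault_code);
  }

  std::size_t snapshot_count(const std::string & fault_code) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = counts_.find(fault_code);
    return it == counts_.end() ? 0 : it->second;
  }

private:
  mutable std::mutex mutex_;
  std::size_t max_snapshots_per_fault_ = 0;  // 0 = unlimited
  std::vector<SnapshotData> snapshots_;
  std::unordered_map<std::string, std::size_t> counts_;
};
```

The trade-off is an extra map that has to stay in sync with every delete path, which is the bookkeeping the linear scan avoids.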

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

Copilot

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

Copilot · 2026-03-29T19:52:00Z

src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/fault_storage.hpp

+  /// Set maximum snapshots per fault code (0 = unlimited)
+  virtual void set_max_snapshots_per_fault(size_t max_count) = 0;
+


FaultStorage is a public abstract interface, and adding a new pure-virtual method (set_max_snapshots_per_fault) is a source/ABI breaking change for any downstream storage implementations. If backwards compatibility matters, consider providing a default no-op implementation in the base class (non-pure virtual), or bump the relevant version / clearly mark this as a breaking change.

Fixed - changed to virtual with default no-op to avoid breaking downstream.
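The difference between the two options can be shown with a small sketch. The interface below is a hypothetical, trimmed-down stand-in (`StorageSketch`, `LegacyStorage`, and `CappedStorage` are names invented for the example), not the real `FaultStorage`. A backend written before the setter existed still compiles because the base class supplies an empty body.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical interface; the real FaultStorage is assumed to have many more methods.
class StorageSketch {
public:
  virtual ~StorageSketch() = default;
  virtual void store_snapshot(const std::string & fault_code) = 0;
  virtual std::size_t size() const = 0;
  // Non-pure virtual with a default no-op: existing subclasses need no changes.
  virtual void set_max_snapshots_per_fault(std::size_t /*max_count*/) {}
};

// A downstream backend written before the setter was introduced.
class LegacyStorage : public StorageSketch {
public:
  void store_snapshot(const std::string & fault_code) override { codes_.push_back(fault_code); }
  std::size_t size() const override { return codes_.size(); }

private:
  std::vector<std::string> codes_;
};

// A backend that opts in to the cap by overriding the setter.
class CappedStorage : public StorageSketch {
public:
  void set_max_snapshots_per_fault(std::size_t max_count) override { max_ = max_count; }
  void store_snapshot(const std::string & fault_code) override {
    std::size_t & count = counts_[fault_code];
    if (max_ > 0 && count >= max_) {
      return;  // Reject new.
    }
    ++count;
    ++total_;
  }
  std::size_t size() const override { return total_; }

private:
  std::size_t max_ = 0;  // 0 = unlimited
  std::size_t total_ = 0;
  std::unordered_map<std::string, std::size_t> counts_;
};
```

With a pure-virtual setter, `LegacyStorage` would become abstract and fail to instantiate. With the default no-op, it silently ignores the cap, which downstream maintainers should be aware of.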

Fixes unbounded snapshot accumulation when faults cycle between CONFIRMED and CLEARED states (issue #308).

Root cause: diagnostic bridge forwarded every PASSED event without filtering, causing rapid fault status cycling and snapshot capture on each re-confirmation.

Three-layer defense:

1. FaultReporter PASSED debounce: should_forward_passed() in LocalFilter applies same threshold/window filtering as FAILED. Previously report_passed() bypassed all local filtering.
2. Snapshot storage limit: set_max_snapshots_per_fault() on both SQLite and InMemory storage (default 10, reject-new strategy). Keeps earliest snapshots which are most valuable for diagnostics.
3. Snapshot recapture cooldown: snapshots.recapture_cooldown_sec parameter (default 60s) skips capture if same fault_code was captured within the cooldown window. Evicted on fault clear including auto-cleared correlation symptoms.

Also updates diagnostic_bridge integration test to send enough OK diagnostics to meet the PASSED filtering threshold.

Closes #308
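The recapture cooldown described in this commit message can be illustrated with a mutex-guarded map of last-capture times. This is a sketch under assumptions: `RecaptureCooldown`, `try_acquire()`, and `on_fault_cleared()` are hypothetical names, and the caller-supplied `now` is only there to make the example deterministic.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <string>
#include <unordered_map>

using Clock = std::chrono::steady_clock;

// Hypothetical helper; the actual logic lives inside FaultManagerNode.
class RecaptureCooldown {
public:
  explicit RecaptureCooldown(double cooldown_sec) : cooldown_(cooldown_sec) {}

  // Returns true if a capture may proceed now, and records the capture time.
  bool try_acquire(const std::string & fault_code, Clock::time_point now) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (cooldown_.count() <= 0.0) {
      return true;  // 0 = cooldown disabled.
    }
    auto it = last_capture_.find(fault_code);
    if (it != last_capture_.end() && now - it->second < cooldown_) {
      return false;  // Captured too recently: skip.
    }
    last_capture_[fault_code] = now;
    return true;
  }

  // Evict on fault clear so the next confirmation captures immediately.
  void on_fault_cleared(const std::string & fault_code) {
    std::lock_guard<std::mutex> lock(mutex_);
    last_capture_.erase(fault_code);
  }

private:
  std::chrono::duration<double> cooldown_;
  std::mutex mutex_;
  std::unordered_map<std::string, Clock::time_point> last_capture_;
};
```

Evicting the entry on clear also keeps the map from growing with fault codes that are no longer active.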

Add OperationManager::shutdown() called from GatewayNode destructor to clear subscriptions, tracked goals, and service clients while executor can still process pending callbacks safely. Fix test action server to abort unfinished goals on shutdown - destroying rclcpp_action::ServerGoalHandle without calling succeed/cancel/abort triggers "terminate called without an active exception" (SIGABRT). This was the root cause of persistent test_operation_handlers crashes across all distros.

mfaferek93

LGTM!

Copilot AI review requested due to automatic review settings March 29, 2026 12:48

Copilot started reviewing on behalf of bburda March 29, 2026 12:48

Copilot AI reviewed Mar 29, 2026

bburda requested a review from Copilot March 29, 2026 14:50

bburda self-assigned this Mar 29, 2026

bburda requested a review from mfaferek93 March 29, 2026 14:50

Copilot started reviewing on behalf of bburda March 29, 2026 14:51

Copilot AI reviewed Mar 29, 2026

bburda requested a review from Copilot March 29, 2026 19:47

Copilot started reviewing on behalf of bburda March 29, 2026 19:48

Copilot AI reviewed Mar 29, 2026

bburda force-pushed the fix/snapshot-recapture-protection branch from aa74acf to 9340878 Compare March 29, 2026 19:55

bburda added the bug Something isn't working label Mar 29, 2026

mfaferek93 reviewed Mar 29, 2026

bburda force-pushed the fix/snapshot-recapture-protection branch from 9340878 to 7f5bbf3 Compare March 29, 2026 20:16

bburda force-pushed the fix/snapshot-recapture-protection branch from 7f5bbf3 to 3bf7c40 Compare March 29, 2026 20:24

mfaferek93 self-requested a review March 30, 2026 06:35

mfaferek93 approved these changes Mar 30, 2026

bburda merged commit b05ac02 into main Mar 30, 2026
11 checks passed

bburda deleted the fix/snapshot-recapture-protection branch March 30, 2026 06:39
