fix: three-layer protection against unbounded snapshot growth #324

Merged
bburda merged 2 commits into main from fix/snapshot-recapture-protection
Mar 30, 2026
Collaborator

@bburda bburda commented Mar 29, 2026

Summary

Three-layer protection against unbounded snapshot growth (issue #308), plus a fix for the persistent test_operation_handlers SIGABRT.

Snapshot storm fix:

  • FaultReporter PASSED debounce: should_forward_passed() applies the same threshold/window filtering to PASSED events as to FAILED events (previously, all PASSED events bypassed filtering)
  • Snapshot storage limit: snapshots.max_per_fault (default 10; reject-new strategy, so the earliest snapshots are kept)
  • Snapshot recapture cooldown: snapshots.recapture_cooldown_sec (default 60 s; the cooldown entry is evicted on fault clear, including auto-cleared symptoms)
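
The first layer can be illustrated with a minimal sketch of a threshold/window filter. It assumes that "threshold/window" means a PASSED event is forwarded only once at least `threshold` PASSED events for the same fault code fall inside a sliding time window; the actual LocalFilter semantics may differ. The class name `PassedDebounce` and its constructor arguments are hypothetical, and only `should_forward_passed` is taken from the PR.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for LocalFilter's PASSED path (assumed semantics).
class PassedDebounce {
public:
  using Clock = std::chrono::steady_clock;

  PassedDebounce(std::size_t threshold, std::chrono::milliseconds window)
  : threshold_(threshold), window_(window) {}

  // Records one PASSED event and returns true once at least `threshold`
  // events for `fault_code` lie inside the window ending at `now`.
  bool should_forward_passed(const std::string & fault_code, Clock::time_point now) {
    auto & times = seen_[fault_code];
    times.push_back(now);
    // Drop events that have fallen out of the sliding window.
    while (!times.empty() && now - times.front() > window_) {
      times.pop_front();
    }
    return times.size() >= threshold_;
  }

private:
  std::size_t threshold_;
  std::chrono::milliseconds window_;
  // Per-fault timestamps of recent PASSED events.
  std::unordered_map<std::string, std::deque<Clock::time_point>> seen_;
};
```

Passing the timestamp in, rather than reading the clock inside the filter, keeps the sketch deterministic under test.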

Action goal teardown fix:

  • Add OperationManager::shutdown(), called from the GatewayNode destructor, to clear subscriptions and clients while the executor is still running
  • Fix the test action server to call goal_handle->abort() when it exits on shutdown; destroying a ServerGoalHandle without finalizing it causes SIGABRT

Issue


Type

  • Bug fix
  • New feature or tests
  • Breaking change
  • Documentation only

Testing

  • 9 new tests: PASSED filtering (3), SQLite snapshot limit (3), InMemory snapshot limit (2), cooldown integration (1)
  • TDD: tests written with stubs first (RED), then implementation (GREEN)
  • All CI jobs pass including Rolling (previously persistent SIGABRT)

New parameters

| Parameter | Default | Description |
| --- | --- | --- |
| snapshots.max_per_fault | 10 | Max snapshots per fault code (0 = unlimited, reject-new strategy) |
| snapshots.recapture_cooldown_sec | 60.0 | Min seconds between captures for the same fault (0 = disabled) |
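
The semantics of snapshots.recapture_cooldown_sec can be sketched as a per-fault timestamp map guarded by a mutex. This is an illustration under assumptions, not the FaultManagerNode code: the class `RecaptureCooldown` and its methods `try_acquire` and `evict` are hypothetical names, and the only behavior taken from the PR is the cooldown window, the "0 = disabled" rule, and eviction on fault clear.

```cpp
#include <chrono>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical helper illustrating a per-fault recapture cooldown.
class RecaptureCooldown {
public:
  using Clock = std::chrono::steady_clock;

  explicit RecaptureCooldown(double cooldown_sec) : cooldown_(cooldown_sec) {}

  // Returns true if a capture is allowed at `now` and records it;
  // returns false if the same fault was captured within the cooldown.
  bool try_acquire(const std::string & fault_code, Clock::time_point now) {
    if (cooldown_.count() <= 0.0) {
      return true;  // 0 = cooldown disabled
    }
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = last_capture_.find(fault_code);
    if (it != last_capture_.end() && now - it->second < cooldown_) {
      return false;  // still cooling down, skip this capture
    }
    last_capture_[fault_code] = now;
    return true;
  }

  // Drops the entry for a cleared fault (the PR evicts on fault clear).
  void evict(const std::string & fault_code) {
    std::lock_guard<std::mutex> lock(mutex_);
    last_capture_.erase(fault_code);
  }

private:
  std::chrono::duration<double> cooldown_;
  std::mutex mutex_;
  std::unordered_map<std::string, Clock::time_point> last_capture_;
};
```

A rejected attempt does not refresh the stored timestamp, so the window is measured from the last successful capture.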

Checklist

  • Breaking changes are clearly described (and announced in docs / changelog if needed)
  • Tests were added or updated if needed
  • Docs were updated if behavior or public API changed

Copilot AI review requested due to automatic review settings March 29, 2026 12:48
Contributor

Copilot AI left a comment

Pull request overview

This PR aims to prevent unbounded fault snapshot growth by adding rate/volume limiting at multiple layers (fault reporter filtering, storage caps, and fault-manager recapture throttling) to address issue #308.

Changes:

  • Add PASSED-event tracking/filtering logic to LocalFilter and corresponding unit tests.
  • Add a per-fault snapshot cap to FaultStorage backends (SQLite + in-memory) and unit tests for SQLite.
  • Add a snapshot recapture cooldown to FaultManagerNode and new ROS parameters (snapshots.max_per_fault, snapshots.recapture_cooldown_sec).

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| src/ros2_medkit_fault_reporter/test/test_local_filter.cpp | Adds PASSED filtering unit tests for LocalFilter. |
| src/ros2_medkit_fault_reporter/src/local_filter.cpp | Implements PASSED filtering state + tracker reset/clear changes. |
| src/ros2_medkit_fault_reporter/include/ros2_medkit_fault_reporter/local_filter.hpp | Exposes should_forward_passed() and adds a PASSED tracker map. |
| src/ros2_medkit_fault_manager/test/test_sqlite_storage.cpp | Adds tests validating snapshot cap behavior for SQLite storage. |
| src/ros2_medkit_fault_manager/src/sqlite_fault_storage.cpp | Implements reject-new snapshot limiting for SQLite backend. |
| src/ros2_medkit_fault_manager/src/fault_storage.cpp | Implements reject-new snapshot limiting for in-memory backend. |
| src/ros2_medkit_fault_manager/src/fault_manager_node.cpp | Adds snapshot-related params, applies snapshot cap, adds recapture cooldown logic. |
| src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/sqlite_fault_storage.hpp | Adds snapshot cap setter + member state. |
| src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/fault_storage.hpp | Extends FaultStorage interface with snapshot cap setter; adds state to in-memory backend. |
| src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/fault_manager_node.hpp | Adds cooldown config/state (mutex + map) for per-fault capture throttling. |

Comment on lines +57 to +59
/// Check if a PASSED event should be forwarded
bool should_forward_passed(const std::string & fault_code);

Copilot AI Mar 29, 2026

LocalFilter::should_forward_passed() is added and tested, but FaultReporter::report_passed() still bypasses local filtering and always forwards PASSED events (see src/ros2_medkit_fault_reporter/src/fault_reporter.cpp). As-is, the new PASSED filtering will never be applied in production; update the PASSED reporting path to call the new method (or reuse should_forward(...)) before sending the request.

Collaborator Author

Fixed in squash.

Comment on lines 298 to 312
void InMemoryFaultStorage::store_snapshot(const SnapshotData & snapshot) {
  std::lock_guard<std::mutex> lock(mutex_);
  if (max_snapshots_per_fault_ > 0) {
    size_t count = 0;
    for (const auto & s : snapshots_) {
      if (s.fault_code == snapshot.fault_code) {
        ++count;
      }
    }
    if (count >= max_snapshots_per_fault_) {
      return;  // Reject new - keep earliest snapshots
    }
  }
  snapshots_.push_back(snapshot);
}

Copilot AI Mar 29, 2026

InMemoryFaultStorage::store_snapshot() now scans the entire snapshots_ vector to count per-fault snapshots on every insert. This is O(N) per snapshot and can become a bottleneck even with a small per-fault limit if many faults are present. Consider tracking per-fault snapshot counts in a map (and decrementing on clear/delete) to keep this check O(1).

Collaborator Author

Acceptable given the max_per_fault limit (default 10): total snapshots are bounded by num_faults * 10.
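
For comparison, the count-map alternative suggested in the review could look roughly like the sketch below. It is not what was merged (the linear scan was kept); the `Snapshot` struct and `CountedSnapshotStore` class are hypothetical, and locking is omitted for brevity even though the real storage holds a mutex.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified snapshot record.
struct Snapshot {
  std::string fault_code;
  std::string data;
};

// Reject-new snapshot store that tracks per-fault counts in a map,
// so the limit check is O(1) on average instead of a full scan.
class CountedSnapshotStore {
public:
  explicit CountedSnapshotStore(std::size_t max_per_fault)
  : max_per_fault_(max_per_fault) {}

  // Returns false when the per-fault limit is reached (0 = unlimited).
  bool store(const Snapshot & snapshot) {
    std::size_t & count = counts_[snapshot.fault_code];
    if (max_per_fault_ > 0 && count >= max_per_fault_) {
      return false;  // reject new, keep the earliest snapshots
    }
    snapshots_.push_back(snapshot);
    ++count;
    return true;
  }

  // Removes all snapshots for a fault and resets its counter.
  void clear_fault(const std::string & fault_code) {
    snapshots_.erase(
      std::remove_if(
        snapshots_.begin(), snapshots_.end(),
        [&](const Snapshot & s) { return s.fault_code == fault_code; }),
      snapshots_.end());
    counts_.erase(fault_code);
  }

  std::size_t count(const std::string & fault_code) const {
    auto it = counts_.find(fault_code);
    return it == counts_.end() ? 0 : it->second;
  }

private:
  std::size_t max_per_fault_;
  std::vector<Snapshot> snapshots_;
  std::unordered_map<std::string, std::size_t> counts_;
};
```

The trade-off is extra state that must stay in sync with every delete path, which is one reason a bounded linear scan can be the simpler choice.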

@bburda bburda requested a review from Copilot March 29, 2026 14:50
@bburda bburda self-assigned this Mar 29, 2026
@bburda bburda requested a review from mfaferek93 March 29, 2026 14:50
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

Comment on lines +149 to +151
/// Set maximum snapshots per fault code (0 = unlimited)
virtual void set_max_snapshots_per_fault(size_t max_count) = 0;

Copilot AI Mar 29, 2026

FaultStorage is a public abstract interface, and adding a new pure-virtual method (set_max_snapshots_per_fault) is a source/ABI breaking change for any downstream storage implementations. If backwards compatibility matters, consider providing a default no-op implementation in the base class (non-pure virtual), or bump the relevant version / clearly mark this as a breaking change.

Collaborator Author

Fixed - changed to virtual with default no-op to avoid breaking downstream.
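
The effect of that change on source compatibility can be shown with a reduced example. The classes below are hypothetical stand-ins (the real FaultStorage interface has many more methods); only the method name set_max_snapshots_per_fault comes from the PR.

```cpp
#include <cstddef>

// Simplified stand-in for the storage interface.
class StorageBase {
public:
  virtual ~StorageBase() = default;

  // Non-pure virtual with a default no-op: subclasses written before
  // this method existed still compile and remain instantiable.
  virtual void set_max_snapshots_per_fault(std::size_t /*max_count*/) {}
};

// Downstream implementation that predates the new method.
class LegacyStorage : public StorageBase {};

// Implementation that opts in to the limit.
class CappedStorage : public StorageBase {
public:
  void set_max_snapshots_per_fault(std::size_t max_count) override {
    max_ = max_count;
  }
  std::size_t max() const { return max_; }

private:
  std::size_t max_ = 0;  // 0 = unlimited
};
```

Had the method stayed pure virtual, `LegacyStorage` would be abstract and any code constructing it would fail to compile.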

Fixes unbounded snapshot accumulation when faults cycle between
CONFIRMED and CLEARED states (issue #308). Root cause: diagnostic
bridge forwarded every PASSED event without filtering, causing rapid
fault status cycling and snapshot capture on each re-confirmation.

Three-layer defense:

1. FaultReporter PASSED debounce: should_forward_passed() in
   LocalFilter applies same threshold/window filtering as FAILED.
   Previously report_passed() bypassed all local filtering.

2. Snapshot storage limit: set_max_snapshots_per_fault() on both
   SQLite and InMemory storage (default 10, reject-new strategy).
   Keeps earliest snapshots which are most valuable for diagnostics.

3. Snapshot recapture cooldown: snapshots.recapture_cooldown_sec
   parameter (default 60s) skips capture if same fault_code was
   captured within the cooldown window. Evicted on fault clear
   including auto-cleared correlation symptoms.

Also updates diagnostic_bridge integration test to send enough OK
diagnostics to meet the PASSED filtering threshold.

Closes #308
@bburda bburda force-pushed the fix/snapshot-recapture-protection branch from aa74acf to 9340878 Compare March 29, 2026 19:55
@bburda bburda added the bug Something isn't working label Mar 29, 2026
@bburda bburda force-pushed the fix/snapshot-recapture-protection branch from 9340878 to 7f5bbf3 Compare March 29, 2026 20:16
Add OperationManager::shutdown() called from GatewayNode destructor
to clear subscriptions, tracked goals, and service clients while
executor can still process pending callbacks safely.

Fix test action server to abort unfinished goals on shutdown -
destroying rclcpp_action::ServerGoalHandle without calling
succeed/cancel/abort triggers "terminate called without an active
exception" (SIGABRT). This was the root cause of persistent
test_operation_handlers crashes across all distros.
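
One way to guarantee that a goal is always finalized before its handle is destroyed is a small scope guard, sketched below with a mock handle. This is an assumption-laden illustration, not the test server code from the PR: `MockGoalHandle`, `AbortOnExit`, and `execute` are hypothetical, and the mock's `abort()` takes no arguments, whereas the real rclcpp_action handle expects a result message.

```cpp
#include <string>

// Hypothetical stand-in for an action server goal handle.
struct MockGoalHandle {
  std::string state = "executing";
  void succeed() { state = "succeeded"; }
  void abort() { state = "aborted"; }
};

// Scope guard: aborts the goal on scope exit unless it was finalized.
template <typename Handle>
class AbortOnExit {
public:
  explicit AbortOnExit(Handle & handle) : handle_(handle) {}
  ~AbortOnExit() {
    if (!finalized_) {
      handle_.abort();
    }
  }
  AbortOnExit(const AbortOnExit &) = delete;
  AbortOnExit & operator=(const AbortOnExit &) = delete;

  // Call after succeed/cancel/abort has been invoked explicitly.
  void mark_finalized() { finalized_ = true; }

private:
  Handle & handle_;
  bool finalized_ = false;
};

// Simplified execute callback with an early exit on shutdown.
void execute(MockGoalHandle & handle, bool shutdown_requested) {
  AbortOnExit<MockGoalHandle> guard(handle);
  if (shutdown_requested) {
    return;  // early exit: the guard aborts the unfinished goal
  }
  handle.succeed();
  guard.mark_finalized();
}
```

With the guard in place, every return path out of the callback leaves the goal in a terminal state.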
@bburda bburda force-pushed the fix/snapshot-recapture-protection branch from 7f5bbf3 to 3bf7c40 Compare March 29, 2026 20:24
@mfaferek93 mfaferek93 self-requested a review March 30, 2026 06:35
Collaborator

@mfaferek93 mfaferek93 left a comment

LGTM!

@bburda bburda merged commit b05ac02 into main Mar 30, 2026
11 checks passed
@bburda bburda deleted the fix/snapshot-recapture-protection branch March 30, 2026 06:39
Labels

bug Something isn't working

Development

Successfully merging this pull request may close these issues.

[BUG] Fault detail endpoint returns unbounded number of snapshots
