Skip to content

fix(gateway): prevent configurations endpoint deadlock on nodes without parameter service#323

Merged
mfaferek93 merged 11 commits intomainfrom
fix/configurations-deadlock
Mar 30, 2026
Merged

fix(gateway): prevent configurations endpoint deadlock on nodes without parameter service#323
mfaferek93 merged 11 commits intomainfrom
fix/configurations-deadlock

Conversation

@mfaferek93
Copy link
Copy Markdown
Collaborator

@mfaferek93 mfaferek93 commented Mar 29, 2026

Summary

Prevent gateway HTTP deadlock when configurations endpoint queries nodes without ROS 2 parameter service (e.g., micro-ROS bridges, external nodes).

Root cause: ConfigurationManager used a single param_operations_mutex_ held for the entire parameter operation including blocking wait_for_service() calls. One slow/unavailable node blocked all configuration requests. Self-queries (gateway querying its own parameters) caused permanent deadlock.

Solution:

  • Negative cache - after a node parameter service fails to respond, subsequent requests return instantly with SERVICE_UNAVAILABLE for a configurable TTL (default 60s). Hard cap at 500 entries with oldest-entry eviction.
  • Self-query guard - gateway uses direct node_->get_parameters() for its own node instead of IPC. Covers all 5 operations (list, get, set, reset, reset_all).
  • Narrower spin_mutex_ scope - serializes only ROS 2 IPC calls, not JSON building or cache lookups. Single pre-created param_node_ for fast DDS service discovery.
  • Idempotent shutdown with atomic flag + post-shutdown guard on all public methods
  • Parameter validation via FloatingPointRange descriptors
  • Explicit config_mgr_->shutdown() in GatewayNode destructor

Test plan

  • 22 unit tests pass (including new: negative cache, self-query, cross-method cache, concurrent)
  • 58 integration tests pass (100%) including test_scenario_discovery_manifest
  • Tested on Jetson Orin Nano (aarch64) with ROSMASTER M3 Pro
  • Self-query returns 200 in 0.4s (was: permanent deadlock)
  • Micro-ROS nodes return instant SERVICE_UNAVAILABLE (was: 4-6s block + gateway freeze)

Fixes #318

Copilot AI review requested due to automatic review settings March 29, 2026 12:29
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves the gateway’s ROS 2 parameter querying (used by the configurations endpoints) to avoid deadlocks and reduce latency when target nodes don’t provide a parameter service, especially under concurrent requests.

Changes:

  • Reworks ConfigurationManager to use per-node mutexes, a negative cache for missing parameter services, and thread-local parameter client nodes.
  • Adds a self-query fast path for list/get operations to avoid IPC when querying the gateway’s own node.
  • Adds new unit tests intended to cover negative-cache behavior, self-query behavior, and parallelism.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

File Description
src/ros2_medkit_gateway/src/configuration_manager.cpp Implements negative caching, per-node locks, thread-local clients, self-query logic, and adjusts default timeout.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/configuration_manager.hpp Updates ConfigurationManager state to support per-node locks and negative cache.
src/ros2_medkit_gateway/test/test_configuration_manager.cpp Adds new tests for negative cache, self-query behavior, and parallel queries.
Comments suppressed due to low confidence (2)

src/ros2_medkit_gateway/src/configuration_manager.cpp:676

  • Same classification issue as reset_parameter(): if cache_default_values() couldn’t run because the parameter service is missing, the first reset_all_parameters() call can return NO_DEFAULTS_CACHED (404) instead of SERVICE_UNAVAILABLE (503). Re-check unavailability after cache_default_values() or have it return a status so the caller can distinguish ‘no defaults cached yet’ from ‘cannot cache because service is unavailable’.
  try {
    cache_default_values(node_name);

    std::vector<rclcpp::Parameter> params_to_reset;
    {
      std::lock_guard<std::mutex> lock(defaults_mutex_);
      auto node_it = default_values_.find(node_name);
      if (node_it == default_values_.end()) {
        result.success = false;
        result.error_message = "No default values cached for node: " + node_name;
        result.error_code = ParameterErrorCode::NO_DEFAULTS_CACHED;
        return result;
      }

src/ros2_medkit_gateway/src/configuration_manager.cpp:601

  • If the node has no parameter service, the first reset_parameter() call can return NO_DEFAULTS_CACHED (404) because cache_default_values() returns early after marking the node unavailable, but the function then falls through to the defaults lookup. That misclassifies a service-availability issue as a missing-defaults issue. Consider re-checking is_node_unavailable(node_name) after cache_default_values() (or return a status from cache_default_values) and returning SERVICE_UNAVAILABLE when the cache attempt failed due to missing service.
  try {
    cache_default_values(node_name);

    rclcpp::Parameter default_param;
    {
      std::lock_guard<std::mutex> lock(defaults_mutex_);
      auto node_it = default_values_.find(node_name);
      if (node_it == default_values_.end()) {
        result.success = false;
        result.error_message = "No default values cached for node: " + node_name;
        result.error_code = ParameterErrorCode::NO_DEFAULTS_CACHED;
        return result;

@mfaferek93 mfaferek93 self-assigned this Mar 29, 2026
mfaferek93 added a commit that referenced this pull request Mar 29, 2026
… cleanup, stable tests

- Add self-query guard to set_parameter() and reset_parameter() (Copilot review)
- Move cache_default_values() after wait_for_service() to avoid double timeout
- Add lazy cleanup of expired negative cache entries (caps at 100 before sweep)
- Remove timing-based test assertions, use serial vs parallel comparison instead
mfaferek93 added a commit that referenced this pull request Mar 29, 2026
…est parallelism

Address remaining review findings from PR #323 re-review:

- Replace thread_local param nodes with ConfigurationManager-owned pool.
  Nodes are acquired/released via RAII (custom deleter on shared_ptr).
  shutdown() deterministically destroys all nodes before rclcpp::shutdown().
- Fix parallel test: use different node names for serial vs parallel phases
  so both phases actually hit parameter service timeouts (not negative cache).
- Add lazy cleanup for node_mutexes_ map at >200 entries (removes unlocked
  mutexes for transient nodes).
@bburda bburda self-requested a review March 30, 2026 06:52
…ut parameter service

Replace single param_operations_mutex_ with per-node mutexes so one slow
node doesn't block parameter queries to other nodes. Add negative cache
for nodes whose parameter service is unavailable (skip for 60s instead of
blocking on every request). Add self-query guard so the gateway never
deadlocks querying its own parameter service via IPC - uses direct
node_->get_parameters() instead.

Use thread-local param nodes for SyncParametersClient to eliminate
executor conflicts without a global mutex. Reduce default
parameter_service_timeout_sec from 2.0s to 0.5s.

Fixes #318
… cleanup, stable tests

- Add self-query guard to set_parameter() and reset_parameter() (Copilot review)
- Move cache_default_values() after wait_for_service() to avoid double timeout
- Add lazy cleanup of expired negative cache entries (caps at 100 before sweep)
- Remove timing-based test assertions, use serial vs parallel comparison instead
…ocument new parameters

- Add shutdown() method and destructor to ConfigurationManager that cleans up
  thread-local ROS 2 param nodes via weak_ptr registry, preventing use-after-free
  when httplib threads outlive the ROS 2 context
- Register each thread-local param node in registry on creation
- Document parameter_service_timeout_sec (default 0.5s) and
  parameter_service_negative_cache_sec (default 60s) in gateway_params.yaml
  and docs/config/server.rst
…est parallelism

Address remaining review findings from PR #323 re-review:

- Replace thread_local param nodes with ConfigurationManager-owned pool.
  Nodes are acquired/released via RAII (custom deleter on shared_ptr).
  shutdown() deterministically destroys all nodes before rclcpp::shutdown().
- Fix parallel test: use different node names for serial vs parallel phases
  so both phases actually hit parameter service timeouts (not negative cache).
- Add lazy cleanup for node_mutexes_ map at >200 entries (removes unlocked
  mutexes for transient nodes).
…eset_all, fix clang-format

- list_own_parameters() now caches defaults for reset operations
  (fixes test_set_and_reset_parameter failure)
- Add self-query guard to reset_all_parameters() (fixes
  test_concurrent_parameter_operations_no_executor_error - all 50 ops
  now go through direct access for own node)
- Run clang-format to match CI style
…e default timeout

Pool-based param nodes created fresh SyncParametersClient per call,
breaking DDS service discovery caching and causing test_17 regression
(test_scenario_discovery_manifest).

Replace pool with per-target-node entries: each target gets a dedicated
param_node + cached SyncParametersClient. Per-node mutex ensures no
concurrent spin. Service discovery is cached across calls to the same
target. shutdown() clears the map deterministically.

Restore default parameter_service_timeout_sec to 2.0s (was 0.5s) -
the lower value caused integration test failures. Users with micro-ROS
nodes can set 0.5s in their config.
…I stability

Per-target param_nodes caused DDS discovery delay on CI (fresh node needs
time to join graph). Revert to single param_node_ created in constructor
(like original code) for fast service discovery.

Add recursive spin_mutex_ to serialize ROS 2 IPC calls (wait_for_service,
list_parameters, get_parameters) on the shared param_node_. This prevents
executor conflicts while keeping all other improvements:
- Negative cache (instant return for nodes without parameter service)
- Self-query guard (direct access for gateway's own node)
- Cached SyncParametersClient per target node

Remove per-node mutexes and node pool (no longer needed with spin_mutex).
Replace parallel timing test with concurrent crash/hang test.
…hutdown

- Replace recursive_mutex with std::mutex for spin_mutex_ (recursive
  was unnecessary - cache_default_values is always called with
  spin_mutex_ held, never re-acquires it)
- Narrow spin_mutex_ scope in get_parameter: JSON building outside lock
- shutdown() acquires spin_mutex_ before clearing param_node_ to wait
  for any in-flight IPC to complete (prevents teardown race)
- Document @pre spin_mutex_ contract on cache_default_values
@mfaferek93 mfaferek93 force-pushed the fix/configurations-deadlock branch from 6df9308 to a09588d Compare March 30, 2026 07:17
Copy link
Copy Markdown
Collaborator

@bburda bburda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: configurations endpoint deadlock fix

The core approach is solid - self-query guard, negative cache, and narrower spin_mutex_ scope are well-designed solutions to a real production problem. The incremental commit history shows thoughtful iteration through different approaches, arriving at the simplest working solution.

A few things to address before merge:

PR description needs update

The description reflects earlier iterations that were tried and reverted during development:

  • "Replace single param_operations_mutex_ with per-node mutexes" - the final implementation uses a single spin_mutex_ (same serialization strategy, narrower scope)
  • "Thread-local param nodes for SyncParametersClient" - no thread_local in the final code
  • "Default parameter_service_timeout_sec reduced from 2.0s to 0.5s" - the default is 2.0s in the final code
  • "Concurrent requests to different nodes execute in parallel (per-node mutexes)" - they still serialize through spin_mutex_

GatewayNode shutdown sequence

GatewayNode::~GatewayNode() explicitly calls shutdown() on trigger_mgr_, resource_change_notifier_, plugin_mgr_, etc., but not on config_mgr_. The ConfigurationManager::shutdown() docstring says "Must be called before rclcpp::shutdown()." Since the REST server stops first, the practical risk is low, but for consistency with the other managers it's worth adding an explicit config_mgr_->shutdown() call.

Missing test coverage

The new tests cover the key happy paths, but a few gaps remain:

  • Negative cache only tested via list_parameters() - no test verifying cross-method cache hits (e.g., list_parameters marks node unavailable, then get_parameter on the same node returns from cache instantly)
  • No test for cache expiry (would need a short-TTL fixture)
  • No test for shutdown() lifecycle (idempotent? safe to call methods after?)
  • No test for disabled cache (parameter_service_negative_cache_sec: 0)

Inline comments below with specific findings.

- Fix stale docstrings (thread-local -> shared param_node)
- Fix member declaration order (spin_mutex_ before param_node_)
- Add parameter range validation via ParameterDescriptor (FloatingPointRange)
- Add idempotent shutdown with atomic flag + post-shutdown guard on all public methods
- Add hard cap (500) on negative cache with oldest-entry eviction
- Fix set_parameter self-query: check has_parameter() before get (404 not 500)
- Fix reset_parameter self-query: narrow defaults_mutex_ scope (copy then release)
- Fix concurrent test: verify results instead of void cast
- Add cross-method negative cache test (list marks unavailable, get returns cached)
- Add SHUT_DOWN error code for clean post-shutdown responses
…tructor

For consistency with other managers (trigger_mgr_, plugin_mgr_,
operation_mgr_), explicitly shut down ConfigurationManager before
member destruction begins. Ensures param_node_ and cached clients
are cleared while ROS 2 context is still alive.
Move cache_default_values() back before wait_for_service() in
list_parameters(), matching the original code order. This gives nodes
extra time for DDS service discovery (cache_default_values also calls
wait_for_service internally). On slow CI VMs, the additional discovery
window prevents test_17 flakiness.
@mfaferek93 mfaferek93 merged commit 85604be into main Mar 30, 2026
11 checks passed
mfaferek93 added a commit that referenced this pull request Mar 30, 2026
… cleanup, stable tests

- Add self-query guard to set_parameter() and reset_parameter() (Copilot review)
- Move cache_default_values() after wait_for_service() to avoid double timeout
- Add lazy cleanup of expired negative cache entries (caps at 100 before sweep)
- Remove timing-based test assertions, use serial vs parallel comparison instead
mfaferek93 added a commit that referenced this pull request Mar 30, 2026
…est parallelism

Address remaining review findings from PR #323 re-review:

- Replace thread_local param nodes with ConfigurationManager-owned pool.
  Nodes are acquired/released via RAII (custom deleter on shared_ptr).
  shutdown() deterministically destroys all nodes before rclcpp::shutdown().
- Fix parallel test: use different node names for serial vs parallel phases
  so both phases actually hit parameter service timeouts (not negative cache).
- Add lazy cleanup for node_mutexes_ map at >200 entries (removes unlocked
  mutexes for transient nodes).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gateway deadlocks when querying configurations on nodes without parameter service

3 participants