fix(gateway): prevent configurations endpoint deadlock on nodes without parameter service by mfaferek93 · Pull Request #323 · selfpatch/ros2_medkit

mfaferek93 · 2026-03-29T12:29:16Z

Summary

Prevent gateway HTTP deadlock when configurations endpoint queries nodes without ROS 2 parameter service (e.g., micro-ROS bridges, external nodes).

Root cause: ConfigurationManager used a single param_operations_mutex_ held for the entire parameter operation including blocking wait_for_service() calls. One slow/unavailable node blocked all configuration requests. Self-queries (gateway querying its own parameters) caused permanent deadlock.

Solution:

Negative cache - after a node parameter service fails to respond, subsequent requests return instantly with SERVICE_UNAVAILABLE for a configurable TTL (default 60s). Hard cap at 500 entries with oldest-entry eviction.
Self-query guard - gateway uses direct node_->get_parameters() for its own node instead of IPC. Covers all 5 operations (list, get, set, reset, reset_all).
Narrower spin_mutex_ scope - serializes only ROS 2 IPC calls, not JSON building or cache lookups. Single pre-created param_node_ for fast DDS service discovery.
Idempotent shutdown with atomic flag + post-shutdown guard on all public methods
Parameter validation via FloatingPointRange descriptors
Explicit config_mgr_->shutdown() in GatewayNode destructor

Test plan

22 unit tests pass (including new: negative cache, self-query, cross-method cache, concurrent)
58 integration tests pass (100%) including test_scenario_discovery_manifest
Tested on Jetson Orin Nano (aarch64) with ROSMASTER M3 Pro
Self-query returns 200 in 0.4s (was: permanent deadlock)
Micro-ROS nodes return instant SERVICE_UNAVAILABLE (was: 4-6s block + gateway freeze)

Fixes #318

Copilot

Pull request overview

Improves the gateway’s ROS 2 parameter querying (used by the configurations endpoints) to avoid deadlocks and reduce latency when target nodes don’t provide a parameter service, especially under concurrent requests.

Changes:

Reworks ConfigurationManager to use per-node mutexes, a negative cache for missing parameter services, and thread-local parameter client nodes.
Adds a self-query fast path for list/get operations to avoid IPC when querying the gateway’s own node.
Adds new unit tests intended to cover negative-cache behavior, self-query behavior, and parallelism.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

File	Description
`src/ros2_medkit_gateway/src/configuration_manager.cpp`	Implements negative caching, per-node locks, thread-local clients, self-query logic, and adjusts default timeout.
`src/ros2_medkit_gateway/include/ros2_medkit_gateway/configuration_manager.hpp`	Updates `ConfigurationManager` state to support per-node locks and negative cache.
`src/ros2_medkit_gateway/test/test_configuration_manager.cpp`	Adds new tests for negative cache, self-query behavior, and parallel queries.

Comments suppressed due to low confidence (2)

src/ros2_medkit_gateway/src/configuration_manager.cpp:676

Same classification issue as reset_parameter(): if cache_default_values() couldn’t run because the parameter service is missing, the first reset_all_parameters() call can return NO_DEFAULTS_CACHED (404) instead of SERVICE_UNAVAILABLE (503). Re-check unavailability after cache_default_values() or have it return a status so the caller can distinguish ‘no defaults cached yet’ from ‘cannot cache because service is unavailable’.

  try {
    cache_default_values(node_name);

    std::vector<rclcpp::Parameter> params_to_reset;
    {
      std::lock_guard<std::mutex> lock(defaults_mutex_);
      auto node_it = default_values_.find(node_name);
      if (node_it == default_values_.end()) {
        result.success = false;
        result.error_message = "No default values cached for node: " + node_name;
        result.error_code = ParameterErrorCode::NO_DEFAULTS_CACHED;
        return result;
      }

src/ros2_medkit_gateway/src/configuration_manager.cpp:601

If the node has no parameter service, the first reset_parameter() call can return NO_DEFAULTS_CACHED (404) because cache_default_values() returns early after marking the node unavailable, but the function then falls through to the defaults lookup. That misclassifies a service-availability issue as a missing-defaults issue. Consider re-checking is_node_unavailable(node_name) after cache_default_values() (or return a status from cache_default_values) and returning SERVICE_UNAVAILABLE when the cache attempt failed due to missing service.

  try {
    cache_default_values(node_name);

    rclcpp::Parameter default_param;
    {
      std::lock_guard<std::mutex> lock(defaults_mutex_);
      auto node_it = default_values_.find(node_name);
      if (node_it == default_values_.end()) {
        result.success = false;
        result.error_message = "No default values cached for node: " + node_name;
        result.error_code = ParameterErrorCode::NO_DEFAULTS_CACHED;
        return result;

src/ros2_medkit_gateway/src/configuration_manager.cpp

src/ros2_medkit_gateway/test/test_configuration_manager.cpp

src/ros2_medkit_gateway/src/configuration_manager.cpp

src/ros2_medkit_gateway/include/ros2_medkit_gateway/configuration_manager.hpp

src/ros2_medkit_gateway/test/test_configuration_manager.cpp

… cleanup, stable tests - Add self-query guard to set_parameter() and reset_parameter() (Copilot review) - Move cache_default_values() after wait_for_service() to avoid double timeout - Add lazy cleanup of expired negative cache entries (caps at 100 before sweep) - Remove timing-based test assertions, use serial vs parallel comparison instead

…est parallelism Address remaining review findings from PR #323 re-review: - Replace thread_local param nodes with ConfigurationManager-owned pool. Nodes are acquired/released via RAII (custom deleter on shared_ptr). shutdown() deterministically destroys all nodes before rclcpp::shutdown(). - Fix parallel test: use different node names for serial vs parallel phases so both phases actually hit parameter service timeouts (not negative cache). - Add lazy cleanup for node_mutexes_ map at >200 entries (removes unlocked mutexes for transient nodes).

…ut parameter service Replace single param_operations_mutex_ with per-node mutexes so one slow node doesn't block parameter queries to other nodes. Add negative cache for nodes whose parameter service is unavailable (skip for 60s instead of blocking on every request). Add self-query guard so the gateway never deadlocks querying its own parameter service via IPC - uses direct node_->get_parameters() instead. Use thread-local param nodes for SyncParametersClient to eliminate executor conflicts without a global mutex. Reduce default parameter_service_timeout_sec from 2.0s to 0.5s. Fixes #318

… cleanup, stable tests - Add self-query guard to set_parameter() and reset_parameter() (Copilot review) - Move cache_default_values() after wait_for_service() to avoid double timeout - Add lazy cleanup of expired negative cache entries (caps at 100 before sweep) - Remove timing-based test assertions, use serial vs parallel comparison instead

…ocument new parameters - Add shutdown() method and destructor to ConfigurationManager that cleans up thread-local ROS 2 param nodes via weak_ptr registry, preventing use-after-free when httplib threads outlive the ROS 2 context - Register each thread-local param node in registry on creation - Document parameter_service_timeout_sec (default 0.5s) and parameter_service_negative_cache_sec (default 60s) in gateway_params.yaml and docs/config/server.rst

…est parallelism Address remaining review findings from PR #323 re-review: - Replace thread_local param nodes with ConfigurationManager-owned pool. Nodes are acquired/released via RAII (custom deleter on shared_ptr). shutdown() deterministically destroys all nodes before rclcpp::shutdown(). - Fix parallel test: use different node names for serial vs parallel phases so both phases actually hit parameter service timeouts (not negative cache). - Add lazy cleanup for node_mutexes_ map at >200 entries (removes unlocked mutexes for transient nodes).

…eset_all, fix clang-format - list_own_parameters() now caches defaults for reset operations (fixes test_set_and_reset_parameter failure) - Add self-query guard to reset_all_parameters() (fixes test_concurrent_parameter_operations_no_executor_error - all 50 ops now go through direct access for own node) - Run clang-format to match CI style

…e default timeout Pool-based param nodes created fresh SyncParametersClient per call, breaking DDS service discovery caching and causing test_17 regression (test_scenario_discovery_manifest). Replace pool with per-target-node entries: each target gets a dedicated param_node + cached SyncParametersClient. Per-node mutex ensures no concurrent spin. Service discovery is cached across calls to the same target. shutdown() clears the map deterministically. Restore default parameter_service_timeout_sec to 2.0s (was 0.5s) - the lower value caused integration test failures. Users with micro-ROS nodes can set 0.5s in their config.

…I stability Per-target param_nodes caused DDS discovery delay on CI (fresh node needs time to join graph). Revert to single param_node_ created in constructor (like original code) for fast service discovery. Add recursive spin_mutex_ to serialize ROS 2 IPC calls (wait_for_service, list_parameters, get_parameters) on the shared param_node_. This prevents executor conflicts while keeping all other improvements: - Negative cache (instant return for nodes without parameter service) - Self-query guard (direct access for gateway's own node) - Cached SyncParametersClient per target node Remove per-node mutexes and node pool (no longer needed with spin_mutex). Replace parallel timing test with concurrent crash/hang test.

@pre

…hutdown - Replace recursive_mutex with std::mutex for spin_mutex_ (recursive was unnecessary - cache_default_values is always called with spin_mutex_ held, never re-acquires it) - Narrow spin_mutex_ scope in get_parameter: JSON building outside lock - shutdown() acquires spin_mutex_ before clearing param_node_ to wait for any in-flight IPC to complete (prevents teardown race) - Document @pre spin_mutex_ contract on cache_default_values

bburda

Review: configurations endpoint deadlock fix

The core approach is solid - self-query guard, negative cache, and narrower spin_mutex_ scope are well-designed solutions to a real production problem. The incremental commit history shows thoughtful iteration through different approaches, arriving at the simplest working solution.

A few things to address before merge:

PR description needs update

The description reflects earlier iterations that were tried and reverted during development:

"Replace single param_operations_mutex_ with per-node mutexes" - the final implementation uses a single spin_mutex_ (same serialization strategy, narrower scope)
"Thread-local param nodes for SyncParametersClient" - no thread_local in the final code
"Default parameter_service_timeout_sec reduced from 2.0s to 0.5s" - the default is 2.0s in the final code
"Concurrent requests to different nodes execute in parallel (per-node mutexes)" - they still serialize through spin_mutex_

GatewayNode shutdown sequence

GatewayNode::~GatewayNode() explicitly calls shutdown() on trigger_mgr_, resource_change_notifier_, plugin_mgr_, etc., but not on config_mgr_. The ConfigurationManager::shutdown() docstring says "Must be called before rclcpp::shutdown()." Since the REST server stops first, the practical risk is low, but for consistency with the other managers it's worth adding an explicit config_mgr_->shutdown() call.

Missing test coverage

The new tests cover the key happy paths, but a few gaps remain:

Negative cache only tested via list_parameters() - no test verifying cross-method cache hits (e.g., list_parameters marks node unavailable, then get_parameter on the same node returns from cache instantly)
No test for cache expiry (would need a short-TTL fixture)
No test for shutdown() lifecycle (idempotent? safe to call methods after?)
No test for disabled cache (parameter_service_negative_cache_sec: 0)

Inline comments below with specific findings.

src/ros2_medkit_gateway/include/ros2_medkit_gateway/configuration_manager.hpp

src/ros2_medkit_gateway/src/configuration_manager.cpp

src/ros2_medkit_gateway/test/test_configuration_manager.cpp

- Fix stale docstrings (thread-local -> shared param_node) - Fix member declaration order (spin_mutex_ before param_node_) - Add parameter range validation via ParameterDescriptor (FloatingPointRange) - Add idempotent shutdown with atomic flag + post-shutdown guard on all public methods - Add hard cap (500) on negative cache with oldest-entry eviction - Fix set_parameter self-query: check has_parameter() before get (404 not 500) - Fix reset_parameter self-query: narrow defaults_mutex_ scope (copy then release) - Fix concurrent test: verify results instead of void cast - Add cross-method negative cache test (list marks unavailable, get returns cached) - Add SHUT_DOWN error code for clean post-shutdown responses

…tructor For consistency with other managers (trigger_mgr_, plugin_mgr_, operation_mgr_), explicitly shut down ConfigurationManager before member destruction begins. Ensures param_node_ and cached clients are cleared while ROS 2 context is still alive.

Move cache_default_values() back before wait_for_service() in list_parameters(), matching the original code order. This gives nodes extra time for DDS service discovery (cache_default_values also calls wait_for_service internally). On slow CI VMs, the additional discovery window prevents test_17 flakiness.

… cleanup, stable tests - Add self-query guard to set_parameter() and reset_parameter() (Copilot review) - Move cache_default_values() after wait_for_service() to avoid double timeout - Add lazy cleanup of expired negative cache entries (caps at 100 before sweep) - Remove timing-based test assertions, use serial vs parallel comparison instead

…est parallelism Address remaining review findings from PR #323 re-review: - Replace thread_local param nodes with ConfigurationManager-owned pool. Nodes are acquired/released via RAII (custom deleter on shared_ptr). shutdown() deterministically destroys all nodes before rclcpp::shutdown(). - Fix parallel test: use different node names for serial vs parallel phases so both phases actually hit parameter service timeouts (not negative cache). - Add lazy cleanup for node_mutexes_ map at >200 entries (removes unlocked mutexes for transient nodes).

Copilot AI review requested due to automatic review settings March 29, 2026 12:29

Copilot started reviewing on behalf of mfaferek93 March 29, 2026 12:29 View session

Copilot AI reviewed Mar 29, 2026

View reviewed changes

mfaferek93 self-assigned this Mar 29, 2026

bburda self-requested a review March 30, 2026 06:52

mfaferek93 added 8 commits March 30, 2026 09:17

mfaferek93 force-pushed the fix/configurations-deadlock branch from 6df9308 to a09588d Compare March 30, 2026 07:17

bburda reviewed Mar 30, 2026

View reviewed changes

mfaferek93 added 3 commits March 30, 2026 09:44

bburda approved these changes Mar 30, 2026

View reviewed changes

mfaferek93 merged commit 85604be into main Mar 30, 2026
11 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(gateway): prevent configurations endpoint deadlock on nodes without parameter service#323

fix(gateway): prevent configurations endpoint deadlock on nodes without parameter service#323
mfaferek93 merged 11 commits intomainfrom
fix/configurations-deadlock

mfaferek93 commented Mar 29, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bburda left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mfaferek93 commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bburda left a comment

Choose a reason for hiding this comment

Review: configurations endpoint deadlock fix

PR description needs update

GatewayNode shutdown sequence

Missing test coverage

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mfaferek93 commented Mar 29, 2026 •

edited

Loading