fix(gateway): prevent configurations endpoint deadlock on nodes without parameter service#323
Conversation
There was a problem hiding this comment.
Pull request overview
Improves the gateway’s ROS 2 parameter querying (used by the configurations endpoints) to avoid deadlocks and reduce latency when target nodes don’t provide a parameter service, especially under concurrent requests.
Changes:
- Reworks
ConfigurationManagerto use per-node mutexes, a negative cache for missing parameter services, and thread-local parameter client nodes. - Adds a self-query fast path for list/get operations to avoid IPC when querying the gateway’s own node.
- Adds new unit tests intended to cover negative-cache behavior, self-query behavior, and parallelism.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
src/ros2_medkit_gateway/src/configuration_manager.cpp |
Implements negative caching, per-node locks, thread-local clients, self-query logic, and adjusts default timeout. |
src/ros2_medkit_gateway/include/ros2_medkit_gateway/configuration_manager.hpp |
Updates ConfigurationManager state to support per-node locks and negative cache. |
src/ros2_medkit_gateway/test/test_configuration_manager.cpp |
Adds new tests for negative cache, self-query behavior, and parallel queries. |
Comments suppressed due to low confidence (2)
src/ros2_medkit_gateway/src/configuration_manager.cpp:676
- Same classification issue as
reset_parameter(): ifcache_default_values()couldn’t run because the parameter service is missing, the firstreset_all_parameters()call can returnNO_DEFAULTS_CACHED(404) instead ofSERVICE_UNAVAILABLE(503). Re-check unavailability aftercache_default_values()or have it return a status so the caller can distinguish ‘no defaults cached yet’ from ‘cannot cache because service is unavailable’.
try {
cache_default_values(node_name);
std::vector<rclcpp::Parameter> params_to_reset;
{
std::lock_guard<std::mutex> lock(defaults_mutex_);
auto node_it = default_values_.find(node_name);
if (node_it == default_values_.end()) {
result.success = false;
result.error_message = "No default values cached for node: " + node_name;
result.error_code = ParameterErrorCode::NO_DEFAULTS_CACHED;
return result;
}
src/ros2_medkit_gateway/src/configuration_manager.cpp:601
- If the node has no parameter service, the first
reset_parameter()call can returnNO_DEFAULTS_CACHED(404) becausecache_default_values()returns early after marking the node unavailable, but the function then falls through to the defaults lookup. That misclassifies a service-availability issue as a missing-defaults issue. Consider re-checkingis_node_unavailable(node_name)aftercache_default_values()(or return a status fromcache_default_values) and returningSERVICE_UNAVAILABLEwhen the cache attempt failed due to missing service.
try {
cache_default_values(node_name);
rclcpp::Parameter default_param;
{
std::lock_guard<std::mutex> lock(defaults_mutex_);
auto node_it = default_values_.find(node_name);
if (node_it == default_values_.end()) {
result.success = false;
result.error_message = "No default values cached for node: " + node_name;
result.error_code = ParameterErrorCode::NO_DEFAULTS_CACHED;
return result;
src/ros2_medkit_gateway/include/ros2_medkit_gateway/configuration_manager.hpp
Outdated
Show resolved
Hide resolved
… cleanup, stable tests - Add self-query guard to set_parameter() and reset_parameter() (Copilot review) - Move cache_default_values() after wait_for_service() to avoid double timeout - Add lazy cleanup of expired negative cache entries (caps at 100 before sweep) - Remove timing-based test assertions, use serial vs parallel comparison instead
…est parallelism Address remaining review findings from PR #323 re-review: - Replace thread_local param nodes with ConfigurationManager-owned pool. Nodes are acquired/released via RAII (custom deleter on shared_ptr). shutdown() deterministically destroys all nodes before rclcpp::shutdown(). - Fix parallel test: use different node names for serial vs parallel phases so both phases actually hit parameter service timeouts (not negative cache). - Add lazy cleanup for node_mutexes_ map at >200 entries (removes unlocked mutexes for transient nodes).
…ut parameter service Replace single param_operations_mutex_ with per-node mutexes so one slow node doesn't block parameter queries to other nodes. Add negative cache for nodes whose parameter service is unavailable (skip for 60s instead of blocking on every request). Add self-query guard so the gateway never deadlocks querying its own parameter service via IPC - uses direct node_->get_parameters() instead. Use thread-local param nodes for SyncParametersClient to eliminate executor conflicts without a global mutex. Reduce default parameter_service_timeout_sec from 2.0s to 0.5s. Fixes #318
… cleanup, stable tests - Add self-query guard to set_parameter() and reset_parameter() (Copilot review) - Move cache_default_values() after wait_for_service() to avoid double timeout - Add lazy cleanup of expired negative cache entries (caps at 100 before sweep) - Remove timing-based test assertions, use serial vs parallel comparison instead
…ocument new parameters - Add shutdown() method and destructor to ConfigurationManager that cleans up thread-local ROS 2 param nodes via weak_ptr registry, preventing use-after-free when httplib threads outlive the ROS 2 context - Register each thread-local param node in registry on creation - Document parameter_service_timeout_sec (default 0.5s) and parameter_service_negative_cache_sec (default 60s) in gateway_params.yaml and docs/config/server.rst
…est parallelism Address remaining review findings from PR #323 re-review: - Replace thread_local param nodes with ConfigurationManager-owned pool. Nodes are acquired/released via RAII (custom deleter on shared_ptr). shutdown() deterministically destroys all nodes before rclcpp::shutdown(). - Fix parallel test: use different node names for serial vs parallel phases so both phases actually hit parameter service timeouts (not negative cache). - Add lazy cleanup for node_mutexes_ map at >200 entries (removes unlocked mutexes for transient nodes).
…eset_all, fix clang-format - list_own_parameters() now caches defaults for reset operations (fixes test_set_and_reset_parameter failure) - Add self-query guard to reset_all_parameters() (fixes test_concurrent_parameter_operations_no_executor_error - all 50 ops now go through direct access for own node) - Run clang-format to match CI style
…e default timeout Pool-based param nodes created fresh SyncParametersClient per call, breaking DDS service discovery caching and causing test_17 regression (test_scenario_discovery_manifest). Replace pool with per-target-node entries: each target gets a dedicated param_node + cached SyncParametersClient. Per-node mutex ensures no concurrent spin. Service discovery is cached across calls to the same target. shutdown() clears the map deterministically. Restore default parameter_service_timeout_sec to 2.0s (was 0.5s) - the lower value caused integration test failures. Users with micro-ROS nodes can set 0.5s in their config.
…I stability Per-target param_nodes caused DDS discovery delay on CI (fresh node needs time to join graph). Revert to single param_node_ created in constructor (like original code) for fast service discovery. Add recursive spin_mutex_ to serialize ROS 2 IPC calls (wait_for_service, list_parameters, get_parameters) on the shared param_node_. This prevents executor conflicts while keeping all other improvements: - Negative cache (instant return for nodes without parameter service) - Self-query guard (direct access for gateway's own node) - Cached SyncParametersClient per target node Remove per-node mutexes and node pool (no longer needed with spin_mutex). Replace parallel timing test with concurrent crash/hang test.
…hutdown - Replace recursive_mutex with std::mutex for spin_mutex_ (recursive was unnecessary - cache_default_values is always called with spin_mutex_ held, never re-acquires it) - Narrow spin_mutex_ scope in get_parameter: JSON building outside lock - shutdown() acquires spin_mutex_ before clearing param_node_ to wait for any in-flight IPC to complete (prevents teardown race) - Document @pre spin_mutex_ contract on cache_default_values
6df9308 to
a09588d
Compare
bburda
left a comment
There was a problem hiding this comment.
Review: configurations endpoint deadlock fix
The core approach is solid - self-query guard, negative cache, and narrower spin_mutex_ scope are well-designed solutions to a real production problem. The incremental commit history shows thoughtful iteration through different approaches, arriving at the simplest working solution.
A few things to address before merge:
PR description needs update
The description reflects earlier iterations that were tried and reverted during development:
- "Replace single
param_operations_mutex_with per-node mutexes" - the final implementation uses a singlespin_mutex_(same serialization strategy, narrower scope) - "Thread-local param nodes for
SyncParametersClient" - nothread_localin the final code - "Default
parameter_service_timeout_secreduced from 2.0s to 0.5s" - the default is 2.0s in the final code - "Concurrent requests to different nodes execute in parallel (per-node mutexes)" - they still serialize through
spin_mutex_
GatewayNode shutdown sequence
GatewayNode::~GatewayNode() explicitly calls shutdown() on trigger_mgr_, resource_change_notifier_, plugin_mgr_, etc., but not on config_mgr_. The ConfigurationManager::shutdown() docstring says "Must be called before rclcpp::shutdown()." Since the REST server stops first, the practical risk is low, but for consistency with the other managers it's worth adding an explicit config_mgr_->shutdown() call.
Missing test coverage
The new tests cover the key happy paths, but a few gaps remain:
- Negative cache only tested via
list_parameters()- no test verifying cross-method cache hits (e.g.,list_parametersmarks node unavailable, thenget_parameteron the same node returns from cache instantly) - No test for cache expiry (would need a short-TTL fixture)
- No test for
shutdown()lifecycle (idempotent? safe to call methods after?) - No test for disabled cache (
parameter_service_negative_cache_sec: 0)
Inline comments below with specific findings.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/configuration_manager.hpp
Outdated
Show resolved
Hide resolved
src/ros2_medkit_gateway/include/ros2_medkit_gateway/configuration_manager.hpp
Outdated
Show resolved
Hide resolved
- Fix stale docstrings (thread-local -> shared param_node) - Fix member declaration order (spin_mutex_ before param_node_) - Add parameter range validation via ParameterDescriptor (FloatingPointRange) - Add idempotent shutdown with atomic flag + post-shutdown guard on all public methods - Add hard cap (500) on negative cache with oldest-entry eviction - Fix set_parameter self-query: check has_parameter() before get (404 not 500) - Fix reset_parameter self-query: narrow defaults_mutex_ scope (copy then release) - Fix concurrent test: verify results instead of void cast - Add cross-method negative cache test (list marks unavailable, get returns cached) - Add SHUT_DOWN error code for clean post-shutdown responses
…tructor For consistency with other managers (trigger_mgr_, plugin_mgr_, operation_mgr_), explicitly shut down ConfigurationManager before member destruction begins. Ensures param_node_ and cached clients are cleared while ROS 2 context is still alive.
Move cache_default_values() back before wait_for_service() in list_parameters(), matching the original code order. This gives nodes extra time for DDS service discovery (cache_default_values also calls wait_for_service internally). On slow CI VMs, the additional discovery window prevents test_17 flakiness.
… cleanup, stable tests - Add self-query guard to set_parameter() and reset_parameter() (Copilot review) - Move cache_default_values() after wait_for_service() to avoid double timeout - Add lazy cleanup of expired negative cache entries (caps at 100 before sweep) - Remove timing-based test assertions, use serial vs parallel comparison instead
…est parallelism Address remaining review findings from PR #323 re-review: - Replace thread_local param nodes with ConfigurationManager-owned pool. Nodes are acquired/released via RAII (custom deleter on shared_ptr). shutdown() deterministically destroys all nodes before rclcpp::shutdown(). - Fix parallel test: use different node names for serial vs parallel phases so both phases actually hit parameter service timeouts (not negative cache). - Add lazy cleanup for node_mutexes_ map at >200 entries (removes unlocked mutexes for transient nodes).
Summary
Prevent gateway HTTP deadlock when configurations endpoint queries nodes without ROS 2 parameter service (e.g., micro-ROS bridges, external nodes).
Root cause: ConfigurationManager used a single param_operations_mutex_ held for the entire parameter operation including blocking wait_for_service() calls. One slow/unavailable node blocked all configuration requests. Self-queries (gateway querying its own parameters) caused permanent deadlock.
Solution:
Test plan
Fixes #318