fix(mcp): evict pooled connections when tool discovery fails#1499
fix(mcp): evict pooled connections when tool discovery fails#1499MohanKumar21 wants to merge 4 commits into
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdds pool-side tool-discovery state and removal APIs, background idle eviction, inventory simple-name deduplication (HashSet), tenant-typed PoolKey usage, and orchestrator discovery fast-paths with defensive inventory refresh and fail-fast pooled-connection cleanup. ChangesDynamic Server Connection Pooling & Tool Discovery
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request enhances the MCP orchestrator's connection pooling by ensuring tool discovery is completed before returning a pooled connection and implementing a mechanism to remove failed connections from the pool. Review feedback suggests triggering the eviction callback during connection removal to keep the tool inventory synchronized, clearing existing tools before re-discovery to prevent stale entries, and optimizing performance on hot paths using interned strings.
| pub fn remove(&self, key: &PoolKey) -> bool { | ||
| let mut connections = self.connections.lock(); | ||
| if connections.pop(key).is_some() { | ||
| self.connection_count.fetch_sub(1, Ordering::Relaxed); | ||
| true | ||
| } else { | ||
| false | ||
| } |
There was a problem hiding this comment.
The remove method should trigger the eviction_callback when a connection is successfully removed. The McpOrchestrator relies on this callback to clear associated tools from the ToolInventory. Additionally, ensure that long-lived bidirectional streams in this pool implement an idle timeout to prevent resource exhaustion. Finally, since PoolKey is used on hot paths, consider using an interned string type like Arc<str> for its fields to improve performance.
| pub fn remove(&self, key: &PoolKey) -> bool { | |
| let mut connections = self.connections.lock(); | |
| if connections.pop(key).is_some() { | |
| self.connection_count.fetch_sub(1, Ordering::Relaxed); | |
| true | |
| } else { | |
| false | |
| } | |
| pub fn remove(&self, key: &PoolKey) -> bool { | |
| let mut connections = self.connections.lock(); | |
| if connections.pop(key).is_some() { | |
| self.connection_count.fetch_sub(1, Ordering::Relaxed); | |
| if let Some(callback) = &self.eviction_callback { | |
| callback(key); | |
| } | |
| true | |
| } else { | |
| false | |
| } | |
| } |
References
- Implement an idle timeout for long-lived bidirectional streams to prevent inactive clients from indefinitely consuming server-side resources such as tasks and channels.
- For types that are frequently cloned on hot paths and represent a small, repeated set of values (e.g., worker IDs or tenant IDs), use an interned string type like Arc to improve performance.
| // Load tools from the server. | ||
| // If this fails, remove the connection from the pool so the next request | ||
| // retries from scratch instead of perpetually seeing zero tools. | ||
| match client.peer().list_all_tools().await { |
There was a problem hiding this comment.
Consider calling inventory_clone.clear_server_tools(&server_key) before re-listing tools to prevent stale or partial entries. To prevent vulnerabilities from duplicate entries, ensure the inventory uses a HashSet for tool storage. Additionally, ensure tool names are JSON-escaped before interpolation to handle special characters correctly.
inventory_clone.clear_server_tools(&server_key);
match client.peer().list_all_tools().await {References
- To prevent vulnerabilities from duplicate entries, use data structures that inherently enforce uniqueness, such as HashSet, instead of manually deduplicating collections like Vec.
- Tool names should be JSON-escaped before interpolation to correctly handle special characters like quotes, backslashes, and unicode.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/mcp/src/core/orchestrator.rs (1)
1334-1409:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftDon't use URL-scoped inventory to validate tenant/auth-scoped pool entries.
PoolKeyis intentionally keyed byurl + auth_hash + tenant_id, but bothhas_server_tools(&pool_key.url)checks only consult URL-scoped inventory. A second connection to the same URL with different credentials or tenant can therefore skip discovery and inherit whatever tool set another pool entry registered. The guard at Line 1405 makes this happen even when the exactPoolKeywas just created. Reuse needs to be keyed by the same identity as the pooled connection, or this optimization breaks the isolation the pool key is supposed to provide.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/mcp/src/core/orchestrator.rs` around lines 1334 - 1409, The code currently checks tools by URL only (has_server_tools(&pool_key.url)), which breaks PoolKey isolation; update both fast-path and post-get_or_create checks to consult the tool inventory using the full PoolKey identity (url+auth_hash+tenant_id) instead of just the URL: introduce or call an inventory method that accepts the PoolKey (e.g., has_tools_for_pool_key(&pool_key) or has_server_tools_with_auth(&pool_key)) and use that in place of has_server_tools(&pool_key.url) (references: PoolKey, pool_key, has_server_tools, connection_pool.get_or_create, inventory_clone, tool_inventory). Ensure the inventory indexing stores/retrieves tools keyed by the same PoolKey fields (auth_hash and tenant_id included) so pooled connections with different credentials/tenants do not share tool lists.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/mcp/src/core/orchestrator.rs`:
- Around line 1334-1409: The code currently checks tools by URL only
(has_server_tools(&pool_key.url)), which breaks PoolKey isolation; update both
fast-path and post-get_or_create checks to consult the tool inventory using the
full PoolKey identity (url+auth_hash+tenant_id) instead of just the URL:
introduce or call an inventory method that accepts the PoolKey (e.g.,
has_tools_for_pool_key(&pool_key) or has_server_tools_with_auth(&pool_key)) and
use that in place of has_server_tools(&pool_key.url) (references: PoolKey,
pool_key, has_server_tools, connection_pool.get_or_create, inventory_clone,
tool_inventory). Ensure the inventory indexing stores/retrieves tools keyed by
the same PoolKey fields (auth_hash and tenant_id included) so pooled connections
with different credentials/tenants do not share tool lists.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: e2c27fe5-dd69-4d0d-9e1d-df37db989b58
📒 Files selected for processing (3)
crates/mcp/src/core/orchestrator.rscrates/mcp/src/core/pool.rscrates/mcp/src/inventory/index.rs
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 00ff20dc35
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
00ff20d to
6279b39
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/mcp/src/core/orchestrator.rs`:
- Around line 1334-1345: The fast path in discover_or_get_server_url (checking
self.connection_pool.tool_discovery_completed(&pool_key) &&
self.tool_inventory.has_server_tools(&pool_key.url)) causes servers that
legitimately have zero tools to always fall through and re-run list_all_tools()
on each request; to fix, add and use a cached "discovered_empty" state (or a
discovered_tools_count option) keyed by pool_key.url in ToolInventory or
ConnectionPool and set it when list_all_tools() returns zero, check that flag
alongside tool_discovery_completed (e.g., treat discovered_empty as a valid
discovered state), and add a TTL or eviction policy so the discovered-empty
marker is refreshed periodically to avoid permanent staleness while preventing
repeated discovery calls for intentionally empty servers (refer to
tool_discovery_completed, has_server_tools, list_all_tools, and the
pool_key.url/ToolInventory APIs).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4364485e-4db5-4a1a-b4ae-576ff2dca80f
📒 Files selected for processing (3)
crates/mcp/src/core/orchestrator.rscrates/mcp/src/core/pool.rscrates/mcp/src/inventory/index.rs
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6279b39205
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if self.connection_pool.tool_discovery_completed(&pool_key) | ||
| && self.tool_inventory.has_server_tools(&pool_key.url) | ||
| { |
There was a problem hiding this comment.
Allow fast path after successful empty tool discovery
mark_tool_discovery_completed is documented to treat empty list_all_tools results as successful discovery, but this fast path also requires has_server_tools(url), so zero-tool servers can never hit the cache path and will re-run discovery on every connect. In environments where a dynamic MCP endpoint intentionally exposes no tools (or temporarily returns none), this causes repeated network calls and avoidable ConnectionFailed errors from transient list_all_tools failures despite a previously successful discovery.
Useful? React with 👍 / 👎.
| "Failed to list tools from '{}': {}; removing pooled connection", | ||
| server_key, e | ||
| ); | ||
| self.connection_pool.remove(&pool_key); |
There was a problem hiding this comment.
Avoid evicting pooled client on concurrent discovery error
Concurrent callers can share the same pooled client and race through list_all_tools because discovery is not serialized per pool key; if one caller succeeds and another transiently fails, this unconditional remove evicts the healthy connection and triggers URL-level inventory clearing, undoing the successful discovery and causing temporary tool loss/flaky failures. The eviction should be conditional on discovery still being incomplete (or protected by a per-key discovery lock) before clearing shared state.
Useful? React with 👍 / 👎.
Signed-off-by: MohanKumar21! <mohanmrm20@gmail.com>
Signed-off-by: MohanKumar21! <mohanmrm20@gmail.com>
3b75491 to
3c2b4a7
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3c2b4a7866
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| let count = evicted.len(); | ||
| if count > 0 { | ||
| self.connection_count.fetch_sub(count, Ordering::Relaxed); | ||
| if let Some(callback) = &self.eviction_callback { |
There was a problem hiding this comment.
Guard idle-eviction counter update against concurrent clear
evict_idle decrements connection_count after releasing the pool lock, so a concurrent clear() can reset the counter to 0 between removal and fetch_sub(count). In that race, fetch_sub underflows the usize counter and len()/stats() report a huge bogus value. Keep the decrement synchronized with the removal (or use saturating/CAS logic) to avoid counter corruption during shutdown or admin clears.
Useful? React with 👍 / 👎.
3c2b4a7 to
1a6281e
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1a6281e5e9
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| && (self.tool_inventory.has_server_tools(&pool_key.url) | ||
| || self | ||
| .connection_pool | ||
| .discovery_fresh_within(&pool_key, EMPTY_DISCOVERY_TTL)) |
There was a problem hiding this comment.
Don’t bypass empty inventory on fresh discovery timestamps
The new fast-path condition returns early when tool_discovery_completed is true and discovery_fresh_within(...) passes, even if has_server_tools(url) is false. That reintroduces a tool-loss window: if a sibling tenant eviction clears URL-scoped inventory, requests for this pooled entry can skip re-discovery for up to EMPTY_DISCOVERY_TTL (60s) and serve with no MCP tools. Fresh evidence in this revision is the added || discovery_fresh_within(&pool_key, EMPTY_DISCOVERY_TTL) branch, which explicitly allows bypassing the inventory check.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/mcp/src/core/pool.rs`:
- Around line 323-335: remove() currently invokes eviction_callback while
holding connections.lock(), risking deadlock; change it to detect and remove the
connection while holding the lock, record whether a callback exists (and the
key) into local variables, then drop the lock and call the callback afterwards.
Specifically, in the pub fn remove(&self, key: &PoolKey) -> bool function: use
connections.lock() only to call connections.pop(key) and update
connection_count.fetch_sub, capture a cloned/owned reference to
self.eviction_callback (or a boolean flag) and the key if present, release the
mutex, and then invoke the callback outside the lock (matching how evict_idle
does it).
In `@crates/mcp/src/inventory/index.rs`:
- Around line 332-342: The current has_server_tools(&self, server_key: &str) is
too coarse because pooled connections are keyed by (url, auth_hash, tenant_id);
update the inventory lookup to be tenant/auth-aware by changing has_server_tools
to accept the pool identity (e.g. auth_hash and tenant_id or a composite
pool_key) and check tools_by_server against that pool-scoped key (or add a new
tools_by_pool map keyed by the composite). Specifically, modify the
has_server_tools signature and all callers to pass the pool-scoped key, and
ensure the method uses tools_by_server (or the new tools_by_pool) to return true
only when the exact pool-scoped entry exists and is non-empty.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 34ed448f-3bf0-43a2-b560-21176bdf6f0c
📒 Files selected for processing (3)
crates/mcp/src/core/orchestrator.rscrates/mcp/src/core/pool.rscrates/mcp/src/inventory/index.rs
Signed-off-by: MohanKumar21! <mohanmrm20@gmail.com>
1a6281e to
9b12527
Compare
Signed-off-by: MohanKumar21! <mohanmrm20@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/mcp/src/core/orchestrator.rs`:
- Around line 1483-1493: The orchestrator is unconditionally calling
metrics.record_connection_opened() for connections returned by
connection_pool.get_or_create(), causing over-counting when cached
Arc<McpClient> instances are reused; change get_or_create (in pool.rs) to return
an indicator of whether the connection was newly created (e.g., return
(Arc<McpClient>, bool is_new) or provide a separate creation-aware API) and
update the orchestrator call sites that currently call
record_connection_opened() (the branches that check
connection_pool.tool_discovery_completed(&pool_key) and subsequent
early-returns) to only call record_connection_opened() when is_new is true, or
alternatively move the metrics increment into the creation code path inside
connection_pool.get_or_create so cached hits do not increment
active_connections.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4ae22887-b2b0-446f-9c59-699119be2294
📒 Files selected for processing (3)
crates/mcp/src/core/orchestrator.rscrates/mcp/src/core/pool.rscrates/mcp/src/inventory/index.rs
| // Another caller may have completed discovery while we were connecting. | ||
| // Same combined check as the outer fast path — see note above. | ||
| if self.connection_pool.tool_discovery_completed(&pool_key) | ||
| && (inventory_clone.has_server_tools(&pool_key) | ||
| || self | ||
| .connection_pool | ||
| .discovery_fresh_within(&pool_key, EMPTY_DISCOVERY_TTL)) | ||
| { | ||
| self.metrics.record_connection_opened(); | ||
| return Ok(server_key); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if get_or_create returns information about whether it was a cache hit
# and how metrics are expected to be recorded
rg -n -A 10 'fn get_or_create' crates/mcp/src/core/pool.rsRepository: lightseekorg/smg
Length of output: 437
🏁 Script executed:
#!/bin/bash
# Get the full get_or_create implementation
cat -n crates/mcp/src/core/pool.rs | sed -n '172,250p'Repository: lightseekorg/smg
Length of output: 3150
🏁 Script executed:
#!/bin/bash
# Get the full context around line 1491 and 1542 in orchestrator.rs
wc -l crates/mcp/src/core/orchestrator.rsRepository: lightseekorg/smg
Length of output: 101
🏁 Script executed:
#!/bin/bash
# Find where record_connection_opened is defined and called
rg -n 'record_connection_opened' crates/mcp/src/Repository: lightseekorg/smg
Length of output: 672
🏁 Script executed:
#!/bin/bash
# Get context around line 1491 and 1542
cat -n crates/mcp/src/core/orchestrator.rs | sed -n '1420,1555p'Repository: lightseekorg/smg
Length of output: 7254
🏁 Script executed:
#!/bin/bash
# Check the metrics definition
cat -n crates/mcp/src/core/metrics.rs | sed -n '93,110p'Repository: lightseekorg/smg
Length of output: 1025
🏁 Script executed:
#!/bin/bash
# Search for record_connection_closed to understand the full lifecycle
rg -n -B 2 -A 2 'record_connection_closed' crates/mcp/src/core/orchestrator.rsRepository: lightseekorg/smg
Length of output: 226
Metrics may over-count when cached connections trigger record_connection_opened().
get_or_create returns an Arc<McpClient> without indicating whether it was a cache hit (line 186 in pool.rs) or a newly created connection (lines 190–210). The orchestrator then calls record_connection_opened() at lines 1491, 1528, and 1542 regardless of whether the connection was new or cached. Since record_connection_opened() increments active_connections (fetch_add), a reused cached connection will increment the counter multiple times across different requests, creating an accounting mismatch with record_connection_closed() which is only called once during shutdown.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/mcp/src/core/orchestrator.rs` around lines 1483 - 1493, The
orchestrator is unconditionally calling metrics.record_connection_opened() for
connections returned by connection_pool.get_or_create(), causing over-counting
when cached Arc<McpClient> instances are reused; change get_or_create (in
pool.rs) to return an indicator of whether the connection was newly created
(e.g., return (Arc<McpClient>, bool is_new) or provide a separate creation-aware
API) and update the orchestrator call sites that currently call
record_connection_opened() (the branches that check
connection_pool.tool_discovery_completed(&pool_key) and subsequent
early-returns) to only call record_connection_opened() when is_new is true, or
alternatively move the metrics increment into the creation code path inside
connection_pool.get_or_create so cached hits do not increment
active_connections.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 08c3731f55
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
|
||
| /// Clear all cached items for a server. Uses server index for O(tools_per_server) removal. | ||
| pub fn clear_server_tools(&self, server_key: &str) { | ||
| self.tools_by_pool.retain(|key, _| key.url != server_key); |
There was a problem hiding this comment.
Preserve sibling pool tool markers during URL clears
clear_server_tools now removes tools_by_pool entries for all PoolKeys that share a URL. Since connect_dynamic_server_with_tenant calls this on every successful discovery and then repopulates only the current pool key, discovery by tenant/auth variant A erases variant B’s marker; B then misses the fast path and re-runs list_all_tools on its next request. For multi-tenant traffic on one MCP URL, this introduces repeated rediscovery/network churn and makes tool availability flaky whenever one of those extra discovery calls transiently fails.
Useful? React with 👍 / 👎.
Description
Problem
E2E MCP tool-calling tests (e2e_test/responses/test_tools_call.py) were flaky in CI: sometimes Brave/DeepWiki MCP worked, sometimes the model returned only reasoning + message with no mcp_list_tools or mcp_call.
Root cause: poisoned connection pool entries. When connect_dynamic_server first connected to a dynamic MCP server (e.g. http://brave-search-mcp:8080/mcp), it cached the connection in the pool. If list_all_tools() failed or returned before tools were registered (transient network, server not ready), the code only logged a warning and still treated the connect as successful. Later requests hit an early return because the connection was already pooled, skipped tool discovery entirely, and the tool inventory stayed empty. The model then saw McpListTools(tools=[]) and never invoked MCP tools until gateway restart.
Solution
Make pooled connections valid only when tools are actually registered. Retry tool discovery when a connection exists but the inventory is empty, and evict + fail the connect when discovery fails so the next request can reconnect cleanly.
Changes
- Fast path only when connection is pooled and tool_inventory.has_server_tools(url).
- After get_or_create, call list_all_tools() if tools are still missing (re-list on cached client).
- On list_all_tools() failure: remove connection from pool and return McpError::ConnectionFailed instead of succeeding with zero tools.
Test Plan
Re-run the openai-responses job that previously reported 8 failures in TestToolCallingCloud
Checklist
cargo +nightly fmtpassescargo clippy --all-targets --all-features -- -D warningspassesSummary by CodeRabbit
New Features
Bug Fixes
Tests