fix(mcp): evict pooled connections when tool discovery fails#1499

Open: MohanKumar21 wants to merge 4 commits into lightseekorg:main from MohanKumar21:ksmkumar/fix-intermittent-wf-failure

@MohanKumar21 MohanKumar21 commented May 15, 2026

Description

Problem

E2E MCP tool-calling tests (e2e_test/responses/test_tools_call.py) were flaky in CI: sometimes Brave/DeepWiki MCP worked, sometimes the model returned only reasoning + message with no mcp_list_tools or mcp_call.

Root cause: poisoned connection pool entries. When connect_dynamic_server first connected to a dynamic MCP server (e.g. http://brave-search-mcp:8080/mcp), it cached the connection in the pool. If list_all_tools() failed or returned before tools were registered (for example, a transient network error or a server that was not ready yet), the code only logged a warning and still treated the connect as successful. Later requests hit an early return because the connection was already pooled, skipped tool discovery entirely, and the tool inventory stayed empty. The model then saw McpListTools(tools=[]) and never invoked MCP tools until the gateway was restarted.

Solution

Make pooled connections valid only when tools are actually registered. Retry tool discovery when a connection exists but the inventory is empty, and evict + fail the connect when discovery fails so the next request can reconnect cleanly.

Changes

  • crates/mcp/src/core/orchestrator.rs — connect_dynamic_server_with_tenant:
    - Fast path only when connection is pooled and tool_inventory.has_server_tools(url).
    - After get_or_create, call list_all_tools() if tools are still missing (re-list on cached client).
    - On list_all_tools() failure: remove connection from pool and return McpError::ConnectionFailed instead of succeeding with zero tools.
  • crates/mcp/src/inventory/index.rs — Add has_server_tools(server_key) to check whether any tools are registered for a server URL.
  • crates/mcp/src/core/pool.rs — Add remove(key) to evict bad pooled connections.
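
The connect flow described above can be sketched in a few lines. This is a minimal, synchronous illustration and not the code in orchestrator.rs: the `Gateway` type, its fields, and the `discover` closure (standing in for `list_all_tools()`) are assumptions made for the example.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the real connection pool plus tool inventory.
#[derive(Default)]
struct Gateway {
    pooled: HashSet<String>,             // URLs that have a cached connection
    tools: HashMap<String, Vec<String>>, // URL -> registered tool names
}

impl Gateway {
    fn has_server_tools(&self, url: &str) -> bool {
        self.tools.get(url).is_some_and(|t| !t.is_empty())
    }

    /// `discover` stands in for `list_all_tools()` (an assumption of this sketch).
    fn connect<F>(&mut self, url: &str, discover: F) -> Result<(), String>
    where
        F: FnOnce() -> Result<Vec<String>, String>,
    {
        // Fast path: only when the connection is pooled AND tools are registered.
        if self.pooled.contains(url) && self.has_server_tools(url) {
            return Ok(());
        }
        // Equivalent of get_or_create: reuse or cache the connection.
        self.pooled.insert(url.to_string());
        match discover() {
            Ok(tools) => {
                self.tools.insert(url.to_string(), tools);
                Ok(())
            }
            Err(e) => {
                // Evict so the next request reconnects instead of seeing zero tools.
                self.pooled.remove(url);
                Err(format!("connection failed: {e}"))
            }
        }
    }
}
```

With this shape, a failed discovery leaves nothing behind in the pool, so a later call runs discovery again rather than returning early with an empty inventory.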

Test Plan

Re-run the openai-responses job that previously reported 8 failures in TestToolCallingCloud

Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features

    • Idle connection reaper to evict unused pooled connections.
    • Faster pool-key discovery fast path with a short-lived cached "empty discovery" window.
    • Inventory deduplicates tool entries and exposes pool-scoped tooling presence checks.
    • Pool tracks last-used timestamps and tool-discovery completion state.
  • Bug Fixes

    • Cleaner handling of failed pooled connections and reduced redundant discovery.
  • Tests

    • Updated tests for deduplication, nondeterministic lookup, and discovery state APIs.

Review Change Stack

@github-actions github-actions Bot added the mcp MCP related changes label May 15, 2026
coderabbitai Bot commented May 15, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c2cd75f4-c91a-405b-8e90-b27916cbcd92

📥 Commits

Reviewing files that changed from the base of the PR and between 9b12527 and 08c3731.

📒 Files selected for processing (2)
  • crates/mcp/src/core/orchestrator.rs
  • crates/mcp/src/core/pool.rs
📝 Walkthrough

Adds pool-side tool-discovery state and removal APIs, background idle eviction, inventory simple-name deduplication (HashSet), tenant-typed PoolKey usage, and orchestrator discovery fast-paths with defensive inventory refresh and fail-fast pooled-connection cleanup.

Changes

Dynamic Server Connection Pooling & Tool Discovery

| Layer / File(s) | Summary |
| --- | --- |
| Orchestrator reaper and TTL (crates/mcp/src/core/orchestrator.rs) | Introduces EMPTY_DISCOVERY_TTL, spawns spawn_idle_reaper to periodically evict idle pooled connections, and documents eviction behavior. |
| Orchestrator connection flow & discovery fast-paths (crates/mcp/src/core/orchestrator.rs) | connect_dynamic_server_with_tenant accepts Option<TenantId>, uses inventory-aware pre/post fast paths to avoid redundant list calls, clones PoolKey into get_or_create closure, clears stale inventory before discovery, repopulates from list_all_tools, marks discovery completed on success, and removes pooled entry/returns ConnectionFailed on discovery failure when appropriate. |
| Pool key and tenant type (crates/mcp/src/core/pool.rs) | PoolKey.tenant_id changed to Option<TenantId> and constructors updated to accept Option<TenantId>. |
| CachedConnection & last-used refresh (crates/mcp/src/core/pool.rs) | CachedConnection gains tools_discovered: bool and last_used tracking; get_or_create/get refresh last_used on hits. |
| Pool APIs: discovery query/mark, eviction, remove (crates/mcp/src/core/pool.rs) | Adds tool_discovery_completed() and mark_tool_discovery_completed() to read/set discovery state; updates evict_idle() to remove stale keys, decrement connection_count, and call eviction_callback per removal; adds remove() to evict by key and adjust counters. Tests updated for tenant-id usage and discovery APIs. |
| Inventory simple-name deduplication & pool index (crates/mcp/src/inventory/index.rs) | tools_by_simple_name now stores HashSet<QualifiedToolName> for deduplication; get_tool() returns an arbitrary matching qualified tool under collisions; adds tools_by_pool, insert_pool_entry(), and has_server_tools(&PoolKey); updates clear/insert logic and tests to account for nondeterministic set iteration. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

tests

Suggested reviewers

  • CatherineSue
  • key4ng
  • slin1237

Poem

🐇 A rabbit in the pool digs deep,
It marks the tools the servers keep,
When caches say the inventory's clear,
It skips the lists and hops on near,
If listing fails, it sweeps — then sleeps.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main purpose: fixing flaky connection pool behavior by evicting connections when tool discovery fails. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request enhances the MCP orchestrator's connection pooling by ensuring tool discovery is completed before returning a pooled connection and implementing a mechanism to remove failed connections from the pool. Review feedback suggests triggering the eviction callback during connection removal to keep the tool inventory synchronized, clearing existing tools before re-discovery to prevent stale entries, and optimizing performance on hot paths using interned strings.

Comment on lines +231 to +238
pub fn remove(&self, key: &PoolKey) -> bool {
    let mut connections = self.connections.lock();
    if connections.pop(key).is_some() {
        self.connection_count.fetch_sub(1, Ordering::Relaxed);
        true
    } else {
        false
    }

high

The remove method should trigger the eviction_callback when a connection is successfully removed. The McpOrchestrator relies on this callback to clear associated tools from the ToolInventory. Additionally, ensure that long-lived bidirectional streams in this pool implement an idle timeout to prevent resource exhaustion. Finally, since PoolKey is used on hot paths, consider using an interned string type like Arc<str> for its fields to improve performance.

Suggested change
- pub fn remove(&self, key: &PoolKey) -> bool {
-     let mut connections = self.connections.lock();
-     if connections.pop(key).is_some() {
-         self.connection_count.fetch_sub(1, Ordering::Relaxed);
-         true
-     } else {
-         false
-     }
+ pub fn remove(&self, key: &PoolKey) -> bool {
+     let mut connections = self.connections.lock();
+     if connections.pop(key).is_some() {
+         self.connection_count.fetch_sub(1, Ordering::Relaxed);
+         if let Some(callback) = &self.eviction_callback {
+             callback(key);
+         }
+         true
+     } else {
+         false
+     }
+ }
References
  1. Implement an idle timeout for long-lived bidirectional streams to prevent inactive clients from indefinitely consuming server-side resources such as tasks and channels.
  2. For types that are frequently cloned on hot paths and represent a small, repeated set of values (e.g., worker IDs or tenant IDs), use an interned string type like Arc<str> to improve performance.

Comment thread crates/mcp/src/core/orchestrator.rs Outdated
// Load tools from the server.
// If this fails, remove the connection from the pool so the next request
// retries from scratch instead of perpetually seeing zero tools.
match client.peer().list_all_tools().await {

security-medium medium

Consider calling inventory_clone.clear_server_tools(&server_key) before re-listing tools to prevent stale or partial entries. To prevent vulnerabilities from duplicate entries, ensure the inventory uses a HashSet for tool storage. Additionally, ensure tool names are JSON-escaped before interpolation to handle special characters correctly.

        inventory_clone.clear_server_tools(&server_key);
        match client.peer().list_all_tools().await {
References
  1. To prevent vulnerabilities from duplicate entries, use data structures that inherently enforce uniqueness, such as HashSet, instead of manually deduplicating collections like Vec.
  2. Tool names should be JSON-escaped before interpolation to correctly handle special characters like quotes, backslashes, and unicode.

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/mcp/src/core/orchestrator.rs (1)

1334-1409: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't use URL-scoped inventory to validate tenant/auth-scoped pool entries.

PoolKey is intentionally keyed by url + auth_hash + tenant_id, but both has_server_tools(&pool_key.url) checks only consult URL-scoped inventory. A second connection to the same URL with different credentials or tenant can therefore skip discovery and inherit whatever tool set another pool entry registered. The guard at Line 1405 makes this happen even when the exact PoolKey was just created. Reuse needs to be keyed by the same identity as the pooled connection, or this optimization breaks the isolation the pool key is supposed to provide.
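
To make the isolation concern concrete, here is a small sketch of an inventory lookup keyed by the full pool identity rather than by URL alone. The field types (`auth_hash: u64`, `tenant_id: Option<String>`) and the `insert_tool` helper are assumptions for illustration; the real `PoolKey` and `ToolInventory` in this crate may differ.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical pool identity: url + auth_hash + tenant_id.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct PoolKey {
    url: String,
    auth_hash: u64,
    tenant_id: Option<String>,
}

#[derive(Default)]
struct ToolInventory {
    // Tools are indexed by the same identity the pool uses for connections.
    tools_by_pool: HashMap<PoolKey, HashSet<String>>,
}

impl ToolInventory {
    fn insert_tool(&mut self, key: &PoolKey, tool: &str) {
        self.tools_by_pool
            .entry(key.clone())
            .or_default()
            .insert(tool.to_string());
    }

    /// True only when this exact pool entry has at least one registered tool.
    fn has_server_tools(&self, key: &PoolKey) -> bool {
        self.tools_by_pool.get(key).is_some_and(|t| !t.is_empty())
    }
}
```

Two keys that share a URL but differ in tenant or credentials then answer the presence check independently, so one entry cannot skip discovery on the strength of another entry's tools.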

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/mcp/src/core/orchestrator.rs` around lines 1334 - 1409, The code
currently checks tools by URL only (has_server_tools(&pool_key.url)), which
breaks PoolKey isolation; update both fast-path and post-get_or_create checks to
consult the tool inventory using the full PoolKey identity
(url+auth_hash+tenant_id) instead of just the URL: introduce or call an
inventory method that accepts the PoolKey (e.g.,
has_tools_for_pool_key(&pool_key) or has_server_tools_with_auth(&pool_key)) and
use that in place of has_server_tools(&pool_key.url) (references: PoolKey,
pool_key, has_server_tools, connection_pool.get_or_create, inventory_clone,
tool_inventory). Ensure the inventory indexing stores/retrieves tools keyed by
the same PoolKey fields (auth_hash and tenant_id included) so pooled connections
with different credentials/tenants do not share tool lists.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e2c27fe5-dd69-4d0d-9e1d-df37db989b58

📥 Commits

Reviewing files that changed from the base of the PR and between 8c11f43 and 27f86d0.

📒 Files selected for processing (3)
  • crates/mcp/src/core/orchestrator.rs
  • crates/mcp/src/core/pool.rs
  • crates/mcp/src/inventory/index.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 00ff20dc35

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread crates/mcp/src/core/orchestrator.rs Outdated
@MohanKumar21 MohanKumar21 force-pushed the ksmkumar/fix-intermittent-wf-failure branch from 00ff20d to 6279b39 Compare May 15, 2026 13:39
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/mcp/src/core/orchestrator.rs`:
- Around line 1334-1345: The fast path in discover_or_get_server_url (checking
self.connection_pool.tool_discovery_completed(&pool_key) &&
self.tool_inventory.has_server_tools(&pool_key.url)) causes servers that
legitimately have zero tools to always fall through and re-run list_all_tools()
on each request; to fix, add and use a cached "discovered_empty" state (or a
discovered_tools_count option) keyed by pool_key.url in ToolInventory or
ConnectionPool and set it when list_all_tools() returns zero, check that flag
alongside tool_discovery_completed (e.g., treat discovered_empty as a valid
discovered state), and add a TTL or eviction policy so the discovered-empty
marker is refreshed periodically to avoid permanent staleness while preventing
repeated discovery calls for intentionally empty servers (refer to
tool_discovery_completed, has_server_tools, list_all_tools, and the
pool_key.url/ToolInventory APIs).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4364485e-4db5-4a1a-b4ae-576ff2dca80f

📥 Commits

Reviewing files that changed from the base of the PR and between 27f86d0 and 6279b39.

📒 Files selected for processing (3)
  • crates/mcp/src/core/orchestrator.rs
  • crates/mcp/src/core/pool.rs
  • crates/mcp/src/inventory/index.rs

Comment thread crates/mcp/src/core/orchestrator.rs Outdated
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6279b39205


Comment on lines +1341 to +1343
if self.connection_pool.tool_discovery_completed(&pool_key)
    && self.tool_inventory.has_server_tools(&pool_key.url)
{

P2 Badge Allow fast path after successful empty tool discovery

mark_tool_discovery_completed is documented to treat empty list_all_tools results as successful discovery, but this fast path also requires has_server_tools(url), so zero-tool servers can never hit the cache path and will re-run discovery on every connect. In environments where a dynamic MCP endpoint intentionally exposes no tools (or temporarily returns none), this causes repeated network calls and avoidable ConnectionFailed errors from transient list_all_tools failures despite a previously successful discovery.
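
One way to let zero-tool servers use the fast path without caching "empty" forever is to record when discovery completed and only trust an empty result for a bounded window. The sketch below is an assumption-laden illustration: `DiscoveryState`, its fields, and `can_skip_discovery` are hypothetical names, and the TTL is passed in rather than taken from the PR.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-pool-entry discovery record.
#[derive(Clone, Copy)]
struct DiscoveryState {
    completed_at: Option<Instant>, // None until discovery has succeeded once
    tool_count: usize,             // number of tools registered by that discovery
}

impl DiscoveryState {
    /// Decide whether a connect call may skip `list_all_tools()`.
    fn can_skip_discovery(&self, now: Instant, empty_ttl: Duration) -> bool {
        match self.completed_at {
            // Never discovered: must run discovery.
            None => false,
            // Discovered with tools: the fast path is always valid.
            Some(_) if self.tool_count > 0 => true,
            // Discovered but empty: trust the result only inside the TTL window.
            Some(at) => now.duration_since(at) < empty_ttl,
        }
    }
}
```

The trade-off is that a server that gains tools shortly after an empty discovery stays invisible until the window expires, so the TTL should be short.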


        "Failed to list tools from '{}': {}; removing pooled connection",
        server_key, e
    );
    self.connection_pool.remove(&pool_key);

P2 Badge Avoid evicting pooled client on concurrent discovery error

Concurrent callers can share the same pooled client and race through list_all_tools because discovery is not serialized per pool key; if one caller succeeds and another transiently fails, this unconditional remove evicts the healthy connection and triggers URL-level inventory clearing, undoing the successful discovery and causing temporary tool loss/flaky failures. The eviction should be conditional on discovery still being incomplete (or protected by a per-key discovery lock) before clearing shared state.
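
The conditional eviction suggested here might look like the following. It is a sketch using `std::sync::Mutex` and string keys; `remove_if_undiscovered` is a hypothetical method name, and the real pool uses its own key and cache types.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical pooled entry; only the discovery flag matters for this sketch.
struct Entry {
    tools_discovered: bool,
}

#[derive(Default)]
struct Pool {
    connections: Mutex<HashMap<String, Entry>>,
}

impl Pool {
    fn insert(&self, key: &str) {
        self.connections
            .lock()
            .unwrap()
            .insert(key.to_string(), Entry { tools_discovered: false });
    }

    fn mark_tool_discovery_completed(&self, key: &str) {
        if let Some(entry) = self.connections.lock().unwrap().get_mut(key) {
            entry.tools_discovered = true;
        }
    }

    fn contains(&self, key: &str) -> bool {
        self.connections.lock().unwrap().contains_key(key)
    }

    /// Evict only while discovery is still incomplete, so a caller whose
    /// listing failed cannot undo a sibling caller's successful discovery.
    fn remove_if_undiscovered(&self, key: &str) -> bool {
        let mut connections = self.connections.lock().unwrap();
        let undiscovered = connections.get(key).is_some_and(|e| !e.tools_discovered);
        if undiscovered {
            connections.remove(key);
        }
        undiscovered
    }
}
```

The check and the removal happen under one lock acquisition, which is what keeps the decision consistent with the flag another caller may be setting.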


Signed-off-by: MohanKumar21! <mohanmrm20@gmail.com>
Signed-off-by: MohanKumar21! <mohanmrm20@gmail.com>
@MohanKumar21 MohanKumar21 force-pushed the ksmkumar/fix-intermittent-wf-failure branch from 3b75491 to 3c2b4a7 Compare May 18, 2026 05:51
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c2b4a7866


Comment thread crates/mcp/src/core/orchestrator.rs Outdated
Comment on lines +301 to +304
let count = evicted.len();
if count > 0 {
    self.connection_count.fetch_sub(count, Ordering::Relaxed);
    if let Some(callback) = &self.eviction_callback {

P2 Badge Guard idle-eviction counter update against concurrent clear

evict_idle decrements connection_count after releasing the pool lock, so a concurrent clear() can reset the counter to 0 between removal and fetch_sub(count). In that race, fetch_sub underflows the usize counter and len()/stats() report a huge bogus value. Keep the decrement synchronized with the removal (or use saturating/CAS logic) to avoid counter corruption during shutdown or admin clears.
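
For the "saturating/CAS logic" option, the standard library's `AtomicUsize::fetch_update` can clamp the counter at zero. The helper name below is an assumption; it illustrates the technique and is not the change made in pool.rs.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Decrement `counter` by `by` without wrapping below zero.
/// Returns the value stored after the update.
fn saturating_decrement(counter: &AtomicUsize, by: usize) -> usize {
    // fetch_update retries a compare-and-swap loop until the closure's
    // result is stored; the closure always returns Some, so this is Ok(prev).
    let prev = counter
        .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |cur| {
            Some(cur.saturating_sub(by))
        })
        .unwrap_or_else(|cur| cur);
    prev.saturating_sub(by)
}
```

This avoids the wraparound, although keeping the decrement under the same lock as the removal is the option that also keeps the counter exactly in step with the map.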


@MohanKumar21 MohanKumar21 force-pushed the ksmkumar/fix-intermittent-wf-failure branch from 3c2b4a7 to 1a6281e Compare May 18, 2026 07:56
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a6281e5e9


Comment thread crates/mcp/src/core/orchestrator.rs Outdated
Comment on lines +1413 to +1416
    && (self.tool_inventory.has_server_tools(&pool_key.url)
        || self
            .connection_pool
            .discovery_fresh_within(&pool_key, EMPTY_DISCOVERY_TTL))

P1 Badge Don’t bypass empty inventory on fresh discovery timestamps

The new fast-path condition returns early when tool_discovery_completed is true and discovery_fresh_within(...) passes, even if has_server_tools(url) is false. That reintroduces a tool-loss window: if a sibling tenant eviction clears URL-scoped inventory, requests for this pooled entry can skip re-discovery for up to EMPTY_DISCOVERY_TTL (60s) and serve with no MCP tools. Fresh evidence in this revision is the added || discovery_fresh_within(&pool_key, EMPTY_DISCOVERY_TTL) branch, which explicitly allows bypassing the inventory check.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/mcp/src/core/pool.rs`:
- Around line 323-335: remove() currently invokes eviction_callback while
holding connections.lock(), risking deadlock; change it to detect and remove the
connection while holding the lock, record whether a callback exists (and the
key) into local variables, then drop the lock and call the callback afterwards.
Specifically, in the pub fn remove(&self, key: &PoolKey) -> bool function: use
connections.lock() only to call connections.pop(key) and update
connection_count.fetch_sub, capture a cloned/owned reference to
self.eviction_callback (or a boolean flag) and the key if present, release the
mutex, and then invoke the callback outside the lock (matching how evict_idle
does it).

In `@crates/mcp/src/inventory/index.rs`:
- Around line 332-342: The current has_server_tools(&self, server_key: &str) is
too coarse because pooled connections are keyed by (url, auth_hash, tenant_id);
update the inventory lookup to be tenant/auth-aware by changing has_server_tools
to accept the pool identity (e.g. auth_hash and tenant_id or a composite
pool_key) and check tools_by_server against that pool-scoped key (or add a new
tools_by_pool map keyed by the composite). Specifically, modify the
has_server_tools signature and all callers to pass the pool-scoped key, and
ensure the method uses tools_by_server (or the new tools_by_pool) to return true
only when the exact pool-scoped entry exists and is non-empty.
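
The first item above (running the eviction callback only after the lock is released) could be structured roughly as follows. This sketch uses `std::sync::Mutex`, string keys, and a `u32` placeholder for the cached connection, all of which are assumptions for the example.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical callback type; the real signature may differ.
type EvictionCallback = Arc<dyn Fn(&str) + Send + Sync>;

struct Pool {
    connections: Mutex<HashMap<String, u32>>, // u32 stands in for a connection
    connection_count: AtomicUsize,
    eviction_callback: Option<EvictionCallback>,
}

impl Pool {
    fn remove(&self, key: &str) -> bool {
        // Scope the guard so it is dropped before the callback runs.
        let removed = {
            let mut connections = self.connections.lock().unwrap();
            let removed = connections.remove(key).is_some();
            if removed {
                // Counter update stays under the lock, in step with the map.
                self.connection_count.fetch_sub(1, Ordering::Relaxed);
            }
            removed
        };
        if removed {
            if let Some(callback) = &self.eviction_callback {
                // The lock is no longer held, so the callback may touch the pool.
                callback(key);
            }
        }
        removed
    }
}
```

Because the guard lives only inside the inner block, a callback that re-enters the pool cannot deadlock on `connections`.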

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 34ed448f-3bf0-43a2-b560-21176bdf6f0c

📥 Commits

Reviewing files that changed from the base of the PR and between 6279b39 and 1a6281e.

📒 Files selected for processing (3)
  • crates/mcp/src/core/orchestrator.rs
  • crates/mcp/src/core/pool.rs
  • crates/mcp/src/inventory/index.rs

Comment thread crates/mcp/src/core/pool.rs
Comment thread crates/mcp/src/inventory/index.rs Outdated
Signed-off-by: MohanKumar21! <mohanmrm20@gmail.com>
@MohanKumar21 MohanKumar21 force-pushed the ksmkumar/fix-intermittent-wf-failure branch from 1a6281e to 9b12527 Compare May 18, 2026 09:28
Signed-off-by: MohanKumar21! <mohanmrm20@gmail.com>
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/mcp/src/core/orchestrator.rs`:
- Around line 1483-1493: The orchestrator is unconditionally calling
metrics.record_connection_opened() for connections returned by
connection_pool.get_or_create(), causing over-counting when cached
Arc<McpClient> instances are reused; change get_or_create (in pool.rs) to return
an indicator of whether the connection was newly created (e.g., return
(Arc<McpClient>, bool is_new) or provide a separate creation-aware API) and
update the orchestrator call sites that currently call
record_connection_opened() (the branches that check
connection_pool.tool_discovery_completed(&pool_key) and subsequent
early-returns) to only call record_connection_opened() when is_new is true, or
alternatively move the metrics increment into the creation code path inside
connection_pool.get_or_create so cached hits do not increment
active_connections.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4ae22887-b2b0-446f-9c59-699119be2294

📥 Commits

Reviewing files that changed from the base of the PR and between 1a6281e and 9b12527.

📒 Files selected for processing (3)
  • crates/mcp/src/core/orchestrator.rs
  • crates/mcp/src/core/pool.rs
  • crates/mcp/src/inventory/index.rs

Comment on lines +1483 to +1493
// Another caller may have completed discovery while we were connecting.
// Same combined check as the outer fast path — see note above.
if self.connection_pool.tool_discovery_completed(&pool_key)
    && (inventory_clone.has_server_tools(&pool_key)
        || self
            .connection_pool
            .discovery_fresh_within(&pool_key, EMPTY_DISCOVERY_TTL))
{
    self.metrics.record_connection_opened();
    return Ok(server_key);
}

⚠️ Potential issue | 🟠 Major | 💤 Low value

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if get_or_create returns information about whether it was a cache hit
# and how metrics are expected to be recorded
rg -n -A 10 'fn get_or_create' crates/mcp/src/core/pool.rs

Repository: lightseekorg/smg

Length of output: 437


🏁 Script executed:

#!/bin/bash
# Get the full get_or_create implementation
cat -n crates/mcp/src/core/pool.rs | sed -n '172,250p'

Repository: lightseekorg/smg

Length of output: 3150


🏁 Script executed:

#!/bin/bash
# Get the full context around line 1491 and 1542 in orchestrator.rs
wc -l crates/mcp/src/core/orchestrator.rs

Repository: lightseekorg/smg

Length of output: 101


🏁 Script executed:

#!/bin/bash
# Find where record_connection_opened is defined and called
rg -n 'record_connection_opened' crates/mcp/src/

Repository: lightseekorg/smg

Length of output: 672


🏁 Script executed:

#!/bin/bash
# Get context around line 1491 and 1542
cat -n crates/mcp/src/core/orchestrator.rs | sed -n '1420,1555p'

Repository: lightseekorg/smg

Length of output: 7254


🏁 Script executed:

#!/bin/bash
# Check the metrics definition
cat -n crates/mcp/src/core/metrics.rs | sed -n '93,110p'

Repository: lightseekorg/smg

Length of output: 1025


🏁 Script executed:

#!/bin/bash
# Search for record_connection_closed to understand the full lifecycle
rg -n -B 2 -A 2 'record_connection_closed' crates/mcp/src/core/orchestrator.rs

Repository: lightseekorg/smg

Length of output: 226


Metrics may over-count when cached connections trigger record_connection_opened().

get_or_create returns an Arc<McpClient> without indicating whether it was a cache hit (line 186 in pool.rs) or a newly created connection (lines 190–210). The orchestrator then calls record_connection_opened() at lines 1491, 1528, and 1542 regardless of whether the connection was new or cached. Since record_connection_opened() increments active_connections (fetch_add), a reused cached connection will increment the counter multiple times across different requests, creating an accounting mismatch with record_connection_closed() which is only called once during shutdown.
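
A creation-aware `get_or_create` is one way to keep the metric honest. The sketch below is synchronous and uses simplified stand-in types (`McpClient` with a single `url` field, a `Metrics` struct with one counter, a free `connect` function); none of these mirror the real definitions.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical client; the real McpClient holds an actual connection.
struct McpClient {
    url: String,
}

#[derive(Default)]
struct Pool {
    connections: Mutex<HashMap<String, Arc<McpClient>>>,
}

impl Pool {
    /// Returns the client plus `true` when it was created by this call.
    fn get_or_create(&self, url: &str) -> (Arc<McpClient>, bool) {
        let mut connections = self.connections.lock().unwrap();
        if let Some(existing) = connections.get(url) {
            return (Arc::clone(existing), false); // cache hit
        }
        let client = Arc::new(McpClient { url: url.to_string() });
        connections.insert(url.to_string(), Arc::clone(&client));
        (client, true)
    }
}

#[derive(Default)]
struct Metrics {
    active_connections: AtomicUsize,
}

fn connect(pool: &Pool, metrics: &Metrics, url: &str) -> Arc<McpClient> {
    let (client, is_new) = pool.get_or_create(url);
    if is_new {
        // Count each pooled connection once, not once per request.
        metrics.active_connections.fetch_add(1, Ordering::Relaxed);
    }
    client
}
```

Calling `connect` twice for the same URL returns the same `Arc` and leaves `active_connections` at 1, which matches a single later `record_connection_closed()`.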


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 08c3731f55



/// Clear all cached items for a server. Uses server index for O(tools_per_server) removal.
pub fn clear_server_tools(&self, server_key: &str) {
    self.tools_by_pool.retain(|key, _| key.url != server_key);

P2 Badge Preserve sibling pool tool markers during URL clears

clear_server_tools now removes tools_by_pool entries for all PoolKeys that share a URL. Since connect_dynamic_server_with_tenant calls this on every successful discovery and then repopulates only the current pool key, discovery by tenant/auth variant A erases variant B’s marker; B then misses the fast path and re-runs list_all_tools on its next request. For multi-tenant traffic on one MCP URL, this introduces repeated rediscovery/network churn and makes tool availability flaky whenever one of those extra discovery calls transiently fails.
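
The difference between a URL-wide clear and a pool-scoped clear can be shown with a reduced model. `clear_pool_tools` is a hypothetical alternative named only for this sketch, and the two-field `PoolKey` is a simplification of the real key.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified pool identity for illustration (the real key also carries auth data).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct PoolKey {
    url: String,
    tenant_id: Option<String>,
}

#[derive(Default)]
struct ToolInventory {
    tools_by_pool: HashMap<PoolKey, HashSet<String>>,
}

impl ToolInventory {
    /// URL-wide clear: drops the markers of every pool entry sharing the URL.
    fn clear_server_tools(&mut self, server_key: &str) {
        self.tools_by_pool.retain(|key, _| key.url != server_key);
    }

    /// Hypothetical pool-scoped clear: sibling tenants keep their markers.
    fn clear_pool_tools(&mut self, pool_key: &PoolKey) {
        self.tools_by_pool.remove(pool_key);
    }
}
```

If discovery for one tenant only clears its own key, a sibling tenant on the same URL still hits the fast path on its next request.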
