fix: MCP CPU spike by adding timeout to session cleanup #4758
base: main
Conversation
When making MCP calls through the responses API, the llama-stack server CPU usage could spike to 100% and remain there indefinitely due to anyio's _deliver_cancellation loop hanging during session cleanup. This fix adds a configurable timeout (default 5 seconds) to the __aexit__ calls in MCPSessionManager.close_all() using anyio.fail_after(). If cleanup takes longer than the timeout, it's aborted to prevent the CPU spin. Fixes llamastack#4754
mattf left a comment
please provide reproduction steps.
i did the following and still see 100% CPU usage -
10:53:24 in llama-stack on fix/mcp-cpu-spike-timeout [$?] is 📦 0.4.0.dev0 …
➜ uv run llama stack run --providers agents=inline::meta-reference,inference=remote::llama-openai-compat,vector_io=inline::faiss,tool_runtime=inline::rag-runtime,files=inline::localfs
...
INFO 2026-01-28 10:53:34,588 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321
(Press CTRL+C to quit)
INFO 2026-01-28 10:53:38,379 uvicorn.access:476 uncategorized: ::1:53190 - "POST /v1/responses HTTP/1.1" 200
10:53:35 in llama-stack on fix/mcp-cpu-spike-timeout [$?] is 📦 0.4.0.dev0 …
➜ curl http://localhost:8321/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "llama-openai-compat/Llama-4-Scout-17B-16E-Instruct-FP8",
"input": "Use the provided tool to say something.",
"tools": [
{
"type": "mcp",
"server_label": "local-mcp",
"server_url": "http://localhost:9090"
}
],
"tool_choice": "auto"
}'
Also still seeing a problem
lgtm, CPU spike gone when using MCP
This pull request has merge conflicts that must be resolved before it can be merged. @jwm4 please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Summary
Fixes #4754
When making MCP calls through the responses API, the llama-stack server CPU usage could spike to 100% and remain there indefinitely, even after the request completes.
Root Cause
The issue occurs during MCP session cleanup in
MCPSessionManager.close_all(). When tasks don't respond to cancellation, anyio's_deliver_cancellationloop can spin indefinitely, causing the CPU spike.Solution
Added a configurable timeout (default 5 seconds) to the __aexit__ calls using anyio.fail_after(). If cleanup takes longer than the timeout, it's aborted to prevent the CPU spin.
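For context, a minimal sketch of this approach, assuming illustrative names (the stand-in MCPSessionManager, its _sessions dict, and MCP_CLEANUP_TIMEOUT are hypothetical; the actual llama-stack code may differ):

```python
from contextlib import AbstractAsyncContextManager

import anyio

MCP_CLEANUP_TIMEOUT = 5.0  # assumed default; the PR makes this configurable


class MCPSessionManager:
    """Illustrative stand-in; the real class lives in llama-stack."""

    def __init__(self) -> None:
        # server_label -> async context manager that owns the MCP session
        self._sessions: dict[str, AbstractAsyncContextManager] = {}

    async def close_all(self) -> None:
        for label, session_ctx in list(self._sessions.items()):
            try:
                # Bound cleanup with a deadline: fail_after() raises
                # TimeoutError if __aexit__ does not return in time,
                # instead of letting anyio's cancellation delivery
                # spin at 100% CPU.
                with anyio.fail_after(MCP_CLEANUP_TIMEOUT):
                    await session_ctx.__aexit__(None, None, None)
            except TimeoutError:
                # Cleanup hung; abandon it rather than spin the CPU.
                pass
            finally:
                self._sessions.pop(label, None)
```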
Testing
- Handles TimeoutError from fail_after() gracefully
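The TimeoutError path could be exercised with a test along these lines — a hedged sketch reusing the illustrative MCPSessionManager above, not the PR's actual test code:

```python
import anyio
import pytest


class HangingSession:
    """Async context manager whose cleanup sleeps far past the timeout."""

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc_info):
        await anyio.sleep(3600)  # stands in for cleanup that never finishes


@pytest.mark.anyio
async def test_close_all_survives_hung_cleanup():
    manager = MCPSessionManager()  # the sketch class from above
    manager._sessions["hung"] = HangingSession()

    with anyio.fail_after(30):  # safety net so the test itself can't hang
        await manager.close_all()

    # The hung session should be dropped, not retried forever.
    assert "hung" not in manager._sessions
```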