NIPS 25 | KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems #192

@Dszdsxc0405

0 Paper Information

Title: KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
Authors: Hancheng Ye¹, Zhengqi Gao², Mingyuan Ma¹, Qinsi Wang¹, Yuzhe Fu¹, Ming-Yu Chung¹, Yueqian Lin¹, Zhijian Liu³, Jianyi Zhang¹, Danyang Zhuo¹, Yiran Chen¹
Institutions: ¹Duke University, ²MIT, ³NVIDIA
Paper: [[2510.12872] KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems](https://arxiv.org/abs/2510.12872)
Github Note Link: [FastMAS/KVCOMM: [NeurIPS'25] KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems](https://github.com/FastMAS/KVCOMM)
Openreview: [KVComm: Enabling Efficient LLM Communication through Selective KV Sharing | OpenReview](https://openreview.net/forum?id=F7rUng23nw)
Author Page: [HankYe (Hancheng Ye)](https://github.com/HankYe)

1 Background and Core Problem

1.1 Background

Although multiple agents often share overlapping context (such as retrieved passages or their peers' outputs), each agent recomputes the KV cache for all of its input tokens, which makes prefilling highly inefficient. The paper defines this as the multi-context redundancy problem.

For example, an 8B-parameter Llama model requires ~430 ms to prefill a 3K-token prompt on a single H100 GPU. If each of M agents receives messages from all its peers, the total prefilling complexity of these repeated computations scales as O(M^2), leading to low computational utilization and posing a major challenge for real-time multi-agent collaboration.
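The quadratic growth can be made concrete with a small counting sketch. It assumes a fully connected setup in which every agent prefills one message from each of its peers; `ms_per_message` is a hypothetical per-message prefill cost supplied by the reader, not a number from the paper.

```python
def redundant_prefill_count(num_agents: int) -> int:
    """Count (receiver, sender) pairs: each agent prefills one message per peer."""
    return num_agents * (num_agents - 1)


def estimated_prefill_ms(num_agents: int, ms_per_message: float) -> float:
    """Total prefill time if every peer message is prefilled from scratch.

    ms_per_message is a hypothetical, user-supplied cost per message.
    """
    return redundant_prefill_count(num_agents) * ms_per_message


if __name__ == "__main__":
    for m in (2, 3, 5):
        # Prints the number of peer-message prefills for m agents.
        print(m, redundant_prefill_count(m))
```

Going from 2 to 5 agents raises the count from 2 to 20 peer-message prefills, which is the O(M^2) growth described above.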

1.2 Why

Although different agents process the same shared text, each agent prepends its own agent-specific prefix context to it, which leads to two cascading problems:

  1. Positional encoding shift: Modern LLMs commonly use RoPE (rotary position embedding). A change in prefix length shifts the absolute position of every subsequent token, so each of its Keys is rotated by a different angle.
  2. Attention context change: The KV-cache is a product of attention computation, so differences in prefix content directly change the context that self-attention sees for the shared text.
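The positional part of the problem can be illustrated with a minimal RoPE sketch in plain Python. It pairs adjacent dimensions and uses a base of 10000, which is a simplification and not necessarily the layout of any particular model implementation; `rope_rotate` is a name made up for this note.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by position * base**(-i/dim)."""
    dim = len(vec)
    out = list(vec)
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


if __name__ == "__main__":
    key = [0.3, -1.2, 0.7, 0.5]
    # Same token, position 5 in its own text, behind a 100-token prefix.
    with_prefix = rope_rotate(key, 105)
    # Rotations compose: rotating the position-5 key by 100 more gives the same result.
    recomposed = rope_rotate(rope_rotate(key, 5), 100)
    print(max(abs(a - b) for a, b in zip(with_prefix, recomposed)))
```

Because rotations compose, the purely positional part of the shift can be undone or re-applied with an extra rotation. The content-dependent change in item 2 has no such closed form.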

1.3 Offset-Variance Problem

The authors precisely define this phenomenon as the Multi-context Redundancy problem, and identify its fundamental challenge as the Offset-Variance Problem of KV-cache.

Image

Existing work on reducing prefill overhead falls into four categories: prompt-level reuse, selective recomputation, cache compression, and kernel optimization. The figure compares KVCOMM with the two most relevant categories.

In a nutshell: Others "cut out part of the recomputation"; KVCOMM "recomputes nothing, but compensates for deviations by looking up historical anchors".

2 Observation and Opportunity

2.1 Core Observations x2

Image

Question:
The experiment in Figure 1a verifies that "the offset has structure", but does not yet solve the problem of "how to predict the offset". Therefore, we further consider: If two tokens are close in the semantic space, will their KV offsets also be close?

Conclusions

  1. The KV offset of the same token in different contexts is not random noise, but a structured deviation with high consistency (Figure 1a).
  2. Keys must first be aligned to the same positional baseline via RoPE de-rotation before the "content understanding deviation" can be compared fairly (Figure 1b).
  3. After positional alignment, the KV offset curves of two semantically similar tokens nearly coincide: proximity in embedding space carries over to the KV-cache offset space (Figure 1b).

2.2 Token-level KV proximity

Motivation: "KVCOMM hinges on the empirical observation that per-token KV vectors remain remarkably similar across distinct conversational contexts as long as the model parameters are shared."

Image

Conclusions

  1. Proposition 1: Under the same prefix, embedding distance constrains KV distance (Figure 4 a-d)
  2. Proposition 2: Across different prefixes, embedding distance constrains offset distance (Figure 4 c-d)

3 System Design

3.1 Overall Architecture: Training-Free & Online

Image
  1. Initialization: When the system starts, all agents precompute and cache the Base KV-cache of all Prefix Segments in their own prompt templates.
  2. Placeholder Readiness Check: When a new request arrives, the agent checks whether the Base KV of the required Placeholder already exists in shared memory. If it is missing, it is computed in parallel to fill the gap.
  3. Reuse or Fallback: The system determines whether the current Placeholder is reusable according to matching criteria (see 3.2). If not available, it falls back to standard Dense Prefill, and writes the newly generated KV and its offset into the Anchor Pool to expand coverage.
  4. Offset Approximation: If all Placeholders match successfully, the system predicts the KV offset in the current context via anchor interpolation, updating the KV-cache of all Placeholders and adjacent prefixes in parallel.
  5. Decoding: Concatenate the updated KV segments in order, skip the Prefill phase entirely, and start autoregressive decoding.
  6. Anchor Update: After decoding is complete, the newly generated Response KV-cache enters the Anchor prediction module. If a similar sample exists in the pool, only metadata is updated; otherwise, it is added as a new anchor. When the pool capacity reaches the threshold V, the least active anchor is evicted using an LRU strategy, enabling adaptive memory management.
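The anchor-update step (step 6) can be sketched as a small pool with LRU eviction. This is a toy illustration under stated assumptions, not the authors' implementation: exact key equality stands in for the paper's similarity check, the `hits` counter stands in for "metadata", and the names `AnchorEntry` and `AnchorPool` are invented for this note.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class AnchorEntry:
    placeholder_offset: list  # stands for the Placeholder Offset
    prefix_offset: list       # stands for the Prefix Offset
    hits: int = 0             # hypothetical metadata field


class AnchorPool:
    def __init__(self, capacity: int):
        self.capacity = capacity          # plays the role of the threshold V
        self._anchors = OrderedDict()     # ordered from least to most recently used

    def lookup(self, key):
        entry = self._anchors.get(key)
        if entry is not None:
            self._anchors.move_to_end(key)  # mark as recently used
        return entry

    def add(self, key, entry: AnchorEntry):
        existing = self._anchors.get(key)
        if existing is not None:
            # A matching anchor exists: update metadata only.
            existing.hits += 1
            self._anchors.move_to_end(key)
            return
        self._anchors[key] = entry
        while len(self._anchors) > self.capacity:
            self._anchors.popitem(last=False)  # evict the least recently used anchor


if __name__ == "__main__":
    pool = AnchorPool(capacity=2)
    pool.add("a", AnchorEntry([0.1], [0.2]))
    pool.add("b", AnchorEntry([0.3], [0.4]))
    pool.lookup("a")                      # "a" becomes most recently used
    pool.add("c", AnchorEntry([0.5], [0.6]))
    print(pool.lookup("b"))               # "b" was evicted
```

An `OrderedDict` keeps both recency refresh and eviction at O(1), which is why it is a common choice for small LRU structures like this one.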

3.2 Anchor Pool Design

| Component | Technical Definition | Physical Meaning |
| --- | --- | --- |
| Base KV-cache | The KV of a Placeholder in an isolated context (without external prefix) | "Semantic representation of the text itself" |
| Placeholder Offset | The difference Δ_ph between the actual KV of this Placeholder in a specific agent context and the Base KV | "Understanding deviation caused by the current role/prefix" |
| Prefix Offset | The KV offset Δ_pf of the fixed prefix segment immediately following this Placeholder | "Attention redistribution of subsequent connecting text after context injection" |

Question: Why is Prefix Offset needed?
The paper points out that due to the Sink Attention mechanism of Transformers, local context dependencies are extremely strong. When a Placeholder is injected with new content, the attention distribution of the fixed prefix immediately following it (e.g., "Answers from other agents are:") will be redirected. If only the Placeholder is updated while ignoring the offset of the adjacent prefix, semantic discontinuity will result.

3.3 Algorithm

  1. Step 1: Offset Approximation
  2. Step 2: Positional Alignment
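Step 1 can be sketched as a weighted interpolation over stored anchors. The note only says that offsets are predicted "via anchor interpolation", so the weighting below (a softmax over negative L2 embedding distance, with a hypothetical `temperature`) is an assumption made for illustration, as are all function names. Step 2 is not covered here.

```python
import math


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def softmax_weights(distances, temperature=1.0):
    """Closer anchors get larger weights (assumed weighting scheme)."""
    scores = [-d / temperature for d in distances]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def approximate_offset(query_embedding, anchors):
    """anchors: list of (embedding, offset) pairs; returns the interpolated offset."""
    weights = softmax_weights([l2_distance(query_embedding, emb) for emb, _ in anchors])
    dim = len(anchors[0][1])
    return [sum(w * off[j] for w, (_, off) in zip(weights, anchors)) for j in range(dim)]


def apply_offset(base_kv, offset):
    """Updated KV = Base KV + predicted offset."""
    return [b + o for b, o in zip(base_kv, offset)]


if __name__ == "__main__":
    anchors = [([0.0, 0.0], [1.0, 1.0]), ([2.0, 0.0], [3.0, 5.0])]
    # The query is equidistant from both anchors, so the offsets are averaged.
    delta = approximate_offset([1.0, 0.0], anchors)
    print(apply_offset([10.0, 10.0], delta))
```

A query that sits exactly between two anchors receives the mean of their offsets, while a query very close to one anchor receives almost exactly that anchor's offset.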

4 Experiments and Results

4.1 Experimental Setup

| Dimension | Specific Configuration |
| --- | --- |
| Multi-agent Architecture | Fully-connected DAG, 2~5 agents collaborating, messages passed in one direction |
| Model Selection | Llama-3.1-8B-Instruct (RAG/math tasks); Qwen-2.5-Coder-7B-Instruct (code tasks) |
| Benchmark Datasets | MMLU (knowledge QA), GSM8K (math reasoning), HumanEval (code generation) |
| Comparison Baselines | Original (no reuse), CacheBlend (selective recomputation, fixed 20% token recompute) |
| Evaluation Metrics | Accuracy/Pass@1; Reuse Rate, TTFT, average speedup |
| Hardware Environment | Single NVIDIA H100 GPU, maximum generation length uniformly set to 512 tokens |

4.2 Main Experiments: Lossless Accuracy, High Reuse Rate

Image

① Code tasks are extremely sensitive to cache accuracy

"In the HumanEval coding benchmark, KVCOMM delivers stable Pass@1 scores (81.4%–83.2%), significantly surpassing CacheBlend by an average margin of 53%."

Analysis

"HumanEval agents produce code with many syntax separators (e.g., ., ;, !), which induce diverse and prefix-sensitive KV-cache distributions. With CacheBlend's 20% recomputation, many sensitive tokens remain stale..."

② In math reasoning tasks, KVCOMM shows significant stability

  • On GSM8K, CacheBlend's accuracy drops sharply as the number of agents increases (82.0% → 57.1%).
  • KVCOMM drops by only 1.9 percentage points (81.5% → 79.6%), always fluctuating within ±2% of the original baseline.

③ The Reuse Rate definition is stricter and more meaningful

"Note that the Reuse Rate for CacheBlend is defined as the proportion of tokens reusing KV-caches in whole token sequences, while the Reuse Rate of KVCOMM is defined as the frequency of agents reusing all KV-caches in the whole serving procedure."

This means that KVCOMM's 70%~87% reuse rate represents the frequency of completely skipping the entire prefill phase, leading to more thorough acceleration.
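The difference between the two definitions can be shown with a short sketch. The function names and the sample numbers are made up for illustration and are not results from the paper.

```python
def token_level_reuse_rate(reused_tokens: int, total_tokens: int) -> float:
    """CacheBlend-style definition: share of tokens whose KV-cache is reused."""
    return reused_tokens / total_tokens


def request_level_reuse_rate(fully_reused_flags) -> float:
    """KVCOMM-style definition: share of agent invocations that reused all KV-caches."""
    flags = list(fully_reused_flags)
    return sum(1 for f in flags if f) / len(flags)


if __name__ == "__main__":
    # Hypothetical run: 80 of every 100 tokens reused, but prefill never fully skipped.
    print(token_level_reuse_rate(80, 100))
    print(request_level_reuse_rate([False, False, False, False]))
    # Hypothetical run: 3 of 4 agent invocations skip prefill entirely.
    print(request_level_reuse_rate([True, True, False, True]))
```

Under the request-level definition, a method that always recomputes a fixed fraction of tokens scores zero, which is why the two reported reuse rates are not directly comparable.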

4.3 TTFT Breakdown and Scalability

Image

Table 2 reports TTFT per agent receiving 1K tokens from user input with 512 prefix tokens and sharing the 512 response tokens with succeeding agents. The first agent, lacking upstream caches (costing 86.6ms in “other” operations), shows modest acceleration (1.11×). Subsequent agents reduce prefilling dramatically to 26.9–38.6 ms via KVCOMM, achieving up to 7.82× speedup (Agent 5).

Image

We further examine scalability in Table 3, varying prefix (64–1K tokens) and output lengths (128–1K tokens) among three collaborating agents. KVCOMM achieves a minimum mean speedup of 2.24× (shortest setting) and scales effectively to 6.72× (longest setting), validating the approach’s efficiency gain as context length and complexity increase.
