NIPS 25 | KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems

## 0  Paper Information
**Title:** KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
**Authors:** Hancheng Ye<sup>1</sup>, Zhengqi Gao<sup>2</sup>, Mingyuan Ma<sup>1</sup>, Qinsi Wang<sup>1</sup>, Yuzhe Fu<sup>1</sup>, Ming-Yu Chung<sup>1</sup>, Yueqian Lin<sup>1</sup>, Zhijian Liu<sup>3</sup>, Jianyi Zhang<sup>1</sup>, Danyang Zhuo<sup>1</sup>, Yiran Chen<sup>1</sup>
**Institutions:** <sup>1</sup>Duke University, <sup>2</sup>MIT, <sup>3</sup>NVIDIA
**Paper:** [[2510.12872] KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems](https://arxiv.org/abs/2510.12872)
**GitHub Link:** [FastMAS/KVCOMM: [NeurIPS'25] KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems](https://github.com/FastMAS/KVCOMM)
**OpenReview:** [KVComm: Enabling Efficient LLM Communication through Selective KV Sharing | OpenReview](https://openreview.net/forum?id=F7rUng23nw)
**Author Page:** [HankYe (Hancheng Ye)](https://github.com/HankYe)

## 1 Background and Core Problem
### 1.1  Background

Although multiple agents often share overlapping context (such as retrieved passages or their peers' outputs), each agent recomputes the KV-cache for all of its input tokens from scratch, which makes the prefill stage highly redundant. The authors call this the multi-context redundancy problem.
>For example, a single 8B Llama needs ∼430 ms to prefill a 3K-token prompt on one H100 GPU. If each of M agents receives messages from all of its peers, the total prefilling complexity of these repeated computations scales as O(M^2), posing inefficiency in the utilization of computation and a major challenge for real-time multi-agent collaboration.

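
To make the O(M^2) scaling concrete, here is a minimal counting sketch. It assumes, purely for illustration, that every agent emits one message of `msg_len` tokens and that each peer prefills that message from scratch. The function name and the 512-token message length are hypothetical and not taken from the paper.

```python
def redundant_prefill_tokens(num_agents: int, msg_len: int) -> int:
    """Count tokens that are prefilled again by peers, with no cache sharing.

    Assumption (illustrative only): each agent emits exactly one message of
    msg_len tokens, and each of the other (num_agents - 1) agents prefills
    that message from scratch.
    """
    return num_agents * (num_agents - 1) * msg_len


if __name__ == "__main__":
    for m in (2, 3, 5):
        # Grows quadratically in the number of agents: 1024, 3072, 10240.
        print(m, redundant_prefill_tokens(m, msg_len=512))
```

Going from 2 to 5 agents multiplies the repeated prefill work by 10 in this toy model, even though each individual message has the same length.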

### 1.2  Why
Although different agents process the **same shared text**, each agent prepends its **own agent-specific prefix context** to it, which causes two compounding problems:
1. **Positional encoding shift**: Modern LLMs commonly use RoPE (rotary position embedding). A change in prefix length shifts the absolute position of every subsequent token, so the cached keys of the shared text carry the wrong rotation.
2. **Attention context change**: Beyond the first layer, keys and values are computed from hidden states that have already attended to the prefix, so a different prefix changes the KV values of the shared text even when the positions match.
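
The positional part of the problem can be illustrated with a small NumPy sketch of RoPE on a single head vector. It uses the interleaved-pair convention and a base of 10000, which are common choices but are assumptions here (implementations differ in how they pair dimensions). The function name `rope_rotate` is hypothetical.

```python
import numpy as np


def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate vector x (even length) as RoPE would at integer position pos.

    Dimension pair (2i, 2i+1) is rotated by the angle pos * base**(-2i/d).
    A negative pos undoes a rotation (de-rotation).
    """
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    angle = pos * inv_freq
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out


rng = np.random.default_rng(0)
k = rng.standard_normal(64)          # un-rotated key of one token

k_at_10 = rope_rotate(k, pos=10)     # key cached behind a 10-token prefix
k_at_25 = rope_rotate(k, pos=25)     # same token behind a 25-token prefix
shifted = rope_rotate(k_at_10, pos=15)  # re-rotate the cached key by the gap

print(np.allclose(shifted, k_at_25))  # True: the two rotations compose
```

The purely positional mismatch is therefore cheap to correct by rotating cached keys. The second problem, the content-dependent change, is what the offset machinery in the rest of these notes addresses.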

### 1.3  Offset-Variance Problem
The authors precisely define this phenomenon as the **Multi-context Redundancy** problem, and identify its fundamental challenge as the **Offset-Variance Problem of KV-cache**.

<img width="1127" height="840" alt="Image" src="https://github.com/user-attachments/assets/08fa931a-20e8-4a21-88ef-0e3444a372bc" />

Current work to reduce prefill overhead falls into four categories: prompt-level reuse, selective recomputation, cache compression, and kernel optimization. As shown in the figure, the authors compare their work with the two most relevant categories:
- **No Cache Reuse (baseline)**: Prefill all tokens from scratch for each request. Accurate but slow.
- **Selective Recomputation (e.g., CacheBlend)**: Reuse most of the KV-cache and recompute only the "critical" tokens. The ratio is fixed in advance (e.g., reuse 80% and recompute 20%) and applied regardless of workload.
- **KVCOMM**: **Reuse all KV, but add a "context-aware offset" to each shared text segment** to align the deviations introduced by different prefixes. This offset is not recomputed, but **approximated from historically similar requests**.

In a nutshell: **Others "cut out part of the recomputation"; KVCOMM "recomputes nothing, but compensates for deviations by looking up historical anchors".**

## 2 Observation and Opportunity
### 2.1  Core Observations x2

<img width="1132" height="490" alt="Image" src="https://github.com/user-attachments/assets/925de099-7bd3-49c9-8c54-9bab92b59f43" />

**Question:**
The experiment in Figure 1a verifies that "the offset has structure", but does not yet solve the problem of "how to predict the offset". Therefore, we further consider: **If two tokens are close in the semantic space, will their KV offsets also be close?**

#### Conclusions
1. **The KV offset of the same token in different contexts is not random noise, but a structured deviation with high consistency** (Figure 1a)
2. **We must first align Keys to the same positional baseline via RoPE de-rotation before fairly comparing "content understanding deviation"** (Figure 1b)
3. **After positional alignment, the KV offset curves of two semantically similar tokens nearly coincide: proximity in embedding space carries over to the KV-cache offset space** (Figure 1b)

### 2.2  Token-level KV proximity

>**Motivation**: "KVCOMM hinges on the empirical observation that per-token KV vectors remain remarkably similar across distinct conversational contexts as long as the model parameters are shared."


<img width="1132" height="677" alt="Image" src="https://github.com/user-attachments/assets/f89250ec-eb4a-4114-a717-c5036aa17fde" />

#### Conclusions
1. Proposition 1: Under the same prefix, embedding distance constrains KV distance (Figure 4 a-d)
2. Proposition 2: Across different prefixes, embedding distance constrains offset distance (Figure 4 c-d)

## 3 System Design

### 3.1 Overall Architecture: Training-Free & Online

<img width="1132" height="773" alt="Image" src="https://github.com/user-attachments/assets/040f5713-8f9c-4df2-85fb-ca2fd9eaa0b8" />

1. **Initialization**: When the system starts, all agents precompute and cache the Base KV-cache of all Prefix Segments in their own prompt templates.
2. **Placeholder Readiness Check**: When a new request arrives, the agent checks whether the Base KV of each required Placeholder already exists in shared memory. Any missing Base KV is computed in parallel to fill the gap.
3. **Reuse or Fallback**: The system decides whether the current Placeholder is reusable according to the matching criteria (see 3.2). If it is not reusable, the system falls back to standard Dense Prefill and writes the newly generated KV and its offset into the Anchor Pool to expand coverage.
4. **Offset Approximation**: If all Placeholders match successfully, the system predicts the KV offset in the current context via anchor interpolation, updating the KV-cache of all Placeholders and adjacent prefixes in parallel.
5. **Decoding**: Concatenate the updated KV segments in order, **skip the Prefill phase entirely**, and start autoregressive decoding.
6. **Anchor Update**: After decoding is complete, the newly generated Response KV-cache enters the Anchor prediction module. If a similar sample exists in the pool, only metadata is updated; otherwise, it is added as a new anchor. When the pool capacity reaches the threshold `V`, the least active anchor is evicted using an LRU strategy, enabling adaptive memory management.
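
The capacity-bounded pool in step 6 can be sketched with an `OrderedDict`. This is a toy model under stated assumptions: the class and method names are hypothetical, anchors are opaque values, and exact-key lookup stands in for the similarity matching that the notes describe in 3.2.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class AnchorPool:
    """Toy anchor pool with capacity V and least-recently-used eviction."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._anchors: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the anchor for key and mark it as recently used."""
        if key not in self._anchors:
            return None
        self._anchors.move_to_end(key)
        return self._anchors[key]

    def put(self, key: Hashable, anchor: Any) -> None:
        """Insert or refresh an anchor, evicting the oldest one if over capacity."""
        if key in self._anchors:
            self._anchors.move_to_end(key)
        self._anchors[key] = anchor
        if len(self._anchors) > self.capacity:
            self._anchors.popitem(last=False)  # drop the least recently used

    def __contains__(self, key: Hashable) -> bool:
        return key in self._anchors

    def __len__(self) -> int:
        return len(self._anchors)


pool = AnchorPool(capacity=2)
pool.put("a", "anchor-a")
pool.put("b", "anchor-b")
pool.get("a")              # "a" becomes the most recently used entry
pool.put("c", "anchor-c")  # pool is full, so "b" is evicted
print("b" in pool, "a" in pool, len(pool))  # False True 2
```

A real pool would also need the "update metadata only" path for near-duplicate samples, which this sketch leaves out.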

### 3.2 Anchor Pool Design
| Component               | Technical Definition                                                  | Physical Meaning                                   |
| ----------------------- | --------------------------------------------------------------------- | -------------------------------------------------- |
| **Base KV-cache**       | The KV of a Placeholder in an isolated context (without external prefix) | "Semantic representation of the text itself"      |
| **Placeholder Offset**  | The difference `Δ_ph` between the actual KV of this Placeholder in a specific agent context and the Base KV | "Understanding deviation caused by the current role/prefix" |
| **Prefix Offset**       | The KV offset `Δ_pf` of the fixed prefix segment immediately following this Placeholder | "Attention redistribution of subsequent connecting text after context injection" |

**Question: Why is Prefix Offset needed?**  
The paper attributes this to the attention-sink behavior of Transformers, which makes local context dependencies very strong. When new content is injected into a Placeholder, the attention distribution of the fixed prefix segment immediately following it (e.g., `"Answers from other agents are:"`) shifts as well. Updating only the Placeholder while ignoring the offset of the adjacent prefix causes a semantic discontinuity between the two segments.
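
The three components in the table can be pictured as one record per anchor. The sketch below is an illustration under assumptions: the names `AnchorEntry`, `make_anchor`, and `reconstruct` are hypothetical, the KV tensors are toy 2-D arrays of shape (tokens, dim), and keys are assumed to be position-aligned already before any subtraction (see conclusion 2 in 2.1).

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class AnchorEntry:
    base_kv: np.ndarray   # KV of the placeholder computed in isolation
    delta_ph: np.ndarray  # placeholder offset: contextual KV minus base KV
    delta_pf: np.ndarray  # offset of the fixed prefix segment that follows


def make_anchor(base_ph, ctx_ph, base_pf, ctx_pf) -> AnchorEntry:
    """Record offsets observed during one dense prefill in an agent context."""
    return AnchorEntry(
        base_kv=base_ph,
        delta_ph=ctx_ph - base_ph,
        delta_pf=ctx_pf - base_pf,
    )


def reconstruct(anchor: AnchorEntry, base_pf: np.ndarray):
    """Rebuild contextual KV for the placeholder and its following prefix."""
    return anchor.base_kv + anchor.delta_ph, base_pf + anchor.delta_pf


rng = np.random.default_rng(0)
base_ph, ctx_ph = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
base_pf, ctx_pf = rng.standard_normal((3, 8)), rng.standard_normal((3, 8))

anchor = make_anchor(base_ph, ctx_ph, base_pf, ctx_pf)
ph_kv, pf_kv = reconstruct(anchor, base_pf)
print(np.allclose(ph_kv, ctx_ph), np.allclose(pf_kv, ctx_pf))  # True True
```

For the exact context an anchor was recorded in, adding the offsets back recovers the contextual KV exactly. The approximation only enters when an anchor's offsets are reused for a different but similar request.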

### 3.3  Algorithm
1. Step 1: Offset Approximation
2. Step 2: Positional Alignment
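
Step 1 can be sketched as an interpolation over stored anchors. The weighting below (a softmax over negative Euclidean embedding distance, with a `temperature` parameter) is an assumption chosen for illustration and is not confirmed by these notes to be the paper's exact rule. The function name and array shapes are hypothetical too. Step 2, re-rotating keys to their new positions, is omitted here.

```python
import numpy as np


def approximate_offset(
    query_emb: np.ndarray,       # (d,) embedding of the incoming placeholder
    anchor_embs: np.ndarray,     # (A, d) embeddings of stored anchors
    anchor_offsets: np.ndarray,  # (A, T, D) stored KV offsets per anchor
    temperature: float = 1.0,
) -> np.ndarray:
    """Blend anchor offsets, giving closer anchors larger weights."""
    dists = np.linalg.norm(anchor_embs - query_emb, axis=1)  # (A,)
    logits = -dists / temperature
    logits -= logits.max()            # subtract the max for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()
    # Contract the anchor axis: (A,) with (A, T, D) gives (T, D).
    return np.tensordot(weights, anchor_offsets, axes=1)


anchor_embs = np.array([[0.0, 0.0], [10.0, 0.0]])
anchor_offsets = np.stack([np.zeros((2, 3)), np.ones((2, 3))])

# A query halfway between both anchors gets the average of their offsets.
mid = approximate_offset(np.array([5.0, 0.0]), anchor_embs, anchor_offsets)
print(np.allclose(mid, 0.5))  # True

# The estimated contextual KV is then base KV plus the blended offset.
base_kv = np.zeros((2, 3))
kv_estimate = base_kv + mid
```

This matches the observation in 2.2 that embedding proximity bounds offset proximity: an anchor that is close in embedding space should contribute an offset close to the true one.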

## 4 Experiments and Results

### 4.1 Experimental Setup

| Dimension                | Specific Configuration                                                                                                 |
| ------------------------ | ---------------------------------------------------------------------------------------------------------------------- |
| **Multi-agent Architecture** | Fully-connected DAG with 2 to 5 collaborating agents; messages are passed in one direction                            |
| **Model Selection**      | Llama-3.1-8B-Instruct (RAG/math tasks); Qwen-2.5-Coder-7B-Instruct (code tasks)                                       |
| **Benchmark Datasets**   | MMLU (knowledge QA), GSM8K (math reasoning), HumanEval (code generation)                                              |
| **Comparison Baselines** | Original (no reuse), CacheBlend (selective recomputation, fixed 20% token recompute)                                   |
| **Evaluation Metrics**   | Accuracy/Pass@1; Reuse Rate, TTFT, average speedup                                                                     |
| **Hardware Environment** | Single NVIDIA H100 GPU, maximum generation length uniformly set to 512 tokens                                         |

### 4.2 Main Experiments: Lossless Accuracy, High Reuse Rate

<img width="1130" height="937" alt="Image" src="https://github.com/user-attachments/assets/2e512d38-9504-4184-a289-c04b2bc8cc35" />

**① Code tasks are extremely sensitive to cache accuracy**

> "In the HumanEval coding benchmark, KVCOMM delivers stable Pass@1 scores (81.4%–83.2%), significantly surpassing CacheBlend by an average margin of 53%."

**Analysis**
> "HumanEval agents produce code with many syntax separators (e.g., `.`, `;`, `!`), which induce diverse and prefix-sensitive KV-cache distributions. With CacheBlend's 20% recomputation, many sensitive tokens remain stale..."

**② In math reasoning tasks, KVCOMM shows significant stability**
- On GSM8K, CacheBlend's accuracy drops sharply as the number of agents increases (82.0% → 57.1%)
- KVCOMM drops by only 1.9 percentage points (81.5% → 79.6%) and stays within ±2% of the original (no-reuse) baseline

**③ The Reuse Rate definition is stricter and more meaningful**
> "Note that the Reuse Rate for CacheBlend is defined as the proportion of tokens reusing KV-caches in whole token sequences, while the Reuse Rate of KVCOMM is defined as the frequency of agents reusing all KV-caches in the whole serving procedure."

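This means that KVCOMM's 70%–87% reuse rate is the fraction of agent invocations that **skip the entire prefill phase**, whereas CacheBlend's figure counts reused tokens inside requests that still run a partial prefill. The two numbers are therefore not directly comparable.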
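
The difference between the two definitions is easy to see in a small sketch. The function names and the numbers are made up for illustration and are not results from the paper.

```python
from typing import Sequence


def token_level_reuse_rate(reused_tokens: int, total_tokens: int) -> float:
    """CacheBlend-style metric: share of tokens whose KV-cache was reused."""
    return reused_tokens / total_tokens


def request_level_reuse_rate(fully_reused: Sequence[bool]) -> float:
    """KVCOMM-style metric: share of agent calls that reused every KV-cache."""
    return sum(fully_reused) / len(fully_reused)


# Every request reuses 80% of its tokens but still runs a partial prefill.
print(token_level_reuse_rate(reused_tokens=80, total_tokens=100))  # 0.8

# Three of four agent calls skip prefill entirely; one falls back to dense prefill.
print(request_level_reuse_rate([True, True, False, True]))  # 0.75
```

Under the token-level metric no request avoids prefill, while under the request-level metric every counted reuse removes the prefill step for that call.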

### 4.3  Latency: TTFT Speedup and Scalability

<img width="564" height="277" alt="Image" src="https://github.com/user-attachments/assets/7f8bb399-58d5-4491-a8fd-a6a971d1bb86" />

>Table 2 reports TTFT per agent receiving 1K tokens from user input with 512 prefix tokens and sharing the 512 response tokens with succeeding agents. The first agent, lacking upstream caches (costing 86.6ms in “other” operations), shows modest acceleration (1.11×). Subsequent agents reduce prefilling dramatically to 26.9–38.6 ms via KVCOMM, achieving up to 7.82× speedup (Agent 5).


<img width="569" height="279" alt="Image" src="https://github.com/user-attachments/assets/66994832-8a8f-4925-868d-45a8d74e8c03" />

>We further examine scalability in Table 3, varying prefix (64–1K tokens) and output lengths (128–1K tokens) among three collaborating agents. KVCOMM achieves a minimum mean speedup of 2.24× (shortest setting) and scales effectively to 6.72× (longest setting), validating the approach’s efficiency gain as context length and complexity increase.
