
Conversation

@r-bit-rry
Contributor

@r-bit-rry r-bit-rry commented Jan 27, 2026

Implements Contextual Retrieval as described in Anthropic's engineering blog, enabling LLM-powered chunk contextualization during file ingestion for improved vector search quality.

Closes #4003

Motivation

Traditional RAG systems embed chunks in isolation, losing important document context. For example, a chunk stating "The company's revenue grew by 3% over the previous quarter" lacks context about which company or time period. Contextual Retrieval addresses this by using an LLM to prepend situational context to each chunk before embedding, significantly improving retrieval accuracy.

Changes

New Chunking Strategy: contextual

Added a new VectorStoreChunkingStrategyContextual type that can be specified when attaching files to vector stores:

client.vector_stores.files.create(
    vector_store_id=store_id,
    file_id=file_id,
    chunking_strategy={
        "type": "contextual",
        "contextual": {
            "model_id": "meta-llama/Llama-3.2-3B-Instruct",
            "max_chunk_size_tokens": 700,
            "chunk_overlap_tokens": 400,
        },
    },
)

Server-Level Configuration

Added ContextualRetrievalParams to VectorStoresConfig for server-level defaults, following the same pattern as RewriteQueryParams:

vector_stores_config:
  contextual_retrieval_params:
    model:
      provider_id: "ollama"
      model_id: "llama3.2:3b-instruct"
    default_timeout_seconds: 120
    default_max_concurrency: 3
    max_document_tokens: 100000

Implementation Details

  • Uses a StrEnum (_ChunkContextResult) for result tracking, following the HealthStatus pattern in the codebase
  • Async task results are aggregated after asyncio.gather completes (no shared mutable state or locks)
  • Uses an asyncio semaphore to bound concurrent chunk processing (default: 3 concurrent calls); see the sketch after this list
  • Graceful degradation: partial failures log warnings but don't fail the entire operation
  • Total failure (all chunks fail) raises RuntimeError to prevent silent data loss
  • Empty context responses are logged and chunks remain unchanged
  • Document size validation prevents processing documents that exceed token limits
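
A minimal sketch of the concurrency and result-tracking pattern described above. The _ChunkContextResult name matches the PR; the generate_context callable and the surrounding plumbing are illustrative assumptions, not the actual implementation:

import asyncio
import logging
from collections.abc import Awaitable, Callable
from enum import StrEnum

logger = logging.getLogger(__name__)


class _ChunkContextResult(StrEnum):
    SUCCESS = "success"
    EMPTY = "empty"
    FAILED = "failed"


async def _contextualize_chunks(
    chunks: list[str],
    generate_context: Callable[[str], Awaitable[str]],
    max_concurrency: int = 3,
) -> list[str]:
    """Prepend LLM-generated context to each chunk, bounded by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def process(chunk: str) -> tuple[_ChunkContextResult, str]:
        async with semaphore:
            try:
                context = await generate_context(chunk)
            except Exception:
                # Graceful degradation: log and keep the original chunk.
                logger.warning("Context generation failed; keeping chunk unchanged")
                return _ChunkContextResult.FAILED, chunk
        if not context.strip():
            logger.warning("Empty context response; keeping chunk unchanged")
            return _ChunkContextResult.EMPTY, chunk
        return _ChunkContextResult.SUCCESS, f"{context}\n\n{chunk}"

    # Results are aggregated only after gather completes: no shared state, no locks.
    results = await asyncio.gather(*(process(c) for c in chunks))
    if chunks and all(status is _ChunkContextResult.FAILED for status, _ in results):
        # Total failure raises rather than silently ingesting uncontextualized chunks.
        raise RuntimeError("Contextual retrieval failed for every chunk")
    return [text for _, text in results]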

Design Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Model resolution | Explicit param > Config default > Error | Follows the RewriteQueryParams pattern; provides flexibility while ensuring explicit configuration |
| Default timeout | 120 seconds | Balances large-document processing with reasonable wait times; no upper limit, to support diverse models and hardware |
| Default concurrency | 3 (min: 1, no upper limit) | Conservative default to avoid rate limiting; no upper limit for high-capacity deployments |
| Max document tokens | 100,000 (min: 1,000, no upper limit) | Prevents memory issues; no upper limit, to accommodate future models with larger context windows |
| Token estimation | len(content) / 4 | Standard approximation; exact tokenization would add latency without significant benefit (see the sketch below) |
| Prompt template | Anthropic's recommended prompt | Proven effective; customizable via the context_prompt parameter |
| No upper limits on config values | Lower bounds only | Future-proofs for evolving models and hardware; operators can configure based on their infrastructure |
| Result tracking | StrEnum with asyncio.gather | Idiomatic Python 3.12+; avoids locks and shared mutable state |
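
A sketch of the token-estimation and document-size guard reflected in the table above; the function names and the ValueError are assumptions about the implementation:

def _estimate_tokens(content: str) -> int:
    # Rough approximation (~4 characters per token); exact tokenization would add latency.
    return len(content) // 4


def _validate_document_size(document: str, max_document_tokens: int = 100_000) -> None:
    estimated = _estimate_tokens(document)
    if estimated > max_document_tokens:
        raise ValueError(
            f"Document is ~{estimated} tokens, exceeding the limit of {max_document_tokens}"
        )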

Default Prompt Template

Uses the prompt from Anthropic's research:

<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
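
For reference, a sketch of how the two placeholders might be filled in per chunk; plain string replacement is an assumption about how the template is rendered:

def _render_context_prompt(template: str, whole_document: str, chunk_content: str) -> str:
    # Custom prompts must contain both placeholders (enforced by config validation).
    return template.replace("{{WHOLE_DOCUMENT}}", whole_document).replace(
        "{{CHUNK_CONTENT}}", chunk_content
    )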

Testing

Unit Tests (11 tests):

  • Success case with string and list content
  • Partial and total failure handling
  • Custom prompt template
  • Empty response handling
  • Timeout handling
  • Config validation (model_id required, overlap < size, prompt placeholders); a sketch of this validation follows this list
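
For illustration, a sketch of the kind of Pydantic validation these unit tests exercise. Field names follow the request example above; the class name and body are assumptions, not the PR's actual model:

from pydantic import BaseModel, model_validator


class ContextualChunkingConfig(BaseModel):
    model_id: str  # required; omitting it is a validation error
    max_chunk_size_tokens: int
    chunk_overlap_tokens: int
    context_prompt: str | None = None

    @model_validator(mode="after")
    def _validate(self) -> "ContextualChunkingConfig":
        if self.chunk_overlap_tokens >= self.max_chunk_size_tokens:
            raise ValueError("chunk_overlap_tokens must be smaller than max_chunk_size_tokens")
        if self.context_prompt is not None:
            for placeholder in ("{{WHOLE_DOCUMENT}}", "{{CHUNK_CONTENT}}"):
                if placeholder not in self.context_prompt:
                    raise ValueError(f"context_prompt must contain {placeholder}")
        return self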

Integration Tests (2 tests):

  • End-to-end contextual chunking with real LLM
  • Error case when no model configured

Future Considerations

  • Prompt Caching: Anthropic's implementation leverages prompt caching for cost reduction. This could be added as a separate enhancement when llama-stack adds explicit cache control support.
  • Batch Processing: For very large documents, batch processing with progress reporting could improve UX.

BREAKING CHANGE: This PR adds VectorStoreChunkingStrategyContextual to the API schema.

@meta-cla meta-cla bot added the CLA Signed label Jan 27, 2026
@mergify

mergify bot commented Jan 27, 2026

This pull request has merge conflicts that must be resolved before it can be merged. @r-bit-rry please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 27, 2026
@mergify mergify bot removed the needs-rebase label Jan 27, 2026
@r-bit-rry r-bit-rry marked this pull request as ready for review January 27, 2026 17:08
@r-bit-rry
Contributor Author

@franciscojavierarceo @leseb please review

before embedding, improving search quality. See Anthropic's Contextual Retrieval.
"""

model: QualifiedModel | None = Field(
Collaborator


we should definitely enable a default model in the stack and use that one.

CC @cdoern @leseb I think it's probably time to make an InferenceConfig

@github-actions
Contributor

github-actions bot commented Jan 27, 2026

✱ Stainless preview builds

This PR will update the llama-stack-client SDKs with the following commit message.

feat(vector_io): Implement Contextual Retrieval for improved RAG search quality

Edit this comment to update it. It will appear in the SDK's changelogs.

llama-stack-client-node studio · code · diff

Your SDK built successfully.
generate ⚠️ · build ✅ · lint ✅ · test ❗

npm install https://pkg.stainless.com/s/llama-stack-client-node/b1dcb0db81980df1d76b04f51ef7bf58c315cc94/dist.tar.gz
New diagnostics (2 note)
💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextual` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.
💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextualConfig` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.
llama-stack-client-kotlin studio

Unknown conclusion: fatal

New diagnostics (4 warning, 2 note)
⚠️ Java/NestedAndParentClassNamesConflict: This schema's class has the same name as one of its parent classes, so it will be renamed from `Contextual` to `InnerContextual`.
⚠️ Java/NestedAndParentClassNamesConflict: This schema's class has the same name as one of its parent classes, so it will be renamed from `Contextual` to `InnerContextual`.
⚠️ Java/NestedAndParentClassNamesConflict: This schema's class has the same name as one of its parent classes, so it will be renamed from `Contextual` to `InnerContextual`.
⚠️ Java/NestedAndParentClassNamesConflict: This schema's class has the same name as one of its parent classes, so it will be renamed from `Contextual` to `InnerContextual`.
💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextual` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.
💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextualConfig` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.
llama-stack-client-python studio · conflict

Your SDK built successfully.

New diagnostics (3 note)

💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextual` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.
💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextualConfig` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.
💡 Python/NameNotAllowed: Encountered response property `model_id` which may conflict with Pydantic properties. Renamed to `api_model_id`.

Pydantic uses model_ as a protected namespace that shouldn't be used for attributes of our own API's models.
To provide a different name, use a merge transform.

llama-stack-client-go studio · conflict

Your SDK built successfully.

New diagnostics (2 note)
💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextual` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.
💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextualConfig` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.
llama-stack-client-openapi studio · code · diff

Your SDK built successfully.
generate ⚠️ · lint ⏳ · test ⏳

New diagnostics (2 note)
💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextual` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.
💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextualConfig` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Last updated: 2026-01-27 18:29:19 UTC

Contributor

@skamenan7 skamenan7 left a comment


Great implementation! Thanks.

Looks like VectorStoreChunkingStrategyContextual and VectorStoreChunkingStrategyContextualConfig aren't in the __all__ list in models.py, so they won't be exported properly. Please add them.
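
A sketch of the requested change, assuming models.py already defines an __all__ list (the existing entries are placeholders):

__all__ = [
    # ... existing exports ...
    "VectorStoreChunkingStrategyContextual",
    "VectorStoreChunkingStrategyContextualConfig",
]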

pytest.skip("No text model configured for contextual chunking test")

compat_client = compat_client_with_empty_stores
if isinstance(compat_client, OpenAI):
Contributor Author

@r-bit-rry r-bit-rry Jan 28, 2026


The test_openai_vector_store_contextual_chunking test was failing for the openai_client fixture because:

  • The OpenAI client has a hardcoded 30s timeout
  • Contextual chunking requires LLM calls that can take longer than 30s with Ollama
  • No recordings existed for the openai_client variant
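
A sketch of the guard this implies, assuming the fix simply skips the openai_client variant; the test body and skip message are illustrative:

import pytest
from openai import OpenAI


def test_openai_vector_store_contextual_chunking(compat_client_with_empty_stores):
    compat_client = compat_client_with_empty_stores
    # The OpenAI client's hardcoded 30s timeout is too short for per-chunk LLM
    # calls against Ollama, and no recordings exist for this client variant.
    if isinstance(compat_client, OpenAI):
        pytest.skip("OpenAI client timeout too short for contextual chunking")
    # ... attach a file with the contextual chunking strategy and assert results ...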
