
Conversation

@r-bit-rry
Contributor

@r-bit-rry r-bit-rry commented Jan 27, 2026

Implements Contextual Retrieval as described in Anthropic's engineering blog, enabling LLM-powered chunk contextualization during file ingestion for improved vector search quality.

Closes #4003

Motivation

Traditional RAG systems embed chunks in isolation, losing important document context. For example, a chunk stating "The company's revenue grew by 3% over the previous quarter" lacks context about which company or time period. Contextual Retrieval addresses this by using an LLM to prepend situational context to each chunk before embedding, significantly improving retrieval accuracy.

Changes

New Chunking Strategy: contextual

Added a new VectorStoreChunkingStrategyContextual type that can be specified when attaching files to vector stores:

client.vector_stores.files.create(
    vector_store_id=store_id,
    file_id=file_id,
    chunking_strategy={
        "type": "contextual",
        "contextual": {
            "model_id": "meta-llama/Llama-3.2-3B-Instruct",
            "max_chunk_size_tokens": 700,
            "chunk_overlap_tokens": 400,
        },
    },
)

Server-Level Configuration

Added ContextualRetrievalParams to VectorStoresConfig for server-level defaults, following the same pattern as RewriteQueryParams:

vector_stores_config:
  contextual_retrieval_params:
    model:
      provider_id: "ollama"
      model_id: "llama3.2:3b-instruct"
    default_timeout_seconds: 120
    default_max_concurrency: 3
    max_document_tokens: 100000

Implementation Details

  • Uses a StrEnum (_ChunkContextResult) for result tracking, following the HealthStatus pattern in the codebase
  • Async task results are aggregated after asyncio.gather completes (no shared mutable state or locks)
  • Uses an asyncio semaphore to bound concurrent chunk processing (default: 3 concurrent calls); see the sketch after this list
  • Graceful degradation: partial failures log warnings but don't fail the entire operation
  • Total failure (all chunks fail) raises RuntimeError to prevent silent data loss
  • Empty context responses are logged and chunks remain unchanged
  • Document size validation prevents processing documents that exceed token limits
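
A minimal sketch of the concurrency and result-tracking pattern described above. The _ChunkContextResult name matches the PR; the generate_context callable and the surrounding plumbing are illustrative assumptions, not the actual implementation:

import asyncio
import logging
from collections.abc import Awaitable, Callable
from enum import StrEnum

logger = logging.getLogger(__name__)


class _ChunkContextResult(StrEnum):
    SUCCESS = "success"
    EMPTY = "empty"
    FAILED = "failed"


async def _contextualize_chunks(
    chunks: list[str],
    generate_context: Callable[[str], Awaitable[str]],
    max_concurrency: int = 3,
) -> list[str]:
    """Prepend LLM-generated context to each chunk, bounded by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def process(chunk: str) -> tuple[_ChunkContextResult, str]:
        async with semaphore:
            try:
                context = await generate_context(chunk)
            except Exception:
                # Graceful degradation: log and keep the original chunk.
                logger.warning("Context generation failed; keeping chunk unchanged")
                return _ChunkContextResult.FAILED, chunk
        if not context.strip():
            logger.warning("Empty context response; keeping chunk unchanged")
            return _ChunkContextResult.EMPTY, chunk
        return _ChunkContextResult.SUCCESS, f"{context}\n\n{chunk}"

    # Results are aggregated only after gather completes: no shared state, no locks.
    results = await asyncio.gather(*(process(c) for c in chunks))
    if chunks and all(status is _ChunkContextResult.FAILED for status, _ in results):
        # Total failure raises rather than silently ingesting uncontextualized chunks.
        raise RuntimeError("Contextual retrieval failed for every chunk")
    return [text for _, text in results]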

Design Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Model resolution | Explicit param > Config default > Error | Follows the RewriteQueryParams pattern; provides flexibility while ensuring explicit configuration |
| Default timeout | 120 seconds | Balances large-document processing with reasonable wait times; no upper limit, to support diverse models and hardware |
| Default concurrency | 3 (min: 1, no upper limit) | Conservative default to avoid rate limiting; no upper limit for high-capacity deployments |
| Max document tokens | 100,000 (min: 1,000, no upper limit) | Prevents memory issues; no upper limit, to accommodate future models with larger context windows |
| Token estimation | len(content) / 4 | Standard approximation; exact tokenization would add latency without significant benefit (see the sketch below) |
| Prompt template | Anthropic's recommended prompt | Proven effective; customizable via the context_prompt parameter |
| No upper limits on config values | Lower bounds only | Future-proofs for evolving models and hardware; operators can configure based on their infrastructure |
| Result tracking | StrEnum with asyncio.gather | Idiomatic Python 3.12+; avoids locks and shared mutable state |
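
A sketch of the token-estimation and document-size guard reflected in the table above; the function names and the ValueError are assumptions about the implementation:

def _estimate_tokens(content: str) -> int:
    # Rough approximation (~4 characters per token); exact tokenization would add latency.
    return len(content) // 4


def _validate_document_size(document: str, max_document_tokens: int = 100_000) -> None:
    estimated = _estimate_tokens(document)
    if estimated > max_document_tokens:
        raise ValueError(
            f"Document is ~{estimated} tokens, exceeding the limit of {max_document_tokens}"
        )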

Default Prompt Template

Uses the prompt from Anthropic's research:

<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
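
For reference, a sketch of how the two placeholders might be filled in per chunk; plain string replacement is an assumption about how the template is rendered:

def _render_context_prompt(template: str, whole_document: str, chunk_content: str) -> str:
    # Custom prompts must contain both placeholders (enforced by config validation).
    return template.replace("{{WHOLE_DOCUMENT}}", whole_document).replace(
        "{{CHUNK_CONTENT}}", chunk_content
    )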

Testing

Unit Tests (11 tests):

  • Success case with string and list content
  • Partial and total failure handling
  • Custom prompt template
  • Empty response handling
  • Timeout handling
  • Config validation (model_id required, overlap < size, prompt placeholders); a sketch of this validation follows this list
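
For illustration, a sketch of the kind of Pydantic validation these unit tests exercise. Field names follow the request example above; the class name and body are assumptions, not the PR's actual model:

from pydantic import BaseModel, model_validator


class ContextualChunkingConfig(BaseModel):
    model_id: str  # required; omitting it is a validation error
    max_chunk_size_tokens: int
    chunk_overlap_tokens: int
    context_prompt: str | None = None

    @model_validator(mode="after")
    def _validate(self) -> "ContextualChunkingConfig":
        if self.chunk_overlap_tokens >= self.max_chunk_size_tokens:
            raise ValueError("chunk_overlap_tokens must be smaller than max_chunk_size_tokens")
        if self.context_prompt is not None:
            for placeholder in ("{{WHOLE_DOCUMENT}}", "{{CHUNK_CONTENT}}"):
                if placeholder not in self.context_prompt:
                    raise ValueError(f"context_prompt must contain {placeholder}")
        return self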

Integration Tests (2 tests):

  • End-to-end contextual chunking with real LLM
  • Error case when no model configured

Future Considerations

  • Prompt Caching: Anthropic's implementation leverages prompt caching for cost reduction. This could be added as a separate enhancement when llama-stack adds explicit cache control support.
  • Batch Processing: For very large documents, batch processing with progress reporting could improve UX.

BREAKING CHANGE: This PR adds VectorStoreChunkingStrategyContextual to the API schema.

@meta-cla meta-cla bot added the CLA Signed label Jan 27, 2026
@mergify

mergify bot commented Jan 27, 2026

This pull request has merge conflicts that must be resolved before it can be merged. @r-bit-rry please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 27, 2026
@mergify mergify bot removed the needs-rebase label Jan 27, 2026
@r-bit-rry r-bit-rry marked this pull request as ready for review January 27, 2026 17:08
@r-bit-rry
Contributor Author

@franciscojavierarceo @leseb please review

before embedding, improving search quality. See Anthropic's Contextual Retrieval.
"""

model: QualifiedModel | None = Field(
Collaborator


we should definitely enable a default model in the stack and use that one.

CC @cdoern @leseb I think it's probably time to make an InferenceConfig

@github-actions
Contributor

github-actions bot commented Jan 27, 2026

✱ Stainless preview builds

This PR will update the llama-stack-client SDKs with the following commit message.

feat(vector_io): Implement Contextual Retrieval for improved RAG search quality

Edit this comment to update it. It will appear in the SDK's changelogs.

llama-stack-client-node studio · code · diff

Your SDK built successfully.
generate ⚠️ · build ✅ · lint ✅ · test ❗

npm install https://pkg.stainless.com/s/llama-stack-client-node/b1dcb0db81980df1d76b04f51ef7bf58c315cc94/dist.tar.gz
New diagnostics (2 note)
💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextual` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.
💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextualConfig` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.
llama-stack-client-kotlin studio

Unknown conclusion: fatal

New diagnostics (4 warning, 2 note)
⚠️ Java/NestedAndParentClassNamesConflict: This schema's class has the same name as one of its parent classes, so it will be renamed from `Contextual` to `InnerContextual`.
⚠️ Java/NestedAndParentClassNamesConflict: This schema's class has the same name as one of its parent classes, so it will be renamed from `Contextual` to `InnerContextual`.
⚠️ Java/NestedAndParentClassNamesConflict: This schema's class has the same name as one of its parent classes, so it will be renamed from `Contextual` to `InnerContextual`.
⚠️ Java/NestedAndParentClassNamesConflict: This schema's class has the same name as one of its parent classes, so it will be renamed from `Contextual` to `InnerContextual`.
💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextual` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.
💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextualConfig` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.
llama-stack-client-python studio · conflict

Your SDK built successfully.

New diagnostics (3 note)

💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextual` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.
💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextualConfig` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.
💡 Python/NameNotAllowed: Encountered response property `model_id` which may conflict with Pydantic properties. Renamed to `api_model_id`.

Pydantic uses model_ as a protected namespace that shouldn't be used for attributes of our own API's models.
To provide a different name, use a merge transform.

llama-stack-client-go studio · conflict

Your SDK built successfully.

New diagnostics (2 note)
💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextual` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.
💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextualConfig` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.
llama-stack-client-openapi studio · code · diff

Your SDK built successfully.
generate ⚠️ · lint ⏳ · test ⏳

New diagnostics (2 note)
💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextual` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.
💡 Model/Recommended: `#/components/schemas/VectorStoreChunkingStrategyContextualConfig` could potentially be defined as a [model](https://www.stainless.com/docs/guides/configure#models) within `#/resources/vector_stores`.

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Last updated: 2026-01-27 18:29:19 UTC

Contributor

@skamenan7 skamenan7 left a comment


Great implementation! Thanks.

Looks like VectorStoreChunkingStrategyContextual and VectorStoreChunkingStrategyContextualConfig aren't in the __all__ list in models.py, so they won't be exported properly. Please add them.
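
A sketch of the requested change, assuming models.py already defines an __all__ list (the existing entries are placeholders):

__all__ = [
    # ... existing exports ...
    "VectorStoreChunkingStrategyContextual",
    "VectorStoreChunkingStrategyContextualConfig",
]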

pytest.skip("No text model configured for contextual chunking test")

compat_client = compat_client_with_empty_stores
if isinstance(compat_client, OpenAI):
Contributor Author

@r-bit-rry r-bit-rry Jan 28, 2026


The test_openai_vector_store_contextual_chunking test was failing for the openai_client fixture because:

  • The OpenAI client has a hardcoded 30s timeout
  • Contextual chunking requires LLM calls that can take longer than 30s with Ollama
  • No recordings existed for the openai_client variant
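
A sketch of the guard this implies, assuming the fix simply skips the openai_client variant; the test body and skip message are illustrative:

import pytest
from openai import OpenAI


def test_openai_vector_store_contextual_chunking(compat_client_with_empty_stores):
    compat_client = compat_client_with_empty_stores
    # The OpenAI client's hardcoded 30s timeout is too short for per-chunk LLM
    # calls against Ollama, and no recordings exist for this client variant.
    if isinstance(compat_client, OpenAI):
        pytest.skip("OpenAI client timeout too short for contextual chunking")
    # ... attach a file with the contextual chunking strategy and assert results ...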
