feat: migrate Inference API to FastAPI router system #4755
Merged
Conversation
Update demo script to use the newer LlamaStackClient and Agent API instead of the manual OpenAI client approach. Changes:
- Switch from OpenAI client to LlamaStackClient
- Use Agent API for simplified RAG implementation
- Auto-select models with preference for Ollama (no API key needed)
- Reduce code complexity from ~136 to ~102 lines
- Remove manual RAG implementation in favor of agentic approach

This provides a cleaner, more modern example for users getting started with Llama Stack.
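For readers who want a feel for the agentic flow this commit moves to, here is a minimal, hypothetical sketch (not the PR's actual demo_script.py): the base URL, prompt, and instructions are placeholders, and it assumes a llama-stack-client version that exports `Agent` at the top level.

```python
# Hypothetical sketch of the LlamaStackClient + Agent flow; base URL and
# prompt text are placeholders, not taken from the PR.
from llama_stack_client import Agent, LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Auto-select a model, preferring Ollama so no API key is needed.
models = [m.identifier for m in client.models.list()]
model_id = next((m for m in models if "ollama" in m), models[0])

agent = Agent(client, model=model_id, instructions="You are a helpful assistant.")
session_id = agent.create_session("demo")

turn = agent.create_turn(
    messages=[{"role": "user", "content": "What is Llama Stack?"}],
    session_id=session_id,
    stream=False,
)
print(turn.output_message.content)
```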
Simplify the Ollama model selection logic in the detailed tutorial. Changes:
- Replace complex custom_metadata filtering with simple ID check
- Use direct 'ollama' in model ID check instead of metadata lookup
- Makes code more concise and easier to understand

This aligns with the simplified approach used in the updated demo_script.py.
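The simplification amounts to replacing a metadata lookup with a substring test on the model identifier. A hedged illustration (the shape of the removed metadata filter is an assumption):

```python
# Sketch of the simplified selection; the old metadata-based filter shown
# in the comment is a reconstruction, not the tutorial's exact code.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
models = client.models.list()

# Old (removed): filter on provider metadata, e.g.
#   [m for m in models if (m.custom_metadata or {}).get("provider") == "ollama"]
# New: a direct check on the model ID.
ollama_models = [m.identifier for m in models if "ollama" in m.identifier]
print(ollama_models)
```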
Update the agent examples to use the latest API methods. Changes:
- Simplify model selection (already applied in previous commit)
- Use response.output_text instead of response.output_message.content
- Use direct print(event) instead of event.print() for streaming

This aligns the tutorial with the current Agent API implementation.
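In context, the two accessor changes look roughly like the sketch below; the model ID is a placeholder and the setup is assumed, with only the `output_text` and `print(event)` usages taken from this commit.

```python
# Sketch of the updated accessors from this commit; model ID and prompts
# are placeholders.
from llama_stack_client import Agent, LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
agent = Agent(client, model="ollama/llama3.2:3b", instructions="Be concise.")
session_id = agent.create_session("tutorial")

turn = agent.create_turn(
    messages=[{"role": "user", "content": "Hello"}],
    session_id=session_id,
    stream=False,
)
print(turn.output_text)  # was: turn.output_message.content

# Streaming now prints the raw event objects (was: event.print()).
for event in agent.create_turn(
    messages=[{"role": "user", "content": "Hello again"}],
    session_id=session_id,
    stream=True,
):
    print(event)
```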
Modernize the RAG agent example to use the latest Vector Stores API. Changes:
- Replace deprecated VectorDB API with Vector Stores API
- Use file upload and vector_stores.create() instead of rag_tool.insert()
- Download files via requests and upload to Llama Stack
- Update to use file_search tool type with vector_store_ids
- Simplify model selection with Ollama preference
- Improve logging and user feedback
- Update event logging to handle both old and new API
- Add note about known server routing issues

This provides a more accurate example using current Llama Stack APIs.
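A hedged sketch of the Vector Stores flow described above; the document URL, store name, model ID, and the `purpose` value are assumptions, and the parameter shapes follow the OpenAI-compatible Files/Vector Stores surface rather than the example's exact code.

```python
# Hypothetical Vector Stores RAG flow; URLs and names are placeholders.
import io

import requests
from llama_stack_client import Agent, LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Download a document and upload it to Llama Stack (replaces rag_tool.insert()).
doc = requests.get("https://example.com/llama_stack_docs.txt", timeout=30)
uploaded = client.files.create(
    file=("llama_stack_docs.txt", io.BytesIO(doc.content)),
    purpose="assistants",  # assumed OpenAI-compatible purpose value
)

# Create a vector store from the uploaded file (replaces VectorDB registration).
store = client.vector_stores.create(name="demo-store", file_ids=[uploaded.id])

# Attach the store via the file_search tool type.
agent = Agent(
    client,
    model="ollama/llama3.2:3b",  # placeholder model ID
    instructions="Answer using the indexed documents.",
    tools=[{"type": "file_search", "vector_store_ids": [store.id]}],
)
```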
Fix conformance test failures by explicitly defining both application/json and text/event-stream media types in the 200 responses for streaming endpoints (/chat/completions and /completions). Changes:
- Updated fastapi_routes.py to include explicit response schemas for both media types
- Regenerated OpenAPI specs with proper 200 responses
- Regenerated Stainless config

This fixes the "response-success-status-removed" conformance errors while maintaining the dynamic streaming/non-streaming behavior.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
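FastAPI lets a route declare multiple media types for one status code via the `responses` parameter, which is the mechanism this fix relies on. A minimal sketch, assuming invented request/response models and a trivial handler body (only the media-type declaration reflects the commit):

```python
# Sketch of declaring both media types on a streaming-capable route; the
# path, models, and handler body are illustrative, not the PR's code.
from fastapi import APIRouter
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

router = APIRouter()


class ChatCompletionRequest(BaseModel):
    model: str
    stream: bool = False


class ChatCompletionResponse(BaseModel):
    id: str


@router.post(
    "/v1/chat/completions",
    responses={
        200: {
            "content": {
                "application/json": {},   # non-streaming replies
                "text/event-stream": {},  # SSE streaming replies
            }
        }
    },
)
async def chat_completions(request: ChatCompletionRequest):
    if request.stream:
        async def gen():
            yield "data: {}\n\n"

        return StreamingResponse(gen(), media_type="text/event-stream")
    return ChatCompletionResponse(id="chatcmpl-demo")
```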
Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>
This pull request has merge conflicts that must be resolved before it can be merged. @r-bit-rry please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
leseb (Collaborator) reviewed Jan 28, 2026:
@r-bit-rry a lot of failures in the tests.
18f4dd0 to c9a7d39
Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Happy <yesreply@happy.engineering>
leseb (Collaborator) reviewed Jan 28, 2026 and left a comment:
looking good!
leseb approved these changes Jan 29, 2026
What does this PR do?
This PR migrates the Inference API to the FastAPI router system, building on the work done in PR #4191. This continues the effort to move away from the legacy @webmethod decorator system to explicit FastAPI routers.

Changes

Implementation Details
- llama_stack_api/inference/
- /v1alpha/inference/rerank endpoint is properly configured in the Stainless config

This represents an incremental migration of the Inference API to the router system while maintaining full backward compatibility.
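As a hedged illustration of what an explicit FastAPI router for an inference endpoint can look like, here is a sketch using the /v1alpha/inference/rerank path mentioned above; the request/response models and handler body are invented, not the PR's code.

```python
# Illustrative router-based endpoint; models and handler are placeholders.
from fastapi import APIRouter, FastAPI
from pydantic import BaseModel


class RerankRequest(BaseModel):
    model: str
    query: str
    items: list[str]


class RerankResponse(BaseModel):
    data: list[dict]


router = APIRouter(prefix="/v1alpha/inference", tags=["inference"])


@router.post("/rerank", response_model=RerankResponse)
async def rerank(request: RerankRequest) -> RerankResponse:
    # Delegation to the provider implementation would happen here.
    return RerankResponse(data=[])


app = FastAPI()
app.include_router(router)
```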
Test Plan
Co-authored-by: Gerald Trotman gtrotman@redhat.com (@JayDi11a)
This PR supersedes #4445 with a clean, rebased history.
BREAKING CHANGES: update OpenAIEmbeddingsRequestWithExtraBody to support token arrays
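A sketch of what widening the request model can look like; only the class name and the token-array intent come from the PR, and the field set and union shape are assumptions modeled on the OpenAI embeddings contract.

```python
# Hypothetical widening of the embeddings input field; fields other than
# `input` are assumptions.
from pydantic import BaseModel


class OpenAIEmbeddingsRequestWithExtraBody(BaseModel):
    model: str
    # Previously `str | list[str]`; now also accepts a token array or a
    # batch of token arrays.
    input: str | list[str] | list[int] | list[list[int]]


# Usage: a single pre-tokenized input.
req = OpenAIEmbeddingsRequestWithExtraBody(model="text-embed", input=[101, 2023, 102])
```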