@r-bit-rry (Contributor) commented Jan 28, 2026

What does this PR do?

This PR migrates the Inference API to the FastAPI router system, building on the work done in PR #4191. This continues the effort to move away from the legacy @webmethod decorator system to explicit FastAPI routers.

Changes

  • Inference API Migration: Migrated the Inference API to use FastAPI routers following the established API package structure pattern
  • SSE Streaming Support: Added SSE utilities for streaming inference endpoints (chat completions, completions)
  • OpenAPI Spec Updates: Updated OpenAPI specifications and Stainless config for the new router structure
  • Documentation Updates: Updated tutorial examples to use the new Agent API patterns

Implementation Details

  • Protocol definitions and models live in llama_stack_api/inference/
  • FastAPI router implementation follows the established pattern from other migrated APIs
  • The /v1alpha/inference/rerank endpoint is properly configured in the Stainless config
  • Explicit 200 responses added for streaming endpoints to properly document SSE behavior
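
The SSE utilities mentioned above might look roughly like this stdlib-only sketch; the helper names and the OpenAI-style `[DONE]` sentinel are assumptions for illustration, not the PR's actual code.

```python
import json
from typing import Any, AsyncIterator


def sse_event(payload: Any) -> str:
    """Format one chunk as a Server-Sent Events `data:` frame."""
    return f"data: {json.dumps(payload)}\n\n"


async def sse_stream(chunks: AsyncIterator[Any]) -> AsyncIterator[str]:
    """Wrap an async chunk iterator in SSE framing, ending with the
    OpenAI-style [DONE] sentinel."""
    async for chunk in chunks:
        yield sse_event(chunk)
    yield "data: [DONE]\n\n"
```

Each frame is a `data:` line terminated by a blank line, which is the wire format that `text/event-stream` clients parse.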

This represents an incremental migration of the Inference API to the router system while maintaining full backward compatibility.

Test Plan

  1. Verify routes are preserved:
     curl http://localhost:8321/v1/inspect/routes | jq '.data[] | select(.route | contains("inference") or contains("chat") or contains("completion") or contains("embedding"))'
  2. Run unit tests:
     uv run pytest tests/unit/core/ -v
  3. Run integration tests:
     uv run pytest tests/integration/inference/ -vv --stack-config=http://localhost:8321

Co-authored-by: Gerald Trotman gtrotman@redhat.com (@JayDi11a)

This PR supersedes #4445 with a clean, rebased history.

BREAKING CHANGE: updated OpenAIEmbeddingsRequestWithExtraBody to accept token arrays as embedding input
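
The breaking change above widens the embeddings `input` field to accept token arrays in addition to strings, following the OpenAI embeddings convention (a string, a list of strings, one token array, or a list of token arrays). A stdlib-only sketch of normalizing those shapes; the helper name is hypothetical, not the actual pydantic model:

```python
from typing import List, Union

# The four input shapes the OpenAI embeddings API accepts.
EmbeddingInput = Union[str, List[str], List[int], List[List[int]]]


def normalize_embedding_input(value: EmbeddingInput) -> list:
    """Coerce the `input` field into a list of items to embed."""
    if isinstance(value, str):
        return [value]
    # A flat list of ints is a single token array, i.e. one item.
    if value and all(isinstance(tok, int) for tok in value):
        return [value]
    return list(value)
```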

JayDi11a and others added 7 commits January 28, 2026 00:32
Update demo script to use the newer LlamaStackClient and Agent API instead
of the manual OpenAI client approach.

Changes:
- Switch from OpenAI client to LlamaStackClient
- Use Agent API for simplified RAG implementation
- Auto-select models with preference for Ollama (no API key needed)
- Reduce code complexity from ~136 to ~102 lines
- Remove manual RAG implementation in favor of agentic approach

This provides a cleaner, more modern example for users getting started
with Llama Stack.

Simplify the Ollama model selection logic in the detailed tutorial.

Changes:
- Replace complex custom_metadata filtering with simple ID check
- Use direct 'ollama' in model ID check instead of metadata lookup
- Makes code more concise and easier to understand

This aligns with the simplified approach used in the updated demo_script.py.
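
The simplified selection described above could be sketched as follows; the function name is illustrative, not the tutorial's exact code:

```python
def pick_model(model_ids: list[str]) -> str:
    """Prefer the first model whose ID mentions ollama (runs locally,
    no API key needed); fall back to the first available model."""
    for mid in model_ids:
        if "ollama" in mid:
            return mid
    return model_ids[0]
```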

Update the agent examples to use the latest API methods.

Changes:
- Simplify model selection (already applied in previous commit)
- Use response.output_text instead of response.output_message.content
- Use direct print(event) instead of event.print() for streaming

This aligns the tutorial with the current Agent API implementation.

Modernize the RAG agent example to use the latest Vector Stores API.

Changes:
- Replace deprecated VectorDB API with Vector Stores API
- Use file upload and vector_stores.create() instead of rag_tool.insert()
- Download files via requests and upload to Llama Stack
- Update to use file_search tool type with vector_store_ids
- Simplify model selection with Ollama preference
- Improve logging and user feedback
- Update event logging to handle both old and new API
- Add note about known server routing issues

This provides a more accurate example using current Llama Stack APIs.
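
The upload-then-create flow described above could be sketched as follows; `files.create` and `vector_stores.create` follow the OpenAI-compatible client surface, and the helper name and `purpose` value are assumptions for illustration, not the example's actual code:

```python
def build_vector_store(client, name: str, paths: list[str]):
    """Upload local files and attach them to a new vector store.

    Assumes an OpenAI-compatible client exposing `files.create` and
    `vector_stores.create`, replacing the deprecated rag_tool.insert().
    """
    file_ids = []
    for path in paths:
        with open(path, "rb") as fh:
            uploaded = client.files.create(file=fh, purpose="assistants")
        file_ids.append(uploaded.id)
    return client.vector_stores.create(name=name, file_ids=file_ids)
```

The returned store's ID can then be passed to the agent's `file_search` tool via `vector_store_ids`.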

Fix conformance test failures by explicitly defining both application/json
and text/event-stream media types in the 200 responses for streaming
endpoints (/chat/completions and /completions).

Changes:
- Updated fastapi_routes.py to include explicit response schemas for both media types
- Regenerated OpenAPI specs with proper 200 responses
- Regenerated Stainless config

This fixes the "response-success-status-removed" conformance errors while
maintaining the dynamic streaming/non-streaming behavior.
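
Declaring both media types on the 200 response can be done through FastAPI's `responses=` route parameter; a sketch of the mapping (the description text and constant name are illustrative, not the regenerated spec):

```python
# Hypothetical shape of the mapping passed as `responses=` to the route
# decorator; FastAPI merges the extra media types into the OpenAPI spec.
STREAMING_200 = {
    200: {
        "description": (
            "A JSON response or an SSE stream, depending on the "
            "request's `stream` flag."
        ),
        "content": {
            "application/json": {},
            "text/event-stream": {},
        },
    }
}
```

Usage would look like `@router.post("/chat/completions", responses=STREAMING_200)`, keeping the handler free to return either body at runtime.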

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 28, 2026
mergify bot commented Jan 28, 2026

This pull request has merge conflicts that must be resolved before it can be merged. @r-bit-rry please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 28, 2026
leseb commented Jan 28, 2026

@r-bit-rry a lot of failures in the tests.

r-bit-rry and others added 3 commits January 28, 2026 13:49
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
@leseb left a comment:
looking good!

@leseb leseb merged commit c921aed into llamastack:main Jan 29, 2026
45 of 46 checks passed
