Add streaming responses + litellm-sourced pricing #6

Open
drogers0 wants to merge 5 commits into main from feat/streaming-and-litellm-pricing

@drogers0 drogers0 commented Jun 5, 2026

Summary

Two features:

  1. Streaming responses: the proxy can now stream Bedrock responses over SSE. This required migrating the transport to FastAPI behind the Lambda Web Adapter (LWA) on a Lambda Function URL (invoke mode RESPONSE_STREAM), packaged as a container image. (API Gateway HTTP API v2 cannot stream, and Python Lambda needs LWA to stream.)
  2. litellm-sourced pricing: DEFAULT_PRICING is now generated from a vendored, Bedrock-filtered snapshot of litellm's community pricing data instead of the AWS pricing API (no runtime litellm dependency).

Pricing (litellm)

  • scripts/gen_pricing.py reads scripts/vendor/litellm_model_prices.json (vendored, MIT-attributed, refresh procedure documented) and converts per-token USD → integer µUSD/1k.
  • Cross-region variants derived bidirectionally for providers that actually have inference profiles (no phantom IDs).
  • Coverage-regression gate: generation hard-fails if any model ID in scripts/vendor/supported_model_ids.txt would lose explicit pricing (prevents silent fallback to the over-count tier).
  • Unknown models keep the budget-protective Opus-tier fallback — litellm's own behavior (silently $0 for unmapped models) is unsafe for a quota proxy, so it was deliberately not adopted.
  • Embedding models (billable via /invoke) retained; zero-priced (rerank) dropped. compute_cost runtime path unchanged.
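
The conversion, coverage gate, and fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/gen_pricing.py: the function names, the rounding mode, and the example model IDs and prices are all assumptions made for the example.

```python
from decimal import Decimal, ROUND_HALF_UP

MICROS_PER_USD = 1_000_000
TOKENS_PER_UNIT = 1_000


def usd_per_token_to_micros_per_1k(usd_per_token) -> int:
    """Convert a per-token USD price to integer micro-USD per 1,000 tokens."""
    # Go through str() so a float such as 3e-06 is read as the decimal it
    # prints as, rather than as its binary approximation.
    price = Decimal(str(usd_per_token))
    micros = price * MICROS_PER_USD * TOKENS_PER_UNIT
    # Rounding mode is an assumption; the PR only says the result is an integer.
    return int(micros.to_integral_value(rounding=ROUND_HALF_UP))


def check_coverage(supported_ids, pricing) -> None:
    """Hard-fail if any supported model ID would lose explicit pricing."""
    missing = sorted(set(supported_ids) - set(pricing))
    if missing:
        raise SystemExit(f"pricing coverage regression: {missing}")


def lookup_price(model_id, pricing, fallback):
    """Unknown models get the expensive fallback tier, never a zero price."""
    return pricing.get(model_id, fallback)


# Hypothetical data, for illustration only.
pricing = {"example.model-a": usd_per_token_to_micros_per_1k(3e-06)}
fallback = usd_per_token_to_micros_per_1k("0.000015")

check_coverage(["example.model-a"], pricing)  # passes silently
known = lookup_price("example.model-a", pricing, fallback)
unknown = lookup_price("example.unmapped", pricing, fallback)
```

Here a per-token price of 0.000003 USD becomes 3000 µUSD per 1k tokens, and the unmapped model is charged at the fallback rate instead of zero, which is the over-count-rather-than-under-count behavior the bullets describe.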

Streaming (transport + routes)

  • New: lambda/proxy/app.py (FastAPI), deps.py, Dockerfile (python:3.12-slim + LWA), requirements.txt. handler.py removed. /converse, /invoke, /usage ported with identical auth/quota/billing semantics, error envelope, status codes, and structured-log events.
  • New routes: POST /model/{id}/converse-stream and POST /model/{id}/invoke-with-response-stream — faithful SSE relay of Bedrock events.
    • Streams open eagerly so call-time errors return real 4xx/5xx (not 200 + error-event); mid-stream Bedrock EventStream error members (throttling, model-stream-error, model-timeout, …) become a terminal SSE error frame.
    • Usage captured from the stream tail (converse metadata; invoke message_start/message_delta, including cache tokens) and billed post-flight via the same compute_cost + write_usage path, disconnect-safe in finally.
    • Bytes-valued events are base64-encoded for JSON/SSE transport.
  • IAM adds bedrock:InvokeModelWithResponseStream; CloudWatch alarms added (Throttles/Errors/ConcurrentExecutions).
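
As a rough illustration of the streaming behavior listed above (eager open, terminal error frame, base64 for bytes, billing in finally), here is a framework-free sketch. It is not the code in lambda/proxy/app.py: the function names, the SSE frame layout, the shape of the error payload, and the bill callback are assumptions, and the error member set is only the subset of Bedrock EventStream members used for the example.

```python
import base64
import json

# Assumed subset of Bedrock EventStream error member names.
ERROR_MEMBERS = {
    "throttlingException",
    "modelStreamErrorException",
    "modelTimeoutException",
    "internalServerException",
    "validationException",
    "serviceUnavailableException",
}


def _json_default(value):
    # Bytes-valued fields are not JSON serializable, so base64-encode them.
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(value).decode("ascii")
    raise TypeError(f"not JSON serializable: {type(value).__name__}")


def sse_frame(event_name, payload) -> str:
    data = json.dumps(payload, default=_json_default)
    return f"event: {event_name}\ndata: {data}\n\n"


def relay(events, bill):
    """Relay single-member event dicts as SSE frames, billing once at the end."""
    usage = {}
    try:
        for event in events:
            ((name, body),) = event.items()
            if name in ERROR_MEMBERS:
                # A mid-stream error becomes one terminal frame, then we stop.
                yield sse_frame("error", {"type": name, "message": body.get("message", "")})
                return
            if name == "metadata":
                usage = body.get("usage", {})
            yield sse_frame(name, body)
    finally:
        # Runs on normal completion, on error, and when the consumer closes
        # the generator after a client disconnect.
        bill(usage)


def open_then_relay(open_stream, bill):
    # Not a generator: open_stream() runs immediately, so call-time errors
    # propagate before any response bytes are committed.
    events = open_stream()
    return relay(events, bill)
```

One caveat of this sketch: a generator that is closed before its first `next()` never enters the `try`, so the `finally` does not run in that case and a real implementation would need to handle it separately.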

Deployment / safe cutover (important — 2 live deployments)

The legacy API Gateway is retained behind enable_http_api (default true), so a single terraform apply is additive (APIGW stays live, Function URL added). Retire APIGW only after clients are cut over by setting enable_http_api=false. Container image requires an ECR push before apply — see the runbook in terraform/main/main.tf / terraform/main/README.md. Run per AWS profile (primary + ccc). Function URLs have no built-in throttling — set per-token limit_rps and rely on reserved concurrency.

Testing

  • ruff check + ruff format --check clean; 131 tests pass, 3 e2e skipped (run with E2E=1 against a deployed stack); terraform fmt -check + validate pass.
  • Adds a CI gate (.github/workflows/ci.yml) running ruff + pytest on push/PR.
  • New: full FastAPI parity suite (test_app.py), streaming suite (test_streaming.py, 35 tests incl. eager-open errors, cache-token billing, EventStream error members, generator-level disconnect), pricing generator + fallback tests, and e2e streaming smokes.

Notes / follow-ups

  • The streaming SSE relay is Bedrock-native (not OpenAI-shaped) — clients parse Bedrock event JSON.
  • Deployment (terraform apply + ECR push + e2e against the live stack) is left to the operator per the runbook.
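
Since the relay is Bedrock-native, a client has to split the SSE body into frames and decode each data payload as Bedrock event JSON. A minimal client-side sketch is below, assuming each frame carries an `event:` line and a single-object JSON `data:` payload; the actual frame layout emitted by the proxy is not specified here, so treat both the function name and the frame shape as assumptions.

```python
import json


def parse_sse(text):
    """Yield (event_name, payload) pairs from a complete SSE response body."""
    for block in text.split("\n\n"):
        name, data_lines = "message", []  # "message" is the SSE default event type
        for line in block.splitlines():
            if line.startswith("event:"):
                name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if data_lines:
            # Multiple data lines in one frame are joined with newlines.
            yield name, json.loads("\n".join(data_lines))


# Hypothetical response body, for illustration only.
body = (
    'event: contentBlockDelta\ndata: {"delta": {"text": "Hi"}}\n\n'
    'event: metadata\ndata: {"usage": {"outputTokens": 2}}\n\n'
)
parsed = list(parse_sse(body))
```

A production client would parse incrementally as bytes arrive rather than from a complete body; this version only shows the framing.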

drogers0 added 5 commits June 5, 2026 14:12
Replace the AWS-pricing-API generator with a vendored, Bedrock-filtered
snapshot of litellm's model_prices_and_context_window.json (no runtime
dep). gen_pricing.py converts per-token USD to integer micros/1k, derives
cross-region variants for providers that have them, and hard-fails on any
coverage regression vs a committed supported-model-id snapshot. Unknown
models keep the budget-protective Opus-tier fallback (litellm's own $0
default is unsafe for a quota proxy). compute_cost runtime path unchanged.

Replace the API Gateway HTTP API + dict-handler with a FastAPI ASGI app
served via the Lambda Web Adapter on a Function URL (RESPONSE_STREAM),
packaged as a container image. Ports /converse, /invoke, /usage with
identical auth/quota/billing semantics, error envelope, status codes, and
structured-log events. Output cap applied as one shared step per route.

The legacy HTTP API is retained behind enable_http_api (default true) so a
single apply is additive on live deployments; retire it after client
cutover by setting enable_http_api=false. Adds
bedrock:InvokeModelWithResponseStream IAM, CloudWatch alarms, and the
Phase B streaming seams.
…tream)

Relay Bedrock ConverseStream / InvokeModelWithResponseStream over SSE.
Streams open eagerly so call-time errors return real 4xx/5xx before the
response commits; mid-stream EventStream error members (incl. throttling,
model-stream-error, model-timeout) map to a terminal SSE error frame.
Usage is captured from the stream tail (converse metadata; invoke
message_start/delta incl. cache tokens) and billed post-flight via the
shared compute_cost + write_usage path, disconnect-safe in finally.
Bytes-valued events are base64-encoded for JSON/SSE transport.

Adding count to module.api (enable_http_api flag) changes resource
addresses module.api.* -> module.api[0].*. Without this moved block,
terraform plan shows the live API Gateway being destroyed and recreated
(new URL + downtime). Relocating the module instance keeps the existing
APIGW in place so an enable_http_api=true apply is truly additive.