Add streaming responses + litellm-sourced pricing by drogers0 · Pull Request #6 · smarttransit-ai/bedrock-api

drogers0 · 2026-06-05T20:07:37Z

Summary

Two features:

Streaming responses: the proxy can now stream Bedrock responses over SSE, via a transport migration to FastAPI + the Lambda Web Adapter (LWA) on a Lambda Function URL (RESPONSE_STREAM), packaged as a container image. (API Gateway HTTP API v2 cannot stream responses, and a Python Lambda needs LWA to do so.)
litellm-sourced pricing: DEFAULT_PRICING is now generated from a vendored, Bedrock-filtered snapshot of litellm's community pricing data instead of the AWS pricing API (no runtime litellm dependency).

Pricing (litellm)

scripts/gen_pricing.py reads scripts/vendor/litellm_model_prices.json (vendored, MIT-attributed, refresh procedure documented) and converts per-token USD → integer µUSD/1k.
Cross-region variants derived bidirectionally for providers that actually have inference profiles (no phantom IDs).
Coverage-regression gate: generation hard-fails if any model ID in scripts/vendor/supported_model_ids.txt would lose explicit pricing (prevents silent fallback to the over-count tier).
Unknown models keep the budget-protective Opus-tier fallback — litellm's own behavior (silently $0 for unmapped models) is unsafe for a quota proxy, so it was deliberately not adopted.
Embedding models (billable via /invoke) retained; zero-priced (rerank) dropped. compute_cost runtime path unchanged.
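
For reviewers who want to sanity-check the unit conversion and the fallback behavior described above, here is a minimal sketch. It is not the code in scripts/gen_pricing.py: the table layout, the function names, and every price below are hypothetical placeholders chosen for illustration, and rounding the final cost up is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP

TOKENS_PER_UNIT = 1000      # prices are stored per 1k tokens
MICROS_PER_USD = 1_000_000  # 1 USD = 1,000,000 µUSD


def usd_per_token_to_micros_per_1k(usd_per_token: float) -> int:
    """Convert a per-token USD price to integer µUSD per 1k tokens."""
    # Going through str() keeps 3e-06 as exactly 0.000003 instead of the
    # nearest binary float, so the integer result is not off by one.
    exact = Decimal(str(usd_per_token)) * TOKENS_PER_UNIT * MICROS_PER_USD
    return int(exact.to_integral_value(rounding=ROUND_HALF_UP))


# Hypothetical table shape and placeholder prices (µUSD per 1k tokens).
PRICING = {"example.model-v1": {"input": 3000, "output": 15000}}

# Placeholder for the deliberately expensive tier used for unknown models.
FALLBACK = {"input": 15000, "output": 75000}


def price_for(model_id: str) -> dict:
    """Unknown models get the expensive fallback, never a silent zero."""
    return PRICING.get(model_id, FALLBACK)


def estimate_cost_micros(model_id: str, input_tokens: int, output_tokens: int) -> int:
    """Integer cost in µUSD, rounded up so the quota is never under-counted."""
    price = price_for(model_id)
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return -(-total // TOKENS_PER_UNIT)  # ceiling division on integers
```

Keeping prices as integers per 1k tokens means the runtime cost path needs no floating-point arithmetic, and an unmapped model over-counts rather than under-counts.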

Streaming (transport + routes)

New: lambda/proxy/app.py (FastAPI), deps.py, Dockerfile (python:3.12-slim + LWA), requirements.txt. handler.py removed. /converse, /invoke, /usage ported with identical auth/quota/billing semantics, error envelope, status codes, and structured-log events.
New routes: POST /model/{id}/converse-stream and POST /model/{id}/invoke-with-response-stream — faithful SSE relay of Bedrock events.
- Streams open eagerly so call-time errors return real 4xx/5xx (not 200 + error-event); mid-stream Bedrock EventStream error members (throttling, model-stream-error, model-timeout, …) become a terminal SSE error frame.
- Usage captured from the stream tail (converse metadata; invoke message_start/message_delta, including cache tokens) and billed post-flight via the same compute_cost + write_usage path, disconnect-safe in finally.
- Bytes-valued events are base64-encoded for JSON/SSE transport.
IAM adds bedrock:InvokeModelWithResponseStream; CloudWatch alarms added (Throttles/Errors/ConcurrentExecutions).
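
To make the base64 and SSE framing concrete, the sketch below shows one way a Bedrock-style event dict could be turned into an SSE frame. The helper names to_jsonable and sse_frame are hypothetical and the event shape is only an example; the actual code in lambda/proxy/app.py may differ.

```python
import base64
import json
from typing import Any, Optional


def to_jsonable(value: Any) -> Any:
    """Recursively replace bytes with base64 text so json.dumps succeeds."""
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(bytes(value)).decode("ascii")
    if isinstance(value, dict):
        return {key: to_jsonable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    return value


def sse_frame(event: dict, event_name: Optional[str] = None) -> str:
    """Serialize one event as a single SSE frame terminated by a blank line."""
    data = json.dumps(to_jsonable(event), separators=(",", ":"))
    prefix = f"event: {event_name}\n" if event_name else ""
    return f"{prefix}data: {data}\n\n"


if __name__ == "__main__":
    # An invoke-style chunk whose payload arrives as raw bytes.
    print(sse_frame({"chunk": {"bytes": b"hi"}}), end="")
    # A terminal error frame for a mid-stream failure.
    print(sse_frame({"message": "throttled"}, event_name="error"), end="")
```

Compact JSON on a single data: line keeps each frame free of embedded newlines, which would otherwise need to be split across multiple data: lines.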

Deployment / safe cutover (important — 2 live deployments)

The legacy API Gateway is retained behind enable_http_api (default true), so a single terraform apply is additive: the APIGW stays live and the Function URL is added alongside it. Retire the APIGW only after clients are cut over, by setting enable_http_api=false. The container image must be pushed to ECR before apply; see the runbook in terraform/main/main.tf / terraform/main/README.md. Run the apply once per AWS profile (primary + ccc). Function URLs have no built-in throttling, so set per-token limit_rps and rely on reserved concurrency.
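
As an illustration of what a per-token limit_rps can mean in practice, the sketch below is a generic token-bucket limiter. It is an assumption about the general technique, not the proxy's implementation: the class name, the burst parameter, and the injectable clock are all hypothetical. State held in memory like this is per execution environment, so a real deployment would need shared storage or would have to accept approximate limits.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = burst          # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        """Return True and spend one token if the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=2, burst=2)
    print([bucket.allow() for _ in range(3)])
```

A request rejected by allow() would typically be answered with HTTP 429, while reserved concurrency caps the total load across all tokens.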

Testing

ruff check + ruff format --check clean; 131 tests pass, 3 e2e skipped (run with E2E=1 against a deployed stack); terraform fmt -check + validate pass.
Adds a CI gate (.github/workflows/ci.yml) running ruff + pytest on push/PR.
New: full FastAPI parity suite (test_app.py), streaming suite (test_streaming.py, 35 tests incl. eager-open errors, cache-token billing, EventStream error members, generator-level disconnect), pricing generator + fallback tests, and e2e streaming smokes.

Notes / follow-ups

The streaming SSE relay is Bedrock-native (not OpenAI-shaped) — clients parse Bedrock event JSON.
Deployment (terraform apply + ECR push + e2e against the live stack) is left to the operator per the runbook.

Replace the AWS-pricing-API generator with a vendored, Bedrock-filtered snapshot of litellm's model_prices_and_context_window.json (no runtime dep). gen_pricing.py converts per-token USD to integer micros/1k, derives cross-region variants for providers that have them, and hard-fails on any coverage regression vs a committed supported-model-id snapshot. Unknown models keep the budget-protective Opus-tier fallback (litellm's own $0 default is unsafe for a quota proxy). compute_cost runtime path unchanged.
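
The coverage-regression check described in this commit can be sketched as a set difference that aborts generation. The function name check_coverage and the argument shapes are hypothetical; the sketch only assumes that the snapshot is a list of model IDs and that the generated table is keyed by model ID.

```python
from typing import Iterable, Mapping


def check_coverage(supported_ids: Iterable[str], generated: Mapping[str, object]) -> None:
    """Abort if any previously supported model ID has no explicit price."""
    missing = sorted(set(supported_ids) - set(generated))
    if missing:
        # SystemExit with a message exits non-zero, which fails a CI step.
        raise SystemExit(
            f"pricing coverage regression: {len(missing)} model(s) "
            f"would lose explicit pricing: {', '.join(missing)}"
        )


if __name__ == "__main__":
    snapshot = ["example.model-v1", "example.model-v2"]  # placeholder IDs
    table = {"example.model-v1": {}, "example.model-v2": {}}
    check_coverage(snapshot, table)  # passes silently when nothing is missing
```

Failing at generation time means a dropped model is caught in review instead of silently moving to the fallback tier at runtime.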

Replace the API Gateway HTTP API + dict-handler with a FastAPI ASGI app served via the Lambda Web Adapter on a Function URL (RESPONSE_STREAM), packaged as a container image. Ports /converse, /invoke, /usage with identical auth/quota/billing semantics, error envelope, status codes, and structured-log events. Output cap applied as one shared step per route. The legacy HTTP API is retained behind enable_http_api (default true) so a single apply is additive on live deployments; retire it after client cutover by setting enable_http_api=false. Adds bedrock:InvokeModelWithResponseStream IAM, CloudWatch alarms, and the Phase B streaming seams.

…tream) Relay Bedrock ConverseStream / InvokeModelWithResponseStream over SSE. Streams open eagerly so call-time errors return real 4xx/5xx before the response commits; mid-stream EventStream error members (incl. throttling, model-stream-error, model-timeout) map to a terminal SSE error frame. Usage is captured from the stream tail (converse metadata; invoke message_start/delta incl. cache tokens) and billed post-flight via the shared compute_cost + write_usage path, disconnect-safe in finally. Bytes-valued events are base64-encoded for JSON/SSE transport.
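
The control flow in this commit (open eagerly, relay, stop on an error member, bill in finally) can be sketched with plain generators. This is a simplified model under stated assumptions, not the PR's code: the names open_stream, relay, and sse are hypothetical, events are plain dicts rather than a botocore EventStream, and the error member names are an illustrative subset that should be checked against the Bedrock API reference.

```python
import json
from typing import Callable, Iterable, Iterator

# Illustrative subset of mid-stream error member names (assumption).
ERROR_MEMBERS = {"throttlingException", "modelStreamErrorException",
                 "modelTimeoutException"}


def sse(data: dict, event: str = "") -> str:
    """Format one SSE frame, optionally with a named event."""
    head = f"event: {event}\n" if event else ""
    return f"{head}data: {json.dumps(data)}\n\n"


def relay(events: Iterable[dict], bill: Callable[[dict], None]) -> Iterator[str]:
    """Relay events as SSE frames and bill exactly once, even on disconnect."""
    usage: dict = {}
    try:
        for event in events:
            failed = ERROR_MEMBERS.intersection(event)
            if failed:
                # Headers are already sent, so report the failure in-band.
                yield sse({"error": sorted(failed)[0]}, event="error")
                return
            if "metadata" in event:
                usage = event["metadata"].get("usage", {})
            yield sse(event)
    finally:
        # Runs on normal completion, on the early return above, and when the
        # server closes the generator after a client disconnect.
        bill(usage)


def open_stream(start: Callable[[], Iterable[dict]],
                bill: Callable[[dict], None]) -> Iterator[str]:
    """Call upstream before building the response so failures surface early."""
    events = start()  # raises here, before any response bytes are committed
    return relay(events, bill)
```

Because start() runs outside the generator, a call-time failure propagates to the route handler, which can still choose a real 4xx/5xx status; only failures after the first frame have to use the terminal error frame.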

Adding count to module.api (enable_http_api flag) changes resource addresses module.api.* -> module.api[0].*. Without this moved block, terraform plan shows the live API Gateway being destroyed and recreated (new URL + downtime). Relocating the module instance keeps the existing APIGW in place so an enable_http_api=true apply is truly additive.

drogers0 added 5 commits June 5, 2026 14:12

ci: add GitHub Actions gate (ruff check + format + pytest)

7f6abbb

drogers0 wants to merge 5 commits into main from feat/streaming-and-litellm-pricing
