Add streaming responses + litellm-sourced pricing #6
Open
drogers0 wants to merge 5 commits into
Replace the AWS-pricing-API generator with a vendored, Bedrock-filtered snapshot of litellm's model_prices_and_context_window.json (no runtime dep). gen_pricing.py converts per-token USD to integer micros/1k, derives cross-region variants for providers that have them, and hard-fails on any coverage regression vs a committed supported-model-id snapshot. Unknown models keep the budget-protective Opus-tier fallback (litellm's own $0 default is unsafe for a quota proxy). compute_cost runtime path unchanged.
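For reviewers, the conversion and coverage check described above could look roughly like the sketch below. It is not the actual gen_pricing.py: the function names, the output shape, and the FALLBACK_PRICING numbers are placeholders chosen for illustration. Only the `input_cost_per_token` / `output_cost_per_token` keys come from litellm's JSON.

```python
from decimal import Decimal, ROUND_HALF_UP

# Placeholder fallback rates in micro-USD per 1k tokens (assumption: the real
# Opus-tier values are defined in the repository, not here).
FALLBACK_PRICING = {"input": 15_000, "output": 75_000}


def usd_per_token_to_micros_per_1k(usd_per_token):
    """Convert a per-token USD price to integer micro-USD per 1,000 tokens."""
    # Go through str() so a float such as 3e-06 becomes an exact Decimal.
    micros = Decimal(str(usd_per_token)) * 1000 * 1_000_000
    return int(micros.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def build_pricing(litellm_entries):
    """Build {model_id: rates}, dropping models whose prices are all zero."""
    pricing = {}
    for model_id, entry in litellm_entries.items():
        rates = {
            "input": usd_per_token_to_micros_per_1k(
                entry.get("input_cost_per_token", 0)
            ),
            "output": usd_per_token_to_micros_per_1k(
                entry.get("output_cost_per_token", 0)
            ),
        }
        if rates["input"] == 0 and rates["output"] == 0:
            continue  # zero-priced entries would make usage look free
        pricing[model_id] = rates
    return pricing


def check_coverage(pricing, supported_ids):
    """Hard-fail if a previously supported model id lost explicit pricing."""
    missing = sorted(set(supported_ids) - set(pricing))
    if missing:
        raise SystemExit(f"pricing coverage regression: {missing}")


def lookup(pricing, model_id):
    """Unknown models get the budget-protective fallback, never zero."""
    return pricing.get(model_id, FALLBACK_PRICING)
```

Using Decimal rather than float arithmetic keeps the generated integers stable across regenerations, which matters when the output is committed and diffed.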
Replace the API Gateway HTTP API + dict-handler with a FastAPI ASGI app served via the Lambda Web Adapter on a Function URL (RESPONSE_STREAM), packaged as a container image. Ports /converse, /invoke, /usage with identical auth/quota/billing semantics, error envelope, status codes, and structured-log events. Output cap applied as one shared step per route. The legacy HTTP API is retained behind enable_http_api (default true) so a single apply is additive on live deployments; retire it after client cutover by setting enable_http_api=false. Adds bedrock:InvokeModelWithResponseStream IAM, CloudWatch alarms, and the Phase B streaming seams.
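One way to read "output cap applied as one shared step per route" is a single helper that clamps the requested output-token limit before the request goes to Bedrock. The sketch below is an assumption about the shape of that step, not the code in app.py: `apply_output_cap` and the route labels are hypothetical, and it assumes Converse bodies carry the limit in `inferenceConfig.maxTokens` while invoke bodies use an Anthropic-style `max_tokens` field.

```python
import copy


def apply_output_cap(route, body, cap):
    """Return a copy of the request body with the output-token limit clamped.

    `route` is a hypothetical label: "converse" or "invoke".
    """
    capped = copy.deepcopy(body)  # never mutate the caller's request
    if route == "converse":
        config = capped.setdefault("inferenceConfig", {})
        # A missing limit is treated as "ask for the cap".
        config["maxTokens"] = min(config.get("maxTokens", cap), cap)
    elif route == "invoke":
        # Assumption: Anthropic-style body; other providers name this differently.
        capped["max_tokens"] = min(capped.get("max_tokens", cap), cap)
    else:
        raise ValueError(f"unknown route: {route}")
    return capped
```

Keeping the clamp in one function means the streaming and non-streaming variants of a route cannot drift apart on how the cap is enforced.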
…tream) Relay Bedrock ConverseStream / InvokeModelWithResponseStream over SSE. Streams open eagerly so call-time errors return real 4xx/5xx before the response commits; mid-stream EventStream error members (incl. throttling, model-stream-error, model-timeout) map to a terminal SSE error frame. Usage is captured from the stream tail (converse metadata; invoke message_start/delta incl. cache tokens) and billed post-flight via the shared compute_cost + write_usage path, disconnect-safe in finally. Bytes-valued events are base64-encoded for JSON/SSE transport.
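The relay behaviour described above (SSE framing, base64 for bytes-valued fields, usage taken from the stream tail, billing in finally) can be illustrated with a small generator. This is a sketch under assumptions, not the code in this PR: `sse_frame`, `relay_converse_stream`, and the `bill` callback are hypothetical names, events are modelled as plain single-member dicts, and error members are recognised here by an "Exception" suffix purely for illustration.

```python
import base64
import json


def _jsonable(value):
    """Recursively replace bytes with base64 text so the event is JSON-safe."""
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(bytes(value)).decode("ascii")
    if isinstance(value, dict):
        return {k: _jsonable(v) for k, v in value.items()}
    if isinstance(value, list):
        return [_jsonable(v) for v in value]
    return value


def sse_frame(event_name, payload):
    """Encode one Server-Sent Events frame as bytes."""
    data = json.dumps(_jsonable(payload), separators=(",", ":"))
    return f"event: {event_name}\ndata: {data}\n\n".encode("utf-8")


def relay_converse_stream(events, bill):
    """Yield SSE frames for ConverseStream-style events and bill exactly once."""
    usage = {}
    try:
        for event in events:
            # Each event has a single member, e.g. {"contentBlockDelta": {...}}.
            name, payload = next(iter(event.items()))
            if name == "metadata":
                usage = payload.get("usage", {})
            if name.endswith("Exception"):
                # Mid-stream error member: emit a terminal error frame and stop.
                message = payload.get("message", "")
                yield sse_frame("error", {"type": name, "message": message})
                return
            yield sse_frame(name, payload)
    finally:
        # Runs on normal completion, on error, and when the client disconnects
        # and the server closes the generator.
        bill(usage)
```

Because billing sits in finally, closing the generator early (for example after a client disconnect) still records whatever usage was captured up to that point.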
Adding count to module.api (enable_http_api flag) changes resource addresses module.api.* -> module.api[0].*. Without this moved block, terraform plan shows the live API Gateway being destroyed and recreated (new URL + downtime). Relocating the module instance keeps the existing APIGW in place so an enable_http_api=true apply is truly additive.
Summary
Two features:
- […] RESPONSE_STREAM), packaged as a container image. (HTTP API v2 cannot stream; Python Lambda needs LWA.)
- DEFAULT_PRICING is now generated from a vendored, Bedrock-filtered snapshot of litellm's community pricing data instead of the AWS pricing API (no runtime litellm dependency).

Pricing (litellm)
- scripts/gen_pricing.py reads scripts/vendor/litellm_model_prices.json (vendored, MIT-attributed, refresh procedure documented) and converts per-token USD → integer µUSD/1k.
- […] scripts/vendor/supported_model_ids.txt would lose explicit pricing (prevents silent fallback to the over-count tier).
- […] (/invoke) retained; zero-priced (rerank) dropped.
- compute_cost runtime path unchanged.

Streaming (transport + routes)
- lambda/proxy/app.py (FastAPI), deps.py, Dockerfile (python:3.12-slim + LWA), requirements.txt. handler.py removed.
- /converse, /invoke, /usage ported with identical auth/quota/billing semantics, error envelope, status codes, and structured-log events.
- POST /model/{id}/converse-stream and POST /model/{id}/invoke-with-response-stream: faithful SSE relay of Bedrock events.
- Usage is captured from the stream tail (converse metadata; invoke message_start/message_delta, including cache tokens) and billed post-flight via the same compute_cost + write_usage path, disconnect-safe in finally.
- […] bedrock:InvokeModelWithResponseStream; CloudWatch alarms added (Throttles/Errors/ConcurrentExecutions).

Deployment / safe cutover (important: 2 live deployments)
The legacy API Gateway is retained behind enable_http_api (default true), so a single terraform apply is additive (APIGW stays live, Function URL added). Once clients are cut over, retire APIGW by setting enable_http_api=false. The container image requires an ECR push before apply; see the runbook in terraform/main/main.tf / terraform/main/README.md. Run per AWS profile (primary + ccc). Function URLs have no built-in throttling, so set per-token limit_rps and rely on reserved concurrency.

Testing
- ruff check + ruff format --check clean; 131 tests pass, 3 e2e skipped (run with E2E=1 against a deployed stack); terraform fmt -check + validate pass.
- […] .github/workflows/ci.yml) running ruff + pytest on push/PR.
- […] test_app.py), streaming suite (test_streaming.py, 35 tests incl. eager-open errors, cache-token billing, EventStream error members, generator-level disconnect), pricing generator + fallback tests, and e2e streaming smokes.

Notes / follow-ups