config: raise vlm_api_concurrency default 1 → 16 #63
Merged
Conversation
The API-layer semaphore in routers/inference.py:_get_vlm_semaphore was introduced to protect local GPU backends from concurrent generation. It defaulted to 1, which made sense before backend-owned locking landed in PR #62. Now that each backend owns its own _lock (BaseBackend._lock = Lock() for local, RemoteHTTPBackend._lock = nullcontext() for remote), the API semaphore is no longer the serialization point:

- Local backends still serialize on their per-backend lock — a higher semaphore value just lets requests wait at the lock instead of at the HTTP handler. Observable behavior is identical.
- Remote backends use nullcontext, so the semaphore value directly controls how many HTTPS requests run in parallel against the upstream provider (e.g. DashScope).

In prod (multi-camera cortex deployment with TRIO_REMOTE_VLM_URL set), default=1 caused VLM avg latency of ~12.7s because cortex sent up to 10 concurrent describe calls but trio-core gated them back to 1. Operators had to set TRIO_VLM_API_CONCURRENCY=16 explicitly to unblock parallelism.

Raising the default to 16 makes the common remote-backend case work out of the box. Operators can still lower it via TRIO_VLM_API_CONCURRENCY if a provider rate-limits aggressively.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
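A minimal Python sketch of the lock arrangement described above; the class and method bodies are illustrative, and only BaseBackend, RemoteHTTPBackend, and their _lock attributes come from this PR and #62.

```python
import threading
from contextlib import nullcontext


class BaseBackend:
    """Local GPU backend: generation is serialized by a backend-owned lock (PR #62)."""
    _lock = threading.Lock()

    def describe(self, image: bytes) -> str:
        # Only one generation runs at a time on a local backend, no matter
        # how many requests the API-layer semaphore admits.
        with self._lock:
            return self._generate(image)

    def _generate(self, image: bytes) -> str:
        raise NotImplementedError


class RemoteHTTPBackend(BaseBackend):
    """Remote backend: no local serialization, so the API semaphore alone
    bounds how many upstream HTTPS calls run in parallel."""
    _lock = nullcontext()

    def _generate(self, image: bytes) -> str:
        # Placeholder for the HTTPS call to the upstream provider (e.g. DashScope).
        return "description"
```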
Summary
EngineConfig.vlm_api_concurrency default from 1 to 16.

Why
The API-layer semaphore in routers/inference.py:_get_vlm_semaphore was added to protect local GPU backends from concurrent generation. It defaulted to 1, which made sense before backend-owned locking landed in #62.

After #62, each backend owns its own _lock:

- BaseBackend._lock = threading.Lock() — local backends serialize generation here
- RemoteHTTPBackend._lock = nullcontext() — remote backends don't serialize

The API semaphore is no longer the serialization point:

- Local backends still serialize on their per-backend lock; a higher semaphore value just moves the wait from the HTTP handler to the lock.
- Remote backends use nullcontext, so the semaphore value directly controls how many HTTPS requests run in parallel against the upstream provider.
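A minimal sketch of the API-layer gate, assuming an asyncio.Semaphore sized from the config; the helper body and EngineConfig shape here are assumptions, with only the names _get_vlm_semaphore, vlm_api_concurrency, and TRIO_VLM_API_CONCURRENCY taken from the PR.

```python
import asyncio
import os
from dataclasses import dataclass, field


def _default_concurrency() -> int:
    # Env override still wins; the in-code default moves from 1 to 16.
    return int(os.environ.get("TRIO_VLM_API_CONCURRENCY", "16"))


@dataclass
class EngineConfig:
    vlm_api_concurrency: int = field(default_factory=_default_concurrency)


_vlm_semaphore: asyncio.Semaphore | None = None


def _get_vlm_semaphore(config: EngineConfig) -> asyncio.Semaphore:
    # Lazily create one shared semaphore; with a remote backend this value is
    # the effective cap on parallel upstream requests.
    global _vlm_semaphore
    if _vlm_semaphore is None:
        _vlm_semaphore = asyncio.Semaphore(config.vlm_api_concurrency)
    return _vlm_semaphore
```

A handler would then wrap the backend call in `async with _get_vlm_semaphore(config):`, so with a remote backend up to 16 describe requests can be in flight before the next one waits.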
Prod impact
In the cortex deployment with TRIO_REMOTE_VLM_URL set (DashScope), default=1 caused VLM avg latency of ~12.7s because cortex sends up to 10 concurrent describe calls but trio-core gated them back to 1. Operators had to set TRIO_VLM_API_CONCURRENCY=16 explicitly to activate parallelism.

Raising the default makes the common remote-backend case work out of the box. Operators can still lower it via TRIO_VLM_API_CONCURRENCY if a remote provider rate-limits.

Test plan
- Verify in restart logs (grep _vlm_semaphore) that the new default is picked up, and that VLM avg latency drops from 12.7s → ~1–3s.

🤖 Generated with Claude Code