Context\nDuring the 2026-05-09 GCP production rebuild and Cloudflare cutover, infrastructure was healthy but chat initially hung because both requested NVIDIA DeepSeek v4 models timed out from the production VM:\n\n- deepseek-ai/deepseek-v4-flash: chat completions timed out after 35-45s with no bytes\n- deepseek-ai/deepseek-v4-pro: chat completions timed out after 35s with no bytes\n\nThe NVIDIA API itself was reachable:\n\n- /v1/models: 200 in ~0.2s\n- meta/llama-3.1-8b-instruct: simple chat 200 in ~1.8-3.4s\n- qwen/qwen3-next-80b-a3b-instruct: tool-call probe 200 in ~1.35s and production smoke passed 12/12 after hotfix\n\nProduction was temporarily hotfixed to qwen/qwen3-next-80b-a3b-instruct so Wiii is alive while staying on NVIDIA provider.\n\n## Required work\n- Add model-level health probes for configured NVIDIA models at startup and periodically.\n- Mark timed-out models degraded and remove them from routing until recovery.\n- Support same-provider fallback order, e.g. DeepSeek Flash/Pro -> Qwen/Nemotron fallback.\n- Surface model health/fallback telemetry in
untime_latency, logs, and status endpoints without leaking API keys.\n- Update deploy smoke so model fallback behavior is visible and actionable.\n\n## Acceptance criteria\n- A broken configured model cannot stall normal chat for 70-170s.\n- Production chat chooses a healthy NVIDIA model automatically.\n- Visual/tool smoke continues to pass when the preferred DeepSeek model is degraded.\n- Operators can see which model is degraded and which fallback is active.\n\n## Risk\nHigh-runtime-impact area: provider routing, model selection, timeout behavior, streaming finalization. Implement in narrow PRs with targeted tests.
Context\nDuring the 2026-05-09 GCP production rebuild and Cloudflare cutover, infrastructure was healthy but chat initially hung because both requested NVIDIA DeepSeek v4 models timed out from the production VM:\n\n- deepseek-ai/deepseek-v4-flash: chat completions timed out after 35-45s with no bytes\n- deepseek-ai/deepseek-v4-pro: chat completions timed out after 35s with no bytes\n\nThe NVIDIA API itself was reachable:\n\n- /v1/models: 200 in ~0.2s\n- meta/llama-3.1-8b-instruct: simple chat 200 in ~1.8-3.4s\n- qwen/qwen3-next-80b-a3b-instruct: tool-call probe 200 in ~1.35s and production smoke passed 12/12 after hotfix\n\nProduction was temporarily hotfixed to qwen/qwen3-next-80b-a3b-instruct so Wiii is alive while staying on NVIDIA provider.\n\n## Required work\n- Add model-level health probes for configured NVIDIA models at startup and periodically.\n- Mark timed-out models degraded and remove them from routing until recovery.\n- Support same-provider fallback order, e.g. DeepSeek Flash/Pro -> Qwen/Nemotron fallback.\n- Surface model health/fallback telemetry in
untime_latency, logs, and status endpoints without leaking API keys.\n- Update deploy smoke so model fallback behavior is visible and actionable.\n\n## Acceptance criteria\n- A broken configured model cannot stall normal chat for 70-170s.\n- Production chat chooses a healthy NVIDIA model automatically.\n- Visual/tool smoke continues to pass when the preferred DeepSeek model is degraded.\n- Operators can see which model is degraded and which fallback is active.\n\n## Risk\nHigh-runtime-impact area: provider routing, model selection, timeout behavior, streaming finalization. Implement in narrow PRs with targeted tests.