runtime: add NVIDIA model health probes and automatic fallback #253

@meiiie

## Context

During the 2026-05-09 GCP production rebuild and Cloudflare cutover, infrastructure was healthy but chat initially hung because both requested NVIDIA DeepSeek v4 models timed out from the production VM:

- deepseek-ai/deepseek-v4-flash: chat completions timed out after 35-45s with no bytes
- deepseek-ai/deepseek-v4-pro: chat completions timed out after 35s with no bytes

The NVIDIA API itself was reachable:

- /v1/models: 200 in ~0.2s
- meta/llama-3.1-8b-instruct: simple chat 200 in ~1.8-3.4s
- qwen/qwen3-next-80b-a3b-instruct: tool-call probe 200 in ~1.35s and production smoke passed 12/12 after hotfix

Production was temporarily hotfixed to qwen/qwen3-next-80b-a3b-instruct so that Wiii stays alive while remaining on the NVIDIA provider.

## Required work

- Add model-level health probes for configured NVIDIA models at startup and periodically.
- Mark timed-out models degraded and remove them from routing until recovery.
- Support same-provider fallback order, e.g. DeepSeek Flash/Pro -> Qwen/Nemotron fallback.
- Surface model health/fallback telemetry in `runtime_latency`, logs, and status endpoints without leaking API keys.
- Update deploy smoke so model fallback behavior is visible and actionable.

## Acceptance criteria

- A broken configured model cannot stall normal chat for 70-170s.
- Production chat chooses a healthy NVIDIA model automatically.
- Visual/tool smoke continues to pass when the preferred DeepSeek model is degraded.
- Operators can see which model is degraded and which fallback is active.

## Risk

High-runtime-impact area: provider routing, model selection, timeout behavior, streaming finalization. Implement in narrow PRs with targeted tests.
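
## Illustrative sketch

The probe, degrade, and fallback items under "Required work" could be prototyped roughly as follows. This is a minimal sketch, not the project's implementation: `ModelRouter`, `ModelHealth`, `run_probes`, `choose_model`, `status`, and the injected `probe` callable are all hypothetical names, and the 10s probe timeout is an assumed value. The probe is injected so that the real NVIDIA client call (not shown) can be swapped for a fake in tests. Only the exception type name is recorded, so that error text which might contain request headers or keys never reaches telemetry.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ModelHealth:
    """Last known probe result for one model (hypothetical structure)."""
    healthy: bool = True
    last_error: Optional[str] = None      # exception type name only, no message
    last_latency_s: Optional[float] = None
    checked_at: Optional[float] = None


@dataclass
class ModelRouter:
    # Ordered by preference: the first healthy entry wins.
    fallback_order: List[str]
    # Hypothetical probe: sends a tiny chat request and raises on timeout/error.
    probe: Callable[[str, float], None]
    probe_timeout_s: float = 10.0         # assumed value, tune per deployment
    health: Dict[str, ModelHealth] = field(default_factory=dict)

    def run_probes(self) -> None:
        """Probe every configured model; call at startup and on a timer."""
        for model in self.fallback_order:
            started = time.monotonic()
            state = ModelHealth(checked_at=time.time())
            try:
                self.probe(model, self.probe_timeout_s)
                state.last_latency_s = time.monotonic() - started
            except Exception as exc:
                # Mark degraded; a later successful probe restores the model.
                state.healthy = False
                state.last_error = type(exc).__name__
            self.health[model] = state

    def choose_model(self) -> Optional[str]:
        """Return the first model not marked degraded, or None if all are."""
        for model in self.fallback_order:
            # Assumption: a model that has never been probed counts as healthy.
            if self.health.get(model, ModelHealth()).healthy:
                return model
        return None

    def status(self) -> dict:
        """Key-free summary suitable for a status endpoint or log line."""
        active = self.choose_model()
        return {
            "active_model": active,
            "fallback_active": active is not None
            and active != self.fallback_order[0],
            "degraded": [
                m for m in self.fallback_order
                if m in self.health and not self.health[m].healthy
            ],
        }


def fake_probe(model: str, timeout_s: float) -> None:
    """Stand-in for a real probe: simulates the DeepSeek timeouts above."""
    if model.startswith("deepseek-ai/"):
        raise TimeoutError(f"no bytes after {timeout_s}s")


router = ModelRouter(
    fallback_order=[
        "deepseek-ai/deepseek-v4-flash",
        "deepseek-ai/deepseek-v4-pro",
        "qwen/qwen3-next-80b-a3b-instruct",
    ],
    probe=fake_probe,
)
router.run_probes()
print(router.choose_model())  # qwen/qwen3-next-80b-a3b-instruct
```

With the simulated timeouts, `router.status()` reports both DeepSeek models as degraded and `fallback_active` as true, which is the kind of operator-visible signal the acceptance criteria ask for. A real implementation would also need to enforce the probe timeout inside the client call and re-run `run_probes` periodically so that recovered models return to routing.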
