fix(extraction): cap concurrent LLM calls to stop aegis-burst timeout cascade by arshadansari27 · Pull Request #86 · arshadansari27/knowledge-service

arshadansari27 · 2026-05-26T13:23:43Z

Summary

Aegis intelligence dumps the daily arxiv batch (30–50 summaries) at KS in seconds. Every job's worker calls qwen3:14b at once; Ollama on asif serves ~2–4 in parallel; the tail queues inside Ollama past the 600s read timeout. Retries (KS ×2 + LiteLLM ×2) re-enter the same queue → cascade → entire batch yields 0 triples. Prod signature: 11+ ReadTimeouts clustered in ~2s about 10 minutes after a burst.
Fix: ExtractionClient holds an asyncio.Semaphore around the LLM POST, sized via settings.extraction_max_concurrent (default 4, env EXTRACTION_MAX_CONCURRENT). The semaphore wraps just the request, not the retry backoff, so a failing call doesn't hog a slot during its exponential sleep. This shifts the queue from Ollama-side (where read_timeout fires) to KS-side (where it doesn't).
Distinct from PRs #73 (fix(models,llm): stop silently dropping ~9% of qwen3 extractions) and #74 (fix(models): recover the remaining 83% of qwen3 extraction rejections), i.e. the schema-rejection saga, and from earlier prod-data-quality fixes: those addressed extraction content loss. This addresses extraction timeout loss.
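For readers without the diff open, here is a minimal sketch of the pattern described above: the semaphore is held only for the duration of the request and is released before the retry backoff sleeps. This is not the PR's actual code. `CappedClient`, `_send`, the `fail_times` payload key, and the delay values are hypothetical stand-ins for illustration; the real class is `ExtractionClient` and the real cap comes from `settings.extraction_max_concurrent`.

```python
import asyncio


class CappedClient:
    """Hypothetical stand-in for ExtractionClient (illustration only)."""

    def __init__(self, max_concurrent: int = 4, max_retries: int = 2,
                 base_delay: float = 0.01) -> None:
        # Construct inside a running event loop to stay safe on older Pythons.
        self._sem = asyncio.Semaphore(max_concurrent)
        self._max_retries = max_retries
        self._base_delay = base_delay
        self.in_flight = 0
        self.peak_in_flight = 0

    async def _send(self, payload: dict) -> dict:
        # Placeholder for the real LLM POST; fails `fail_times` times, then succeeds.
        await asyncio.sleep(0.01)
        if payload.get("fail_times", 0) > 0:
            payload["fail_times"] -= 1
            raise TimeoutError("simulated read timeout")
        return {"ok": True}

    async def post_chat(self, payload: dict) -> dict:
        for attempt in range(self._max_retries + 1):
            try:
                # The slot is held only while the request is in flight.
                async with self._sem:
                    self.in_flight += 1
                    self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
                    try:
                        return await self._send(payload)
                    finally:
                        self.in_flight -= 1
            except TimeoutError:
                if attempt == self._max_retries:
                    raise
                # The semaphore is already released here, so the backoff
                # sleep does not block other callers from using the slot.
                await asyncio.sleep(self._base_delay * 2 ** attempt)
        raise RuntimeError("unreachable")


async def demo() -> CappedClient:
    client = CappedClient(max_concurrent=3)
    payloads = [{"fail_times": 1 if i % 4 == 0 else 0} for i in range(12)]
    await asyncio.gather(*(client.post_chat(p) for p in payloads))
    return client
```

Putting `async with` inside the retry loop, rather than around it, is what keeps a failing call from occupying a slot during its exponential sleep.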

What this is not

Not raising the read timeout (would just push the cliff out and slow the cascade, not stop it).
Not touching aegis. Aegis's sequential awaits look like a burst to KS because each call returns 202 immediately; the right fix is downstream concurrency control.
Not changing LiteLLM num_retries: 2. That's a multiplier on top of this cap and worth dropping later, but a separate change in homelab-gitops.

Reproduction

Local repro mirroring KS's exact httpx.Timeout(connect=5, read=600, write=10, pool=5) against prod LiteLLM, 30 concurrent realistic extraction prompts:

wall=600.1s OK=28 fail=2
OK lat min=22.3 med=313.5 max=593.2
FAIL: (10, 'ReadTimeout', 600.1, "ReadTimeout('')")
FAIL: (28, 'ReadTimeout', 600.1, "ReadTimeout('')")

Median per-call latency at burst = 313s (>5 min). 2 calls hit the 10-min boundary. Zero PoolTimeout — it's pure queueing on qwen3.
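A back-of-the-envelope model of why the tail crosses the read timeout, and why moving the queue to the KS side helps. This is a simplification that assumes every call takes the same fixed service time, which the measured latencies (22 s to 593 s) show is not true in practice. The function names and the example inputs below are illustrative assumptions, not measurements from the repro.

```python
import math


def uncapped_tail_seconds(n_calls: int, parallel: int, service_s: float) -> float:
    """Time the last call spends inside the backend (queueing plus service)
    when all n_calls are sent at once and `parallel` are served at a time."""
    waves = math.ceil(n_calls / parallel)
    return waves * service_s


def capped_tail_seconds(cap: int, parallel: int, service_s: float) -> float:
    """Worst-case time a single request spends inside the backend when the
    client never has more than `cap` requests outstanding. The rest of the
    wait happens client-side, where the read timeout is not yet ticking."""
    waves = math.ceil(cap / parallel)
    return waves * service_s


READ_TIMEOUT_S = 600.0  # matches httpx.Timeout(read=600) in the repro

# Illustrative inputs: 50 calls, 2 served in parallel, 30 s per call.
uncapped = uncapped_tail_seconds(50, 2, 30.0)   # 25 waves * 30 s = 750.0
capped = capped_tail_seconds(4, 2, 30.0)        # 2 waves * 30 s = 60.0
```

Under these assumed numbers the uncapped tail exceeds the 600 s read timeout while the capped per-request time stays far below it; the total wall time for the batch is unchanged, only where the waiting happens.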

Test plan

New unit test TestConcurrencyCap::test_inflight_never_exceeds_cap — fires 12 concurrent _post_chat calls at a slow stub and asserts max-in-flight ≤ cap. Fails on main, passes here.
TestConcurrencyCap::test_default_cap_is_set — guards against accidentally removing the semaphore.
pytest tests/ -v — 703 passed.
ruff check . — clean. ruff format --check . — clean.
Post-deploy: watch aegis_knowledge logs after the next aegis intelligence batch; expect the burst tail to take ~2 min instead of timing out, with no "LLM API request timed out" clusters.
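The shape of the max-in-flight check in the test plan can be sketched with the standard library alone. This is not the repository's test: `measure_peak` and `slow_stub` are hypothetical names, and the stub simply sleeps instead of calling `_post_chat`. Running it without a cap shows why such a test would fail on main.

```python
import asyncio
from typing import Optional


async def measure_peak(n_calls: int, cap: Optional[int]) -> int:
    """Fire n_calls at a slow stub and return the peak number in flight.
    cap=None models the behaviour without a semaphore."""
    sem = asyncio.Semaphore(cap) if cap is not None else None
    in_flight = 0
    peak = 0

    async def slow_stub() -> None:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.02)  # stands in for a slow LLM response
        in_flight -= 1

    async def call() -> None:
        if sem is None:
            await slow_stub()
        else:
            async with sem:
                await slow_stub()

    await asyncio.gather(*(call() for _ in range(n_calls)))
    return peak


capped_peak = asyncio.run(measure_peak(12, cap=4))
uncapped_peak = asyncio.run(measure_peak(12, cap=None))
```

With 12 concurrent calls, the capped run never exceeds 4 in flight, while the uncapped run reaches all 12 at once.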

🤖 Generated with Claude Code

… cascade When aegis intelligence (worker/.../intelligence.py) pushes its daily arxiv batch (~30–50 summaries in seconds), every ingestion job hits qwen3:14b at once. Ollama on asif serves ~2–4 in parallel, so the tail queues inside Ollama past the 600s read timeout, retries (KS ×2, LiteLLM ×2) snowball back into the same queue, and the whole class falls to 0 triples. Prod signature: 11+ ReadTimeouts clustered in ~2s about 10 min after a burst. Fix: ExtractionClient now holds an asyncio.Semaphore around the LLM POST, sized via settings.extraction_max_concurrent (default 4, env EXTRACTION_MAX_CONCURRENT). The semaphore wraps just the request, not the retry backoff, so a failing call doesn't hog a slot. This moves the queue from Ollama-side (where read_timeout fires) to KS-side (where it doesn't). Repro mirrored prod: 30 concurrent realistic prompts via KS's exact httpx config → median 313s, max 593s, 2 ReadTimeouts at 600.1s, zero PoolTimeouts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

arshadansari27 merged commit 119a5a2 into main May 26, 2026
5 checks passed

arshadansari27 deleted the worktree-fix-extraction-concurrency-cascade branch May 26, 2026 14:20

arshadansari27 merged 1 commit into main from worktree-fix-extraction-concurrency-cascade
