fix(extraction): cap concurrent LLM calls to stop aegis-burst timeout cascade#86

Merged
arshadansari27 merged 1 commit into main from worktree-fix-extraction-concurrency-cascade
May 26, 2026

@arshadansari27
Owner

Summary

  • Aegis intelligence dumps the daily arxiv batch (30–50 summaries) at KS in seconds. Every job's worker calls qwen3:14b at once; Ollama on asif serves ~2–4 in parallel; the tail queues inside Ollama past the 600s read timeout. Retries (KS ×2 + LiteLLM ×2) re-enter the same queue → cascade → entire batch yields 0 triples. Prod signature: 11+ ReadTimeouts clustered in ~2s about 10 minutes after a burst.
  • Fix: ExtractionClient holds an asyncio.Semaphore around the LLM POST, sized via settings.extraction_max_concurrent (default 4, env EXTRACTION_MAX_CONCURRENT). The semaphore wraps just the request, not the retry backoff, so a failing call doesn't hog a slot during its exponential sleep. This shifts the queue from Ollama-side (where read_timeout fires) to KS-side (where it doesn't).
  • Distinct from PRs #73 (fix(models,llm): stop silently dropping ~9% of qwen3 extractions) and #74 (fix(models): recover the remaining 83% of qwen3 extraction rejections), i.e. the schema-rejection saga, and from earlier prod-data-quality fixes. Those addressed extraction content loss. This addresses extraction timeout loss.
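
The "semaphore wraps just the request, not the retry backoff" point can be sketched as below. This is a minimal illustration, not the code in this PR: CappedClient, post_chat, the send callable, and the retry/backoff parameters are hypothetical stand-ins, and only the env name EXTRACTION_MAX_CONCURRENT and the default of 4 come from the description above.

```python
import asyncio
import os


class CappedClient:
    """Hypothetical stand-in for ExtractionClient (not the real class)."""

    def __init__(self, send, max_concurrent=None, retries=2, base_delay=0.01):
        if max_concurrent is None:
            # Env name and default taken from the PR text; parsing is assumed.
            max_concurrent = int(os.environ.get("EXTRACTION_MAX_CONCURRENT", "4"))
        self._send = send  # async callable that performs one LLM POST
        self._sem = asyncio.Semaphore(max_concurrent)
        self._retries = retries
        self._base_delay = base_delay

    async def post_chat(self, payload):
        for attempt in range(self._retries + 1):
            try:
                # The slot is held only while the request is in flight.
                async with self._sem:
                    return await self._send(payload)
            except TimeoutError:
                if attempt == self._retries:
                    raise
            # The slot has been released here, so the exponential backoff
            # does not block other callers waiting on the semaphore.
            await asyncio.sleep(self._base_delay * 2 ** attempt)
```

Placing the sleep outside the `async with` block is the design choice the bullet describes: a call that is backing off holds no slot, so healthy calls keep flowing.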

What this is not

  • Not raising the read timeout (would just push the cliff out and slow the cascade, not stop it).
  • Not touching aegis. Aegis's sequential awaits look like a burst to KS because each call returns 202 immediately; the right fix is downstream concurrency control.
  • Not changing LiteLLM num_retries: 2. That's a multiplier on top of this cap and worth dropping later, but a separate change in homelab-gitops.

Reproduction

Local repro mirroring KS's exact httpx.Timeout(connect=5, read=600, write=10, pool=5) against prod LiteLLM, 30 concurrent realistic extraction prompts:

wall=600.1s OK=28 fail=2
OK lat min=22.3 med=313.5 max=593.2
FAIL: (10, 'ReadTimeout', 600.1, "ReadTimeout('')")
FAIL: (28, 'ReadTimeout', 600.1, "ReadTimeout('')")

Median per-call latency at burst = 313s (>5 min). 2 calls hit the 10-min boundary. Zero PoolTimeout — it's pure queueing on qwen3.
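
Why a client-side cap helps can be illustrated with a toy model, sketched here under loud assumptions: a fixed service time, a fixed number of backend slots standing in for Ollama's parallelism, no retries, and times scaled down from seconds to milliseconds. run_burst and all of its parameters are hypothetical; nothing here talks to LiteLLM or Ollama. The point is only that the read timeout starts ticking once a request is sent, so time spent queued behind the backend counts against it, while time spent waiting on a client-side gate does not.

```python
import asyncio


async def run_burst(n_jobs, backend_slots, service_s, read_timeout_s, client_cap=None):
    """Simulate one burst; return (ok_count, timed_out_count)."""
    backend = asyncio.Semaphore(backend_slots)  # stand-in for backend parallelism
    gate = asyncio.Semaphore(client_cap) if client_cap is not None else None

    async def backend_call():
        # Queueing here happens after the request was "sent",
        # so it counts against the read timeout.
        async with backend:
            await asyncio.sleep(service_s)

    async def send():
        await asyncio.wait_for(backend_call(), timeout=read_timeout_s)

    async def job():
        try:
            if gate is None:
                await send()
            else:
                # Waiting on the gate happens before the request is sent,
                # so no timeout clock is running yet.
                async with gate:
                    await send()
            return True
        except asyncio.TimeoutError:
            return False

    results = await asyncio.gather(*(job() for _ in range(n_jobs)))
    ok = sum(results)
    return ok, n_jobs - ok
```

For example, `asyncio.run(run_burst(12, 2, 0.05, 0.175))` leaves the tail of the burst timing out, while adding `client_cap=2` lets every job finish, at the cost of the jobs waiting their turn on the client side.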

Test plan

  • New unit test TestConcurrencyCap::test_inflight_never_exceeds_cap — fires 12 concurrent _post_chat calls at a slow stub and asserts max-in-flight ≤ cap. Fails on main, passes here.
  • TestConcurrencyCap::test_default_cap_is_set — guards against accidentally removing the semaphore.
  • pytest tests/ -v — 703 passed.
  • ruff check . — clean. ruff format --check . — clean.
  • Post-deploy: watch aegis_knowledge logs after the next aegis intelligence batch; expect the burst tail to take ~2 min instead of timing out, with no clusters of "LLM API request timed out" errors.
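
The max-in-flight check described in the first test bullet could look roughly like the sketch below. It is written in the spirit of test_inflight_never_exceeds_cap, not copied from it: StubClient and measure_peak_inflight are hypothetical, and the real test exercises ExtractionClient._post_chat against a slow stub rather than this stand-in.

```python
import asyncio


class StubClient:
    """Hypothetical stand-in: a semaphore around one slow fake LLM call."""

    def __init__(self, cap):
        self._sem = asyncio.Semaphore(cap)
        self.inflight = 0
        self.peak = 0

    async def _post_chat(self, payload):
        async with self._sem:
            # Count calls that are inside the capped section right now.
            self.inflight += 1
            self.peak = max(self.peak, self.inflight)
            try:
                await asyncio.sleep(0.01)  # slow stub in place of the LLM POST
                return payload
            finally:
                self.inflight -= 1


async def measure_peak_inflight(cap, n_calls=12):
    """Fire n_calls concurrent calls and report the highest in-flight count."""
    client = StubClient(cap)
    await asyncio.gather(*(client._post_chat(i) for i in range(n_calls)))
    return client.peak
```

A test would then assert that `asyncio.run(measure_peak_inflight(4))` never exceeds 4. Without the semaphore, all 12 calls would be in flight at once, which is why the equivalent test fails on main.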

🤖 Generated with Claude Code

… cascade

When aegis intelligence (worker/.../intelligence.py) pushes its daily arxiv
batch (~30–50 summaries in seconds), every ingestion job hits qwen3:14b at
once. Ollama on asif serves ~2–4 in parallel, so the tail queues inside
Ollama past the 600s read timeout, retries (KS ×2, LiteLLM ×2) snowball
back into the same queue, and the whole class falls to 0 triples. Prod
signature: 11+ ReadTimeouts clustered in ~2s about 10 min after a burst.

Fix: ExtractionClient now holds an asyncio.Semaphore around the LLM POST,
sized via settings.extraction_max_concurrent (default 4, env
EXTRACTION_MAX_CONCURRENT). The semaphore wraps just the request, not the
retry backoff, so a failing call doesn't hog a slot. This moves the queue
from Ollama-side (where read_timeout fires) to KS-side (where it doesn't).

Repro mirrored prod: 30 concurrent realistic prompts via KS's exact httpx
config → median 313s, max 593s, 2 ReadTimeouts at 600.1s, zero PoolTimeouts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@arshadansari27 arshadansari27 merged commit 119a5a2 into main May 26, 2026
5 checks passed
@arshadansari27 arshadansari27 deleted the worktree-fix-extraction-concurrency-cascade branch May 26, 2026 14:20