fix(writer): explicit timeout + retries on every LLM call #19
Open
Chen17-sq wants to merge 1 commit into Einsia:main from
Conversation
``writer/llm.call_llm`` previously called ``litellm.completion(**kwargs)`` with no ``timeout`` and no ``num_retries``. Production litellm versions default the request to "wait forever", so a stuck connection (slow provider, partial response, dead TCP socket) blocks the reducer / classifier daemon thread *forever* — the daemon stays "alive", but no durable facts get written and there is no log entry to alert the user. Combined with the recent orphan-session recovery work, this is the remaining hang point that prevents end-to-end self-healing.

This adds two module-level defaults in ``writer/llm.py``:

* ``DEFAULT_TIMEOUT_SECONDS = 120`` — generous for a long session reduce without being absurd; per-stage overridable via ``[models.<stage>] timeout_seconds = N`` for slow local models.
* ``DEFAULT_NUM_RETRIES = 2`` — 3 total attempts; litellm uses tenacity-style backoff and respects ``Retry-After`` headers, so this absorbs a brief 429 / 502 blip without piling on latency.

Two new fields on ``ModelConfig`` (``timeout_seconds`` and ``num_retries``, both ``int | None = None``) carry the per-stage override. ``None`` means "use module default", so an existing config with no mention of these fields parses cleanly and gets the safe defaults — no breakage for users on older configs. The default config template gains commented-out lines documenting both knobs, so first-run installs see the option even before they need it.

Tests cover both layers:

* ``tests/test_writer_llm.py`` — patches ``litellm.completion`` and asserts the kwargs forwarded to it carry timeout/retries correctly: defaults applied with no override, per-stage override honored, sibling stages don't bleed, and the ``OPENCHRONICLE_LLM_MOCK`` short-circuit still bypasses litellm entirely.
* ``tests/test_config.py`` — three regression tests for the TOML loading path: inheritance from ``[models.default]`` to a stage, per-stage override winning over default, and old configs without the fields parsing as ``None`` (the "use module default" sentinel).

76/76 pass; ruff clean.

Worst-case bounded behavior: a fully unresponsive provider keeps the reducer thread blocked for ``timeout * (1 + num_retries)`` = 120 × 3 = 360 s before the exception propagates up to the reducer's outer ``except``, which marks the session ``failed`` and lets the safety-net retry tick handle it. Far better than the unbounded wait today.
Code Review
This pull request introduces configurable timeouts and retry logic for LLM calls to prevent indefinite blocking of daemon threads. It adds timeout_seconds and num_retries fields to the ModelConfig dataclass, establishes module-level defaults (120 seconds and 2 retries, respectively), and ensures these parameters are correctly passed to the litellm library. Comprehensive tests were added to verify configuration inheritance, overrides, and correct parameter forwarding. I have no feedback to provide.
Relationship to other PRs
#18 (orphan-session recovery) handles sessions left marked ``active`` after a hard crash; this PR ensures the eventual reduce of those sessions can't hang on a slow LLM provider. Together: complete self-healing. Mergeable in any order.
Out of scope

* Validating or clamping extreme values such as ``timeout_seconds = 0`` or ``num_retries = 100`` — explicit user choices are passed through; clamping would silently override what the user wrote.

Test plan
* ``uv run pytest`` — 76/76 pass
* ``uv run ruff check src/ tests/test_writer_llm.py tests/test_config.py`` — clean

New tests cover both layers.

``tests/test_writer_llm.py``:

* ``test_call_llm_passes_default_timeout_and_retries`` — patches ``litellm.completion``, asserts module defaults are forwarded.
* ``test_call_llm_per_stage_timeout_override`` — stage-level override wins.
* ``test_call_llm_other_stages_keep_defaults_when_one_is_overridden`` — sibling stages don't bleed.
* ``test_call_llm_mock_path_bypasses_litellm`` — ``OPENCHRONICLE_LLM_MOCK=1`` still short-circuits before litellm import.

``tests/test_config.py``:

* ``test_timeout_and_retries_inherit_from_default`` — TOML inheritance from ``[models.default]`` to a stage section.
* ``test_timeout_per_stage_overrides_default`` — per-stage override wins over default.
* ``test_missing_timeout_defaults_to_none`` — old configs without the fields parse as ``None``, triggering the "use module default" path in the wrapper.
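The kwargs-forwarding tests can be sketched as follows. This is a self-contained approximation: the fake ``litellm`` module and the simplified ``call_llm`` stand-in exist only so the sketch runs anywhere; the real tests patch the installed library as imported by ``writer/llm``.

```python
import sys
import types
from unittest import mock

# Fake litellm module so the sketch runs without the real dependency.
fake_litellm = types.ModuleType("litellm")
fake_litellm.completion = mock.MagicMock(return_value=mock.MagicMock())
sys.modules["litellm"] = fake_litellm

DEFAULT_TIMEOUT_SECONDS = 120
DEFAULT_NUM_RETRIES = 2


def call_llm(model, messages, timeout_seconds=None, num_retries=None):
    # Simplified stand-in for writer/llm.call_llm: explicit bounds, always.
    import litellm

    return litellm.completion(
        model=model,
        messages=messages,
        timeout=timeout_seconds
        if timeout_seconds is not None
        else DEFAULT_TIMEOUT_SECONDS,
        num_retries=num_retries
        if num_retries is not None
        else DEFAULT_NUM_RETRIES,
    )


def test_call_llm_passes_default_timeout_and_retries():
    fake_litellm.completion.reset_mock()
    call_llm("test-model", [{"role": "user", "content": "hi"}])
    kwargs = fake_litellm.completion.call_args.kwargs
    assert kwargs["timeout"] == DEFAULT_TIMEOUT_SECONDS
    assert kwargs["num_retries"] == DEFAULT_NUM_RETRIES


def test_call_llm_per_stage_timeout_override():
    fake_litellm.completion.reset_mock()
    call_llm("test-model", [], timeout_seconds=600)
    kwargs = fake_litellm.completion.call_args.kwargs
    assert kwargs["timeout"] == 600  # per-stage override wins
    assert kwargs["num_retries"] == DEFAULT_NUM_RETRIES  # sibling default intact


test_call_llm_passes_default_timeout_and_retries()
test_call_llm_per_stage_timeout_override()
```

The same pattern extends to the mock-path test: set the environment flag, assert ``fake_litellm.completion`` was never called.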
tests/test_writer_llm.py:test_call_llm_passes_default_timeout_and_retries— patcheslitellm.completion, asserts module defaults are forwarded.test_call_llm_per_stage_timeout_override— stage-level override wins.test_call_llm_other_stages_keep_defaults_when_one_is_overridden— sibling stages don't bleed.test_call_llm_mock_path_bypasses_litellm—OPENCHRONICLE_LLM_MOCK=1still short-circuits before litellm import.tests/test_config.py:test_timeout_and_retries_inherit_from_default— TOML inheritance from[models.default]to a stage section.test_timeout_per_stage_overrides_default— per-stage override wins over default.test_missing_timeout_defaults_to_none— old configs without the fields parse asNone, triggering the "use module default" path in the wrapper.