
fix(writer): explicit timeout + retries on every LLM call#19

Open
Chen17-sq wants to merge 1 commit into Einsia:main from Chen17-sq:fix/llm-timeout-and-retries

Conversation

@Chen17-sq Contributor

Summary — silent reducer / classifier hang

writer/llm.call_llm calls litellm.completion(**kwargs) with no timeout and no num_retries. Production litellm versions default the request to "wait forever", so a stuck connection (slow provider, partial response, dead TCP socket) blocks the reducer or classifier daemon thread indefinitely. The daemon stays "alive" but no durable facts get written and nothing in the logs alerts the user. Combined with the orphan-session recovery in #18, this is the last hang point that prevents end-to-end self-healing.

Fix

Two module-level defaults in writer/llm.py:

  • DEFAULT_TIMEOUT_SECONDS = 120 — generous for a long session reduce without being absurd; reasoning-model / slow local-model users can raise it per stage.
  • DEFAULT_NUM_RETRIES = 2 — 3 total attempts. litellm uses tenacity-style backoff and respects Retry-After headers, so this absorbs a brief 429 / 502 blip without adding noticeable latency.

Two new fields on ModelConfig: timeout_seconds: int | None = None and num_retries: int | None = None. None means "use module default", so existing configs that don't mention these parse cleanly and get the safe defaults — no breakage for upgrades.

The default config template gains commented-out lines documenting both knobs:

[models.default]
# timeout_seconds = 120 # per-call timeout; raise this if you run a slow local model
# num_retries = 2       # automatic retries on transient errors (429, network blips)

Worst-case behavior

A fully unresponsive provider keeps the reducer thread blocked for timeout * (1 + num_retries) = 360 s before the exception propagates to the reducer's outer except, which marks the session failed and lets the safety-net retry tick handle it. Far better than today's unbounded wait.
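The bound follows directly from the defaults; a minimal sketch of the arithmetic:

```python
DEFAULT_TIMEOUT_SECONDS = 120
DEFAULT_NUM_RETRIES = 2

# One initial attempt plus num_retries retries, each bounded by the
# per-call timeout (ignoring any backoff sleep between attempts).
worst_case_s = DEFAULT_TIMEOUT_SECONDS * (1 + DEFAULT_NUM_RETRIES)
assert worst_case_s == 360
```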

Relationship to other PRs

Mergeable in any order.

Out of scope

  • Streaming / partial-response timeouts — OpenChronicle uses non-streaming completions, so request timeout is total response time. Streaming would need separate inactivity-timeout handling.
  • Sanity-clamping timeout_seconds = 0 or num_retries = 100 — explicit user choices are passed through; clamping would silently override what they wrote.

Test plan

  • uv run pytest — 76/76 pass
  • uv run ruff check src/ tests/test_writer_llm.py tests/test_config.py — clean

New tests cover both layers:

tests/test_writer_llm.py:

  • test_call_llm_passes_default_timeout_and_retries — patches litellm.completion, asserts module defaults are forwarded.
  • test_call_llm_per_stage_timeout_override — stage-level override wins.
  • test_call_llm_other_stages_keep_defaults_when_one_is_overridden — sibling stages don't bleed.
  • test_call_llm_mock_path_bypasses_litellm — OPENCHRONICLE_LLM_MOCK=1 still short-circuits before the litellm import.
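The forwarding assertions can be sketched with unittest.mock against a stand-in completion function (this call_llm is a simplified stand-in for the real wrapper, not the PR's implementation):

```python
from unittest import mock

DEFAULT_TIMEOUT_SECONDS = 120
DEFAULT_NUM_RETRIES = 2

def call_llm(completion_fn, *, timeout=None, num_retries=None, **kwargs):
    # Simplified stand-in: apply module defaults, then forward everything.
    return completion_fn(
        timeout=timeout if timeout is not None else DEFAULT_TIMEOUT_SECONDS,
        num_retries=num_retries if num_retries is not None else DEFAULT_NUM_RETRIES,
        **kwargs,
    )

completion = mock.Mock(return_value={"choices": []})
call_llm(completion, model="gpt-4o-mini", messages=[])

completion.assert_called_once()
assert completion.call_args.kwargs["timeout"] == 120
assert completion.call_args.kwargs["num_retries"] == 2
```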

tests/test_config.py:

  • test_timeout_and_retries_inherit_from_default — TOML inheritance from [models.default] to a stage section.
  • test_timeout_per_stage_overrides_default — per-stage override wins over default.
  • test_missing_timeout_defaults_to_none — old configs without the fields parse as None, triggering the "use module default" path in the wrapper.


@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces configurable timeouts and retry logic for LLM calls to prevent indefinite blocking of daemon threads. It adds timeout_seconds and num_retries fields to the ModelConfig dataclass, establishes module-level defaults (120 seconds and 2 retries, respectively), and ensures these parameters are correctly passed to the litellm library. Comprehensive tests were added to verify configuration inheritance, overrides, and correct parameter forwarding. I have no feedback to provide.
