
fix(writer): explicit timeout + retries on every LLM call#19

Open
Chen17-sq wants to merge 1 commit into Einsia:main from Chen17-sq:fix/llm-timeout-and-retries

Conversation

@Chen17-sq Contributor

Summary — silent reducer / classifier hang

writer/llm.call_llm calls litellm.completion(**kwargs) with no timeout and no num_retries. Production litellm versions default the request to "wait forever", so a stuck connection (slow provider, partial response, dead TCP socket) blocks the reducer or classifier daemon thread indefinitely. The daemon stays "alive" but no durable facts get written and nothing in the logs alerts the user. Combined with the orphan-session recovery in #18, this is the last hang point that prevents end-to-end self-healing.

Fix

Two module-level defaults in writer/llm.py:

  • DEFAULT_TIMEOUT_SECONDS = 120 — generous for a long session reduce without being absurd; reasoning-model / slow local-model users can raise it per stage.
  • DEFAULT_NUM_RETRIES = 2 — 3 total attempts. litellm uses tenacity-style backoff and respects Retry-After headers, so this absorbs a brief 429 / 502 blip without adding noticeable latency.

Two new fields on ModelConfig: timeout_seconds: int | None = None and num_retries: int | None = None. None means "use module default", so existing configs that don't mention these parse cleanly and get the safe defaults — no breakage for upgrades.

The default config template gains commented-out lines documenting both knobs:

[models.default]
# timeout_seconds = 120 # per-call timeout; raise this if you run a slow local model
# num_retries = 2       # automatic retries on transient errors (429, network blips)

Worst-case behavior

A fully unresponsive provider keeps the reducer thread blocked for timeout * (1 + num_retries) = 360 s before the exception propagates to the reducer's outer except, which marks the session failed and lets the safety-net retry tick handle it. Far better than today's unbounded wait.
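The bound follows directly from the defaults; a minimal sketch of the arithmetic:

```python
DEFAULT_TIMEOUT_SECONDS = 120
DEFAULT_NUM_RETRIES = 2

# One initial attempt plus num_retries retries, each bounded by the
# per-call timeout (ignoring any backoff sleep between attempts).
worst_case_s = DEFAULT_TIMEOUT_SECONDS * (1 + DEFAULT_NUM_RETRIES)
assert worst_case_s == 360
```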

Relationship to other PRs

Mergeable in any order.

Out of scope

  • Streaming / partial-response timeouts — OpenChronicle uses non-streaming completions, so request timeout is total response time. Streaming would need separate inactivity-timeout handling.
  • Sanity-clamping timeout_seconds = 0 or num_retries = 100 — explicit user choices are passed through; clamping would silently override what they wrote.

Test plan

  • uv run pytest — 76/76 pass
  • uv run ruff check src/ tests/test_writer_llm.py tests/test_config.py — clean

New tests cover both layers:

tests/test_writer_llm.py:

  • test_call_llm_passes_default_timeout_and_retries — patches litellm.completion, asserts module defaults are forwarded.
  • test_call_llm_per_stage_timeout_override — stage-level override wins.
  • test_call_llm_other_stages_keep_defaults_when_one_is_overridden — sibling stages don't bleed.
  • test_call_llm_mock_path_bypasses_litellm — OPENCHRONICLE_LLM_MOCK=1 still short-circuits before the litellm import.
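The forwarding assertions can be sketched with unittest.mock against a stand-in completion function (this call_llm is a simplified stand-in for the real wrapper, not the PR's implementation):

```python
from unittest import mock

DEFAULT_TIMEOUT_SECONDS = 120
DEFAULT_NUM_RETRIES = 2

def call_llm(completion_fn, *, timeout=None, num_retries=None, **kwargs):
    # Simplified stand-in: apply module defaults, then forward everything.
    return completion_fn(
        timeout=timeout if timeout is not None else DEFAULT_TIMEOUT_SECONDS,
        num_retries=num_retries if num_retries is not None else DEFAULT_NUM_RETRIES,
        **kwargs,
    )

completion = mock.Mock(return_value={"choices": []})
call_llm(completion, model="gpt-4o-mini", messages=[])

completion.assert_called_once()
assert completion.call_args.kwargs["timeout"] == 120
assert completion.call_args.kwargs["num_retries"] == 2
```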

tests/test_config.py:

  • test_timeout_and_retries_inherit_from_default — TOML inheritance from [models.default] to a stage section.
  • test_timeout_per_stage_overrides_default — per-stage override wins over default.
  • test_missing_timeout_defaults_to_none — old configs without the fields parse as None, triggering the "use module default" path in the wrapper.


@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces configurable timeouts and retry logic for LLM calls to prevent indefinite blocking of daemon threads. It adds timeout_seconds and num_retries fields to the ModelConfig dataclass, establishes module-level defaults (120 seconds and 2 retries, respectively), and ensures these parameters are correctly passed to the litellm library. Comprehensive tests were added to verify configuration inheritance, overrides, and correct parameter forwarding. I have no feedback to provide.
