
fix(llm): add timeout and transient failure handling to Ollama calls #54

@SeanClay10

Description

Problem

Both chat() calls in src/llm/llm_client.py (initial extraction at line 128, retry at line 187) have no timeout and no error handling around the network call itself. The current retry logic only handles null fields in a successful response — it does not handle connection errors, server errors, or hung requests.

This means:

  • If the Ollama server becomes unresponsive mid-run, the pipeline hangs indefinitely with no output and no way to detect the hang short of watching the terminal
  • In multiprocessing mode (classify_extract.py --workers N), a single hung worker blocks its slot in the pool for the rest of the run
  • A transient 503 or connection reset kills the entire job for that file rather than retrying

Tasks

  • Wrap both chat() calls in a try/except that catches connection errors, timeouts, and malformed response errors
  • Add a configurable timeout to each chat() call (e.g. --llm-timeout CLI flag, defaulting to a reasonable value like 120s)
  • Implement retry with exponential backoff for transient failures (network errors, 5xx responses), separate from the existing null-field retry
  • Log the raw response content on failure to aid debugging
  • Ensure a hung or failed worker in multiprocessing mode produces a clear error row in the summary CSV rather than silently disappearing
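The timeout-plus-backoff tasks above could be sketched roughly as below. This is a minimal illustration, not the repo's code: `chat_with_retry` and its parameters are hypothetical names, the caller is expected to pass a closure over the actual `chat()` call, and the exception tuple would need to be extended with whatever transport errors the Ollama client actually raises.

```python
import logging
import random
import time

log = logging.getLogger(__name__)

# Extend with the client library's own transport exceptions as needed.
TRANSIENT_EXCEPTIONS = (ConnectionError, TimeoutError)


def chat_with_retry(chat_fn, *, max_attempts=4, base_delay=1.0, max_delay=30.0):
    """Call chat_fn(), retrying transient failures with exponential backoff.

    chat_fn is a zero-argument closure over the real chat() call (with its
    timeout already set). Non-transient errors propagate immediately; the
    last transient error is re-raised once attempts are exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return chat_fn()
        except TRANSIENT_EXCEPTIONS as exc:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter: 1s, 2s, 4s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay *= 0.5 + random.random()
            log.warning("chat attempt %d/%d failed (%s); retrying in %.1fs",
                        attempt, max_attempts, exc, delay)
            time.sleep(delay)
```

On failure, the `except` branch is also the natural place to log the raw response content when the error carries one (e.g. a 5xx body), per the logging task above.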

Context

src/llm/llm_client.py:128 and :187 are the two call sites. The existing null-field retry at line 148 is unrelated — it re-prompts on a successful response that returned nulls, not on a failed call.
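To make the separation concrete, the two retry layers could be structured as below. This is a hypothetical sketch, not the existing code: `run_extraction`, `call_llm`, and `parse_fields` are illustrative names. The point is that the null-field re-prompt operates on a successful response, while transport retries belong inside `call_llm` itself.

```python
def run_extraction(call_llm, parse_fields, max_null_retries=1):
    """Two-layer retry structure (illustrative).

    call_llm: zero-argument callable that performs the chat() call, already
              wrapped with a timeout and transport-level backoff retries.
    parse_fields: extracts a dict of fields from a successful response.
    """
    fields = parse_fields(call_llm())
    for _ in range(max_null_retries):
        if all(v is not None for v in fields.values()):
            break
        # Inner layer: the call succeeded but returned nulls, so re-prompt.
        fields = parse_fields(call_llm())
    return fields
```

With this shape, the existing null-field retry at line 148 stays untouched; only the transport around each `call_llm` invocation changes.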
