
v2.6.0: harden reasoning-model LLM budget plumbing (post-#124 follow-ups) #125

@rolandpg

Description


Follow-up scope for v2.6.0 from Nexus's review of PR #124 (v2.5.2 hotfix). The hotfix shipped sufficient defaults to make synthesis + causal extraction work end-to-end on qwen3.5:9b, but four hardening items deserve a clean PR rather than the inline approach v2.5.2 took.

1. Regression tests for max_tokens at each call site

Add a test that asserts each LLM call site uses a budget ≥ a known threshold for its prompt class. CI has no LLM available, so the test should snapshot the literal value passed to generate(...) rather than make a network call. This catches accidental downward edits during refactors.

Affected call sites:

  • note_constructor.py causal triples (8000)
  • synthesis_generator.py (2500)
  • fact_extractor.py (2500)
  • entity_indexer.py NER + retry (2500)
  • memory_evolver.py (2500 × 2)
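One way to snapshot the budget without a live LLM is to stub the client and inspect the kwargs it receives. A minimal sketch follows; the call-site function, client shape, and threshold table are assumptions for illustration, not the repo's actual layout (a real test would import e.g. note_constructor and patch its client):

```python
from unittest.mock import MagicMock

# Minimal stand-in for one call site; in the real suite the test would
# import the module under test and patch its LLM client instead.
def extract_causal_triples(text, llm):
    return llm.generate(prompt=f"Extract causal triples from:\n{text}",
                        max_tokens=8000)

# Known thresholds per prompt class, mirroring the list above.
MIN_BUDGETS = {"causal_triples": 8000, "synthesis": 2500}

def assert_budget(call_site, prompt_class, *args):
    llm = MagicMock()
    llm.generate.return_value = "[]"   # no network call in CI
    call_site(*args, llm=llm)
    _, kwargs = llm.generate.call_args
    assert kwargs["max_tokens"] >= MIN_BUDGETS[prompt_class], (
        f"{prompt_class}: budget {kwargs['max_tokens']} fell below "
        f"{MIN_BUDGETS[prompt_class]}"
    )

assert_budget(extract_causal_triples, "causal_triples", "note text")
```

Because the assertion is `>=` rather than `==`, operators can raise budgets via config without breaking CI; only downward drift fails.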

2. Config-overridable budgets per call site

Today the values are hardcoded literals in each module. An LLMConfig.max_tokens field (with optional per-call overrides such as max_tokens_causal and max_tokens_synthesis) would let operators on faster hardware drop to 2500/2500 across the board, or operators on slower models bump to 12000+. Read the default from config and fall back to the literal.
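The resolution order (per-call override → global default → hardcoded literal) could look like this; field and method names follow the suggestions above but are hypothetical, not existing code:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical config shape; none of these fields exist in the repo yet.
@dataclass
class LLMConfig:
    max_tokens: Optional[int] = None            # global default, unset by default
    max_tokens_causal: Optional[int] = None     # per-call overrides
    max_tokens_synthesis: Optional[int] = None

    def budget(self, call_site: str, fallback: int) -> int:
        """Per-call override wins, then the global default, then the
        hardcoded literal the module currently ships with."""
        override = getattr(self, f"max_tokens_{call_site}", None)
        if override is not None:
            return override
        if self.max_tokens is not None:
            return self.max_tokens
        return fallback
```

A call site would then replace its literal with something like `cfg.budget("causal", 8000)`, preserving today's behavior when nothing is configured.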

3. <think> tag stripping as post-processing guard

Some Ollama versions surface <think>...</think> tokens in the response field instead of hiding them. When that happens, extract_json may trip on the prose preamble. A small strip_thinking_tags(raw) helper in json_parse.py that removes any <think>...</think> block before the regex JSON match would harden the path against model and Ollama upgrades.
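A sketch of such a helper, assuming it runs on the raw response string before extract_json; the second pattern is a defensive assumption covering responses truncated mid-thought:

```python
import re

# Closed <think>...</think> blocks, possibly multiple, possibly multiline.
_THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
# A truncated response can leave an unterminated <think> with no close tag.
_OPEN_THINK_RE = re.compile(r"<think>.*\Z", re.DOTALL)

def strip_thinking_tags(raw: str) -> str:
    """Remove reasoning-model thinking blocks before JSON extraction."""
    cleaned = _THINK_RE.sub("", raw)
    cleaned = _OPEN_THINK_RE.sub("", cleaned)
    return cleaned.strip()
```

Since the helper is a no-op on responses without thinking tags, it is safe to call unconditionally on every parse path.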

4. reasoning_model: bool config flag auto-scaling

A boolean (or string with model-family heuristics) that, when set, auto-scales:

  • timeout to ≥180s
  • all max_tokens defaults to the higher tier (8000 causal / 2500 elsewhere)

This avoids operators having to remember every knob individually. Off by default for backward compatibility.
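The flag's effect can be sketched as a profile applied over whatever the operator configured; the function name, dict-based config, and tier values here are illustrative assumptions (real code would hang this off LLMConfig):

```python
# Higher-tier floors from PR #124: 8000 for causal extraction, 2500 elsewhere.
HIGH_TIER = {"max_tokens_causal": 8000, "max_tokens_default": 2500}

def apply_reasoning_profile(cfg: dict) -> dict:
    """Scale timeout and token budgets up when reasoning_model is set.

    Uses max() throughout so an operator's explicit higher values are
    never lowered; with the flag unset, the config passes through untouched.
    """
    if not cfg.get("reasoning_model"):
        return cfg                      # off by default: no behavior change
    scaled = dict(cfg)
    scaled["timeout"] = max(cfg.get("timeout", 60), 180)
    for key, floor in HIGH_TIER.items():
        scaled[key] = max(cfg.get(key, 0), floor)
    return scaled
```

Using floors rather than assignments means the flag composes cleanly with the per-call overrides from item 2.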

Operational note from PR #124

Causal extraction wall-time is 60–140s per call on a 9B-Q4_K_M reasoning model, so remember(sync=True) blocks 1–3 minutes per note. The default async enrichment queue is unaffected; only operators triggering sync ingest or bulk-load see this.

🤖 Generated from Nexus's review of PR #124

Labels: enhancement (New feature or request)