Follow-up scope for v2.6.0 from Nexus's review of PR #124 (v2.5.2 hotfix). The hotfix shipped sufficient defaults to make synthesis + causal extraction work end-to-end on qwen3.5:9b, but four hardening items deserve a clean PR rather than the inline approach v2.5.2 took.
1. Regression tests for `max_tokens` at each call site
Add a test that asserts each LLM call site uses a budget ≥ a known threshold for the prompt class. CI doesn't have an LLM, so the test should snapshot the literal value passed to `generate(...)` rather than make a network call. Catches accidental downward edits during refactors.
Affected call sites:
- `note_constructor.py` causal triples (8000)
- `synthesis_generator.py` (2500)
- `fact_extractor.py` (2500)
- `entity_indexer.py` NER + retry (2500)
- `memory_evolver.py` (2500 × 2)
2. Config-overridable budgets per call site
Today the values are hardcoded literals in each module. An `LLMConfig.max_tokens` field (with optional per-call overrides like `max_tokens_causal`, `max_tokens_synthesis`) would let operators on faster hardware drop to 2500/2500 across the board, or operators on slower models bump to 12000+. Read the default from config, fall back to the literal.
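A minimal sketch of the resolution order, assuming the field names above; the `resolve` helper is illustrative, not existing API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LLMConfig:
    max_tokens: Optional[int] = None           # global override
    max_tokens_causal: Optional[int] = None    # per-call overrides
    max_tokens_synthesis: Optional[int] = None

    def resolve(self, call: str, fallback: int) -> int:
        """Per-call override wins, then the global value, then the hardcoded literal."""
        per_call = getattr(self, f"max_tokens_{call}", None)
        return per_call if per_call is not None else (self.max_tokens or fallback)
```

A call site would then read `cfg.resolve("causal", 8000)`, keeping today's literal as the last-resort default.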
3. `<think>` tag stripping as post-processing guard
Some Ollama versions surface `<think>...</think>` tokens in the response field (vs. hiding them). When that happens, `extract_json` may trip on the prose preamble. A small `strip_thinking_tags(raw)` helper in `json_parse.py` that removes any `<think>...</think>` block before the regex JSON match would harden the path against model/Ollama upgrades.
4. `reasoning_model: bool` config flag auto-scaling
A boolean (or string with model-family heuristics) that, when set, auto-scales:
- `timeout` to ≥180s
- all `max_tokens` defaults to the higher tier (8000/2500)

Avoids operators having to remember every knob individually. Off by default for backward compat.
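The scaling step could be sketched like this (field names and tier values follow the text; the `apply_reasoning_defaults` helper is hypothetical):

```python
def apply_reasoning_defaults(cfg: dict) -> dict:
    """If reasoning_model is set, raise timeout and token budgets in one step."""
    if not cfg.get("reasoning_model", False):   # off by default for backward compat
        return cfg
    scaled = dict(cfg)
    scaled["timeout"] = max(cfg.get("timeout", 0), 180)  # >= 180 s
    # setdefault: explicit per-call overrides still win over the flag.
    scaled.setdefault("max_tokens_causal", 8000)   # higher tier
    scaled.setdefault("max_tokens_default", 2500)
    return scaled
```

Using `setdefault` keeps the precedence simple: an operator who sets both the flag and an explicit budget gets their explicit value.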
Operational note from PR #124
Causal extraction wall-time is 60–140s per call on a 9B-Q4_K_M reasoning model, so `remember(sync=True)` blocks 1–3 minutes per note. The default async enrichment queue is unaffected; only operators triggering sync ingest or bulk-load see this.
🤖 Generated from Nexus's review of PR #124