
v2.6.0: harden reasoning-model LLM budget plumbing (post-#124 follow-ups) #125

@rolandpg

Description


Follow-up scope for v2.6.0 from Nexus's review of PR #124 (v2.5.2 hotfix). The hotfix shipped sufficient defaults to make synthesis + causal extraction work end-to-end on qwen3.5:9b, but four hardening items deserve a clean PR rather than the inline approach v2.5.2 took.

1. Regression tests for max_tokens at each call site

Add a test that asserts each LLM call site uses a budget ≥ a known threshold for its prompt class. CI has no LLM available, so the test should snapshot the literal value passed to generate(...) rather than make a network call. This catches accidental downward edits during refactors.

Affected call sites:

  • note_constructor.py causal triples (8000)
  • synthesis_generator.py (2500)
  • fact_extractor.py (2500)
  • entity_indexer.py NER + retry (2500)
  • memory_evolver.py (2500 × 2)
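One way to snapshot the budget without a live LLM is to stub the client and inspect the kwargs it receives. A minimal sketch follows; the call-site function, client shape, and threshold table are assumptions for illustration, not the repo's actual layout (a real test would import e.g. note_constructor and patch its client):

```python
from unittest.mock import MagicMock

# Minimal stand-in for one call site; in the real suite the test would
# import the module under test and patch its LLM client instead.
def extract_causal_triples(text, llm):
    return llm.generate(prompt=f"Extract causal triples from:\n{text}",
                        max_tokens=8000)

# Known thresholds per prompt class, mirroring the list above.
MIN_BUDGETS = {"causal_triples": 8000, "synthesis": 2500}

def assert_budget(call_site, prompt_class, *args):
    llm = MagicMock()
    llm.generate.return_value = "[]"   # no network call in CI
    call_site(*args, llm=llm)
    _, kwargs = llm.generate.call_args
    assert kwargs["max_tokens"] >= MIN_BUDGETS[prompt_class], (
        f"{prompt_class}: budget {kwargs['max_tokens']} fell below "
        f"{MIN_BUDGETS[prompt_class]}"
    )

assert_budget(extract_causal_triples, "causal_triples", "note text")
```

Because the assertion is `>=` rather than `==`, operators can raise budgets via config without breaking CI; only downward drift fails.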

2. Config-overridable budgets per call site

Today the values are hardcoded literals in each module. An LLMConfig.max_tokens field (with optional per-call overrides such as max_tokens_causal and max_tokens_synthesis) would let operators on faster hardware drop to 2500/2500 across the board, or operators on slower models bump to 12000+. Read the default from config and fall back to the literal.
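The resolution order (per-call override → global default → hardcoded literal) could look like this; field and method names follow the suggestions above but are hypothetical, not existing code:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical config shape; none of these fields exist in the repo yet.
@dataclass
class LLMConfig:
    max_tokens: Optional[int] = None            # global default, unset by default
    max_tokens_causal: Optional[int] = None     # per-call overrides
    max_tokens_synthesis: Optional[int] = None

    def budget(self, call_site: str, fallback: int) -> int:
        """Per-call override wins, then the global default, then the
        hardcoded literal the module currently ships with."""
        override = getattr(self, f"max_tokens_{call_site}", None)
        if override is not None:
            return override
        if self.max_tokens is not None:
            return self.max_tokens
        return fallback
```

A call site would then replace its literal with something like `cfg.budget("causal", 8000)`, preserving today's behavior when nothing is configured.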

3. <think> tag stripping as post-processing guard

Some Ollama versions surface <think>...</think> tokens in the response field instead of hiding them. When that happens, extract_json may trip on the prose preamble. A small strip_thinking_tags(raw) helper in json_parse.py that removes any <think>...</think> block before the regex JSON match would harden the path against model and Ollama upgrades.
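A sketch of such a helper, assuming it runs on the raw response string before extract_json; the second pattern is a defensive assumption covering responses truncated mid-thought:

```python
import re

# Closed <think>...</think> blocks, possibly multiple, possibly multiline.
_THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
# A truncated response can leave an unterminated <think> with no close tag.
_OPEN_THINK_RE = re.compile(r"<think>.*\Z", re.DOTALL)

def strip_thinking_tags(raw: str) -> str:
    """Remove reasoning-model thinking blocks before JSON extraction."""
    cleaned = _THINK_RE.sub("", raw)
    cleaned = _OPEN_THINK_RE.sub("", cleaned)
    return cleaned.strip()
```

Since the helper is a no-op on responses without thinking tags, it is safe to call unconditionally on every parse path.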

4. reasoning_model: bool config flag auto-scaling

A boolean (or string with model-family heuristics) that, when set, auto-scales:

  • timeout to ≥180s
  • all max_tokens defaults to the higher tier (8000 causal / 2500 elsewhere)

This avoids operators having to remember every knob individually. Off by default for backward compatibility.
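The flag's effect can be sketched as a profile applied over whatever the operator configured; the function name, dict-based config, and tier values here are illustrative assumptions (real code would hang this off LLMConfig):

```python
# Higher-tier floors from PR #124: 8000 for causal extraction, 2500 elsewhere.
HIGH_TIER = {"max_tokens_causal": 8000, "max_tokens_default": 2500}

def apply_reasoning_profile(cfg: dict) -> dict:
    """Scale timeout and token budgets up when reasoning_model is set.

    Uses max() throughout so an operator's explicit higher values are
    never lowered; with the flag unset, the config passes through untouched.
    """
    if not cfg.get("reasoning_model"):
        return cfg                      # off by default: no behavior change
    scaled = dict(cfg)
    scaled["timeout"] = max(cfg.get("timeout", 60), 180)
    for key, floor in HIGH_TIER.items():
        scaled[key] = max(cfg.get(key, 0), floor)
    return scaled
```

Using floors rather than assignments means the flag composes cleanly with the per-call overrides from item 2.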

Operational note from PR #124

Causal extraction wall-time is 60–140s per call on a 9B-Q4_K_M reasoning model, so remember(sync=True) blocks 1–3 minutes per note. The default async enrichment queue is unaffected; only operators triggering sync ingest or bulk-load see this.

🤖 Generated from Nexus's review of PR #124

Labels: enhancement (New feature or request)