fix: P0 items — LLM budget plumbing, bulk ingest, CCCS regex hardening by rolandpg · Pull Request #146 · rolandpg/zettelforge

rolandpg · 2026-04-28T04:44:47Z

Implements all three P0 items for v2.7.0.

Issue #125 — LLM budget plumbing

Config-overridable max_tokens per call site (causal/synthesis/extraction/ner/evolve)
strip_thinking_tags() in json_parse.py to remove `` blocks
reasoning_model: bool auto-scaling flag (doubles budgets, >=180s timeout)
Env var + config.yaml support for all new fields
Regression tests at tests/test_max_tokens_budgets.py

Issue #72 — Bulk ingest bottleneck

bulk=True path on yara/ingest and sigma/ingest (sync=False + flush())
Cached Plyara singleton in yara/parser.py (grammar compiled once)
MemoryManager.flush() method for batch enrichment drain

Issue #73 — CCCS regex hardening (SEC-6/SEC-7)

_regex_match uses fullmatch() to prevent multiline injection
Author regex tightened to safe printable ASCII
Negative test cases in tests/test_cccs_metadata.py

Testing

62 config/json/token budget tests passed
63 YARA/Sigma tests passed
Basic + core tests passed (structlog messages are warnings, not failures)
Closes v2.6.0: harden reasoning-model LLM budget plumbing (post-#124 follow-ups) #125, closes MemoryManager.remember(sync=True) dominates bulk ingest; YARA p95 plyara tail #72, closes Tighten CCCS metadata regexes (SEC-6 / SEC-7) #73

…x hardening Issue #125 — Harden reasoning-model LLM budget plumbing: - Config-overridable max_tokens per call site (max_tokens_causal/synthesis/extraction/ner/evolve) - strip_thinking_tags() in json_parse.py to remove thinking... blocks - reasoning_model: bool auto-scaling flag - Env var + config.yaml support for all new fields - Regression tests at tests/test_max_tokens_budgets.py Issue #72 — remember(sync=True) bulk ingest bottleneck: - bulk=True path on yara/ingest and sigma/ingest (sync=False + flush()) - Cached Plyara singleton in yara/parser.py (grammar compiled once) - MemoryManager.flush() method for batch enrichment drain Issue #73 — CCCS regex hardening (SEC-6/SEC-7): - _regex_match uses fullmatch() to prevent multiline injection - Author regex tightened to safe printable ASCII - Negative test cases in tests/test_cccs_metadata.py All existing tests pass (125+). Closes #125, closes #72, closes #73.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b8cbc50d78

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copilot

Pull request overview

Implements the v2.7.0 P0 follow-ups across LLM budget configuration, bulk ingest performance, and CCCS metadata regex hardening.

Changes:

Adds per-call-site max_tokens_* config knobs plus reasoning_model auto-scaling and <think>/<thinking> stripping during JSON extraction.
Speeds up bulk YARA/Sigma ingest via bulk=True (async enrichment + a final MemoryManager.flush()), and caches Plyara to avoid repeated grammar compilation.
Hardens CCCS metadata validation by tightening the author regex and switching regex checks to fullmatch() to prevent multiline-injection bypasses.

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 16 comments.

Show a summary per file

File	Description
tests/test_vector_memory.py	Adds VectorMemory tests (currently inconsistent with the implementation’s API).
tests/test_max_tokens_budgets.py	Adds regression tests for budget defaults/env overrides and think-tag stripping.
tests/test_cccs_metadata.py	Adds negative/positive tests for CCCS regex hardening.
src/zettelforge/yara/parser.py	Introduces cached `Plyara` instance and clears per parse.
src/zettelforge/yara/ingest.py	Adds `bulk` option to defer enrichment and flush once at end.
src/zettelforge/yara/cccs_metadata.py	Tightens `_AUTHOR_REGEX` and uses `fullmatch()` in `_regex_match`.
src/zettelforge/synthesis_generator.py	Plumbs `cfg.llm.max_tokens_synthesis` into `generate()`.
src/zettelforge/sigma/ingest.py	Adds `sync` plumb-through and `bulk` mode with optional flush.
src/zettelforge/note_constructor.py	Plumbs `cfg.llm.max_tokens_causal` into `generate()`.
src/zettelforge/memory_manager.py	Adds `flush()` for synchronous enrichment draining after async ingest.
src/zettelforge/memory_evolver.py	Plumbs `cfg.llm.max_tokens_evolve` into `generate()` (incl. retry).
src/zettelforge/json_parse.py	Adds `strip_thinking_tags()` and applies it before fence stripping.
src/zettelforge/fact_extractor.py	Plumbs `cfg.llm.max_tokens_extraction` into `generate()`.
src/zettelforge/entity_indexer.py	Plumbs `cfg.llm.max_tokens_ner` into `generate()` (incl. retry).
src/zettelforge/config.py	Adds new LLM budget fields, env overrides, and reasoning-model scaling.
config.default.yaml	Documents new env vars and YAML config keys for budgets/scaling.
src/zettelforge/scripts/init.py	Part of the PR file set (no diff shown in provided context).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-28T04:51:25Z

+    def test_note_constructor_causal_budget(self):
+        """note_constructor.extract_causal_triples: max_tokens >= 8000."""
+        cfg = get_config()
+        assert cfg.llm.max_tokens_causal >= 8000, (
+            f"Expected max_tokens_causal >= 8000, got {cfg.llm.max_tokens_causal}"
+        )
+
+    def test_synthesis_generator_budget(self):
+        """synthesis_generator._generate_synthesis: max_tokens >= 2500."""
+        cfg = get_config()
+        assert cfg.llm.max_tokens_synthesis >= 2500, (
+            f"Expected max_tokens_synthesis >= 2500, got {cfg.llm.max_tokens_synthesis}"
+        )
+
+    def test_fact_extractor_budget(self):
+        """fact_extractor.extract: max_tokens >= 2500."""
+        cfg = get_config()
+        assert cfg.llm.max_tokens_extraction >= 2500, (
+            f"Expected max_tokens_extraction >= 2500, got {cfg.llm.max_tokens_extraction}"
+        )
+
+    def test_entity_indexer_ner_budget(self):
+        """entity_indexer.extract_llm: max_tokens >= 2500."""
+        cfg = get_config()
+        assert cfg.llm.max_tokens_ner >= 2500, (
+            f"Expected max_tokens_ner >= 2500, got {cfg.llm.max_tokens_ner}"
+        )
+
+    def test_memory_evolver_budget(self):
+        """memory_evolver.evaluate_evolution: max_tokens >= 2500."""
+        cfg = get_config()
+        assert cfg.llm.max_tokens_evolve >= 2500, (
+            f"Expected max_tokens_evolve >= 2500, got {cfg.llm.max_tokens_evolve}"
+        )


These tests only assert config defaults/overrides, but they don't verify that each call site actually passes the configured max_tokens_* value into llm_client.generate(). To catch regressions in the “plumbing”, consider mocking generate in each module and asserting the max_tokens argument matches the corresponding cfg.llm.max_tokens_* field.

Copilot · 2026-04-28T04:51:25Z

+def strip_thinking_tags(text: str) -> str:
+    """Strip  ********  ******** thinking tags from LLM output.
+
+    Reasoning models wrap internal reasoning in  ******** ...  ********
+    tags.  These must be removed before JSON extraction, otherwise the
+    regex searches match the  ******** tag text instead of the JSON
+    payload.
+
+    Args:
+        text: Raw LLM output that may contain  ******** ...  ******** tags.
+
+    Returns:
+        Cleaned text with **thinking**/** and <thinking>/** blocks removed.
+    """
+    return re.sub(r"(?:\*\*thinking\*\*.*?\*\*|<think(?:ing)?>.*?</think(?:ing)?>)", "", text, flags=re.DOTALL)


The docstring/comments for strip_thinking_tags() contain placeholder ******** text (and the Returns description is malformed). Replace this with the actual tag forms being stripped (e.g. <think>...</think>, <thinking>...</thinking>, and/or any markdown variants) so future readers understand what the function does.

Copilot · 2026-04-28T04:51:25Z

+import time
+from unittest.mock import patch, MagicMock


Unused imports time and MagicMock will trigger Ruff F401 in tests (repo enables F rules). Remove them or use them in the tests.

Suggested change

import time

from unittest.mock import patch, MagicMock

from unittest.mock import patch

Copilot · 2026-04-28T04:51:25Z

+        # Search for it
+        results = vm.search("special chars", k=5)
+        assert len(results) >= 1
+        assert special_text in results[0]["text"]


VectorMemory.search() takes top_k (not k). Update this call to use top_k to avoid TypeError.

Copilot · 2026-04-28T04:51:25Z

+        # Search should still work
+        results = vm.search("memory number 5", k=5)
+        assert len(results) >= 1
+        assert "memory number 5" in results[0]["text"]


VectorMemory.search() takes top_k (not k). Update this call to use top_k to avoid TypeError.

…rator

…g newlines Python's $ anchor matches before an optional trailing \n, allowing injection like '1.0\n'. \Z matches strictly at end of string. Fixes test_cccs_metadata on 3.12/3.13 CI runners.

rolandpg

Review summary

Scope is clean and matches the description: per-call-site max_tokens plumbing through LLMConfig, MemoryManager.flush() for bulk ingest, Plyara singleton caching, CCCS regex hardening (SEC-6/SEC-7), and the <think>-tag stripper. Tests for all three areas are included. I'm comfortable with this landing — the items below are non-blocking for the merge but worth a follow-up.

Blocking? No. Worth fixing in a follow-up:

1. strip_thinking_tags regex (json_parse.py)

r"(?:\*\*thinking\*\*.*?\*\*|<think(?:ing)?>.*?</think(?:ing)?>)"

The first alternative (**thinking**...**) doesn't match any reasoning model I'm aware of — DeepSeek-R1 / QwQ / Qwen3.5-thinking all emit <think>...</think> or <thinking>...</thinking>. The **...** close also isn't anchored, so it could greedily chew through legitimate markdown bold inside a JSON string value. Tests only cover the <think>/<thinking> paths, so the dead branch is also untested. Recommend either dropping the **thinking** alternative or adding a passing test that proves it's needed.

2. Plyara singleton caching is not thread-safe (yara/parser.py)
The comment says "parse_string() is thread-safe — it creates fresh parser objects per call and does not mutate shared instance state — so a single cached instance is safe to reuse", then on the very next call site the code does parser.clear() to flush accumulated state from a prior call. Two concurrent callers will race: thread A's clear() can wipe state mid-parse from thread B's parse_string(). Bulk ingest is currently single-threaded so this isn't observable today, but the comment is misleading and the contract is fragile. Suggest either (a) a threading.Lock around _get_plyara() + clear() + parse_string(), or (b) threading.local() for the cached instance.

3. CCCS author regex tightening is a behavior change (yara/cccs_metadata.py)
The old _AUTHOR_REGEX = ...|^[A-Z\s._\-]+$|^.*$ accepted multi-word authors with spaces ("CCCS Group") via the second alternative; the new ^[A-Za-z0-9_.@+-]+$ rejects them. The third alternative ^.*$ made the old regex a no-op so SEC-7 is the right call, but operators with existing rules whose author = "Some Org Name" will now fail strict-tier validation. Consider:

A short note in the v2.7.0 release notes calling this out, and
Either widening to ^[A-Za-z0-9_. @+-]+$ (allow space) or documenting that authors must be single tokens.

4. Anchoring inconsistency
You changed _VERSION_REGEX, _DATE_REGEX, _UUID_REGEX, _FINGERPRINT_REGEX, _MITRE_ATT_REGEX from $ to \Z (good — $ accepts trailing \n in default mode), but _AUTHOR_REGEX still ends with $. So "jdoe@CCCS\n" will pass author validation while "1.0\n" fails version validation. Suggest making author \Z too for consistency.

5. MemoryManager.flush() queue accounting (memory_manager.py)
The BackendClosedError branch returns without task_done(), which will leave _enrichment_queue.unfinished_tasks permanently > 0 and break any downstream queue.join(). Either decrement or move the task_done() to a try/finally.

6. _apply_reasoning_model_scaling doubles already-tuned values
If an operator already set max_tokens_causal: 16000 in YAML and toggles reasoning_model: true, the scaler bumps to 32000 — likely unintended. Suggest scaling against the defaults rather than the current value, or only bumping if current < reasoning_floor.

7. Scope creep: tests/test_vector_memory.py
261-line test file unrelated to LLM budgets / bulk ingest / CCCS regex. It exercises existing behavior without touching VectorMemory itself, so it's a backfill, not a regression test for this PR. Fine to keep, but the PR description should mention it.

8. Docstring de-Sphinxification
A bunch of docstrings lost :class:, :meth:, and double-backtick markup (sigma/ingest.py, others). MkDocs / docs build will render these as plain text. Not blocking but worth a single doc-formatting follow-up.

Verified clean

Env-var plumbing for all 6 new fields is correct and consistent (int() parse with warning on bad value).
LLMConfig.__repr__ is updated to include the new fields without leaking secrets.
All 5 production call-site swaps (note_constructor, synthesis_generator, fact_extractor, entity_indexer, memory_evolver) read from cfg.llm.max_tokens_* rather than literals.
SEC-6 fix (fullmatch instead of search) is correct and the test cases cover the multiline-injection vector.
New tests/* ruff ignore (S101, RUF012) is the standard pattern.

LGTM to merge after a green CI run; the items above can land in a follow-up.

Generated by Claude Code

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Patrick Roland <48327651+rolandpg@users.noreply.github.com>

Copilot AI review requested due to automatic review settings April 28, 2026 04:44

Copilot started reviewing on behalf of rolandpg April 28, 2026 04:45 View session

chatgpt-codex-connector Bot reviewed Apr 28, 2026

View reviewed changes

Comment thread src/zettelforge/memory_manager.py

Comment thread src/zettelforge/yara/parser.py

Copilot AI reviewed Apr 28, 2026

View reviewed changes

rolandpg added 3 commits April 27, 2026 23:53

fix(lint): add per-file-ignores for S101/RUF012 in test files

9f4c97f

style: ruff format fix for entity_indexer, json_parse, synthesis_gene…

b4f0d10

…rator

fix: use \Z anchor instead of $ on all CCCS regexes to reject trailin…

7bb4359

…g newlines Python's $ anchor matches before an optional trailing \n, allowing injection like '1.0\n'. \Z matches strictly at end of string. Fixes test_cccs_metadata on 3.12/3.13 CI runners.

rolandpg commented Apr 29, 2026

View reviewed changes

rolandpg mentioned this pull request Apr 29, 2026

fix(yara): tighten CCCS metadata regexes (SEC-6 / SEC-7) #145

Closed

Update tests/test_vector_memory.py

70d2182

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Patrick Roland <48327651+rolandpg@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: P0 items — LLM budget plumbing, bulk ingest, CCCS regex hardening#146

fix: P0 items — LLM budget plumbing, bulk ingest, CCCS regex hardening#146
rolandpg wants to merge 5 commits into
masterfrom
fix/p0-llm-budgets-bulk-ingest-cccs

rolandpg commented Apr 28, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Apr 28, 2026

Uh oh!

Copilot AI Apr 28, 2026

Uh oh!

Copilot AI Apr 28, 2026

Uh oh!

Copilot AI Apr 28, 2026

Uh oh!

Copilot AI Apr 28, 2026

Uh oh!

rolandpg left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	import time
	from unittest.mock import patch, MagicMock
	from unittest.mock import patch

Conversation

rolandpg commented Apr 28, 2026

Issue #125 — LLM budget plumbing

Issue #72 — Bulk ingest bottleneck

Issue #73 — CCCS regex hardening (SEC-6/SEC-7)

Testing

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

rolandpg left a comment

Choose a reason for hiding this comment

Review summary

Blocking? No. Worth fixing in a follow-up:

Verified clean

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants