fix: P0 items — LLM budget plumbing, bulk ingest, CCCS regex hardening#146
fix: P0 items — LLM budget plumbing, bulk ingest, CCCS regex hardening#146rolandpg wants to merge 5 commits into
Conversation
…x hardening Issue #125 — Harden reasoning-model LLM budget plumbing: - Config-overridable max_tokens per call site (max_tokens_causal/synthesis/extraction/ner/evolve) - strip_thinking_tags() in json_parse.py to remove thinking... blocks - reasoning_model: bool auto-scaling flag - Env var + config.yaml support for all new fields - Regression tests at tests/test_max_tokens_budgets.py Issue #72 — remember(sync=True) bulk ingest bottleneck: - bulk=True path on yara/ingest and sigma/ingest (sync=False + flush()) - Cached Plyara singleton in yara/parser.py (grammar compiled once) - MemoryManager.flush() method for batch enrichment drain Issue #73 — CCCS regex hardening (SEC-6/SEC-7): - _regex_match uses fullmatch() to prevent multiline injection - Author regex tightened to safe printable ASCII - Negative test cases in tests/test_cccs_metadata.py All existing tests pass (125+). Closes #125, closes #72, closes #73.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b8cbc50d78
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
Implements the v2.7.0 P0 follow-ups across LLM budget configuration, bulk ingest performance, and CCCS metadata regex hardening.
Changes:
- Adds per-call-site
max_tokens_*config knobs plusreasoning_modelauto-scaling and<think>/<thinking>stripping during JSON extraction. - Speeds up bulk YARA/Sigma ingest via
bulk=True(async enrichment + a finalMemoryManager.flush()), and cachesPlyarato avoid repeated grammar compilation. - Hardens CCCS metadata validation by tightening the author regex and switching regex checks to
fullmatch()to prevent multiline-injection bypasses.
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 16 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_vector_memory.py | Adds VectorMemory tests (currently inconsistent with the implementation’s API). |
| tests/test_max_tokens_budgets.py | Adds regression tests for budget defaults/env overrides and think-tag stripping. |
| tests/test_cccs_metadata.py | Adds negative/positive tests for CCCS regex hardening. |
| src/zettelforge/yara/parser.py | Introduces cached Plyara instance and clears per parse. |
| src/zettelforge/yara/ingest.py | Adds bulk option to defer enrichment and flush once at end. |
| src/zettelforge/yara/cccs_metadata.py | Tightens _AUTHOR_REGEX and uses fullmatch() in _regex_match. |
| src/zettelforge/synthesis_generator.py | Plumbs cfg.llm.max_tokens_synthesis into generate(). |
| src/zettelforge/sigma/ingest.py | Adds sync plumb-through and bulk mode with optional flush. |
| src/zettelforge/note_constructor.py | Plumbs cfg.llm.max_tokens_causal into generate(). |
| src/zettelforge/memory_manager.py | Adds flush() for synchronous enrichment draining after async ingest. |
| src/zettelforge/memory_evolver.py | Plumbs cfg.llm.max_tokens_evolve into generate() (incl. retry). |
| src/zettelforge/json_parse.py | Adds strip_thinking_tags() and applies it before fence stripping. |
| src/zettelforge/fact_extractor.py | Plumbs cfg.llm.max_tokens_extraction into generate(). |
| src/zettelforge/entity_indexer.py | Plumbs cfg.llm.max_tokens_ner into generate() (incl. retry). |
| src/zettelforge/config.py | Adds new LLM budget fields, env overrides, and reasoning-model scaling. |
| config.default.yaml | Documents new env vars and YAML config keys for budgets/scaling. |
| src/zettelforge/scripts/init.py | Part of the PR file set (no diff shown in provided context). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def test_note_constructor_causal_budget(self): | ||
| """note_constructor.extract_causal_triples: max_tokens >= 8000.""" | ||
| cfg = get_config() | ||
| assert cfg.llm.max_tokens_causal >= 8000, ( | ||
| f"Expected max_tokens_causal >= 8000, got {cfg.llm.max_tokens_causal}" | ||
| ) | ||
|
|
||
| def test_synthesis_generator_budget(self): | ||
| """synthesis_generator._generate_synthesis: max_tokens >= 2500.""" | ||
| cfg = get_config() | ||
| assert cfg.llm.max_tokens_synthesis >= 2500, ( | ||
| f"Expected max_tokens_synthesis >= 2500, got {cfg.llm.max_tokens_synthesis}" | ||
| ) | ||
|
|
||
| def test_fact_extractor_budget(self): | ||
| """fact_extractor.extract: max_tokens >= 2500.""" | ||
| cfg = get_config() | ||
| assert cfg.llm.max_tokens_extraction >= 2500, ( | ||
| f"Expected max_tokens_extraction >= 2500, got {cfg.llm.max_tokens_extraction}" | ||
| ) | ||
|
|
||
| def test_entity_indexer_ner_budget(self): | ||
| """entity_indexer.extract_llm: max_tokens >= 2500.""" | ||
| cfg = get_config() | ||
| assert cfg.llm.max_tokens_ner >= 2500, ( | ||
| f"Expected max_tokens_ner >= 2500, got {cfg.llm.max_tokens_ner}" | ||
| ) | ||
|
|
||
| def test_memory_evolver_budget(self): | ||
| """memory_evolver.evaluate_evolution: max_tokens >= 2500.""" | ||
| cfg = get_config() | ||
| assert cfg.llm.max_tokens_evolve >= 2500, ( | ||
| f"Expected max_tokens_evolve >= 2500, got {cfg.llm.max_tokens_evolve}" | ||
| ) |
There was a problem hiding this comment.
These tests only assert config defaults/overrides, but they don't verify that each call site actually passes the configured max_tokens_* value into llm_client.generate(). To catch regressions in the “plumbing”, consider mocking generate in each module and asserting the max_tokens argument matches the corresponding cfg.llm.max_tokens_* field.
| def strip_thinking_tags(text: str) -> str: | ||
| """Strip ******** ******** thinking tags from LLM output. | ||
|
|
||
| Reasoning models wrap internal reasoning in ******** ... ******** | ||
| tags. These must be removed before JSON extraction, otherwise the | ||
| regex searches match the ******** tag text instead of the JSON | ||
| payload. | ||
|
|
||
| Args: | ||
| text: Raw LLM output that may contain ******** ... ******** tags. | ||
|
|
||
| Returns: | ||
| Cleaned text with **thinking**/** and <thinking>/** blocks removed. | ||
| """ | ||
| return re.sub(r"(?:\*\*thinking\*\*.*?\*\*|<think(?:ing)?>.*?</think(?:ing)?>)", "", text, flags=re.DOTALL) |
There was a problem hiding this comment.
The docstring/comments for strip_thinking_tags() contain placeholder ******** text (and the Returns description is malformed). Replace this with the actual tag forms being stripped (e.g. <think>...</think>, <thinking>...</thinking>, and/or any markdown variants) so future readers understand what the function does.
| import time | ||
| from unittest.mock import patch, MagicMock |
There was a problem hiding this comment.
Unused imports time and MagicMock will trigger Ruff F401 in tests (repo enables F rules). Remove them or use them in the tests.
| import time | |
| from unittest.mock import patch, MagicMock | |
| from unittest.mock import patch |
| # Search for it | ||
| results = vm.search("special chars", k=5) | ||
| assert len(results) >= 1 | ||
| assert special_text in results[0]["text"] |
There was a problem hiding this comment.
VectorMemory.search() takes top_k (not k). Update this call to use top_k to avoid TypeError.
| # Search should still work | ||
| results = vm.search("memory number 5", k=5) | ||
| assert len(results) >= 1 | ||
| assert "memory number 5" in results[0]["text"] |
There was a problem hiding this comment.
VectorMemory.search() takes top_k (not k). Update this call to use top_k to avoid TypeError.
…g newlines Python's $ anchor matches before an optional trailing \n, allowing injection like '1.0\n'. \Z matches strictly at end of string. Fixes test_cccs_metadata on 3.12/3.13 CI runners.
rolandpg
left a comment
There was a problem hiding this comment.
Review summary
Scope is clean and matches the description: per-call-site max_tokens plumbing through LLMConfig, MemoryManager.flush() for bulk ingest, Plyara singleton caching, CCCS regex hardening (SEC-6/SEC-7), and the <think>-tag stripper. Tests for all three areas are included. I'm comfortable with this landing — the items below are non-blocking for the merge but worth a follow-up.
Blocking? No. Worth fixing in a follow-up:
1. strip_thinking_tags regex (json_parse.py)
r"(?:\*\*thinking\*\*.*?\*\*|<think(?:ing)?>.*?</think(?:ing)?>)"The first alternative (**thinking**...**) doesn't match any reasoning model I'm aware of — DeepSeek-R1 / QwQ / Qwen3.5-thinking all emit <think>...</think> or <thinking>...</thinking>. The **...** close also isn't anchored, so it could greedily chew through legitimate markdown bold inside a JSON string value. Tests only cover the <think>/<thinking> paths, so the dead branch is also untested. Recommend either dropping the **thinking** alternative or adding a passing test that proves it's needed.
2. Plyara singleton caching is not thread-safe (yara/parser.py)
The comment says "parse_string() is thread-safe — it creates fresh parser objects per call and does not mutate shared instance state — so a single cached instance is safe to reuse", then on the very next call site the code does parser.clear() to flush accumulated state from a prior call. Two concurrent callers will race: thread A's clear() can wipe state mid-parse from thread B's parse_string(). Bulk ingest is currently single-threaded so this isn't observable today, but the comment is misleading and the contract is fragile. Suggest either (a) a threading.Lock around _get_plyara() + clear() + parse_string(), or (b) threading.local() for the cached instance.
3. CCCS author regex tightening is a behavior change (yara/cccs_metadata.py)
The old _AUTHOR_REGEX = ...|^[A-Z\s._\-]+$|^.*$ accepted multi-word authors with spaces ("CCCS Group") via the second alternative; the new ^[A-Za-z0-9_.@+-]+$ rejects them. The third alternative ^.*$ made the old regex a no-op so SEC-7 is the right call, but operators with existing rules whose author = "Some Org Name" will now fail strict-tier validation. Consider:
- A short note in the v2.7.0 release notes calling this out, and
- Either widening to
^[A-Za-z0-9_. @+-]+$(allow space) or documenting that authors must be single tokens.
4. Anchoring inconsistency
You changed _VERSION_REGEX, _DATE_REGEX, _UUID_REGEX, _FINGERPRINT_REGEX, _MITRE_ATT_REGEX from $ to \Z (good — $ accepts trailing \n in default mode), but _AUTHOR_REGEX still ends with $. So "jdoe@CCCS\n" will pass author validation while "1.0\n" fails version validation. Suggest making author \Z too for consistency.
5. MemoryManager.flush() queue accounting (memory_manager.py)
The BackendClosedError branch returns without task_done(), which will leave _enrichment_queue.unfinished_tasks permanently > 0 and break any downstream queue.join(). Either decrement or move the task_done() to a try/finally.
6. _apply_reasoning_model_scaling doubles already-tuned values
If an operator already set max_tokens_causal: 16000 in YAML and toggles reasoning_model: true, the scaler bumps to 32000 — likely unintended. Suggest scaling against the defaults rather than the current value, or only bumping if current < reasoning_floor.
7. Scope creep: tests/test_vector_memory.py
261-line test file unrelated to LLM budgets / bulk ingest / CCCS regex. It exercises existing behavior without touching VectorMemory itself, so it's a backfill, not a regression test for this PR. Fine to keep, but the PR description should mention it.
8. Docstring de-Sphinxification
A bunch of docstrings lost :class:, :meth:, and double-backtick markup (sigma/ingest.py, others). MkDocs / docs build will render these as plain text. Not blocking but worth a single doc-formatting follow-up.
Verified clean
- Env-var plumbing for all 6 new fields is correct and consistent (
int()parse with warning on bad value). LLMConfig.__repr__is updated to include the new fields without leaking secrets.- All 5 production call-site swaps (note_constructor, synthesis_generator, fact_extractor, entity_indexer, memory_evolver) read from
cfg.llm.max_tokens_*rather than literals. - SEC-6 fix (
fullmatchinstead ofsearch) is correct and the test cases cover the multiline-injection vector. - New
tests/*ruff ignore (S101, RUF012) is the standard pattern.
LGTM to merge after a green CI run; the items above can land in a follow-up.
Generated by Claude Code
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Patrick Roland <48327651+rolandpg@users.noreply.github.com>
Implements all three P0 items for v2.7.0.
Issue #125 — LLM budget plumbing
max_tokensper call site (causal/synthesis/extraction/ner/evolve)strip_thinking_tags()injson_parse.pyto remove `` blocksreasoning_model: boolauto-scaling flag (doubles budgets, >=180s timeout)tests/test_max_tokens_budgets.pyIssue #72 — Bulk ingest bottleneck
bulk=Truepath onyara/ingestandsigma/ingest(sync=False + flush())Plyarasingleton inyara/parser.py(grammar compiled once)MemoryManager.flush()method for batch enrichment drainIssue #73 — CCCS regex hardening (SEC-6/SEC-7)
_regex_matchusesfullmatch()to prevent multiline injectiontests/test_cccs_metadata.pyTesting