Skip to content

fix: P0 items — LLM budget plumbing, bulk ingest, CCCS regex hardening#146

Open
rolandpg wants to merge 5 commits into
masterfrom
fix/p0-llm-budgets-bulk-ingest-cccs
Open

fix: P0 items — LLM budget plumbing, bulk ingest, CCCS regex hardening#146
rolandpg wants to merge 5 commits into
masterfrom
fix/p0-llm-budgets-bulk-ingest-cccs

Conversation

@rolandpg
Copy link
Copy Markdown
Owner

Implements all three P0 items for v2.7.0.

Issue #125 — LLM budget plumbing

  • Config-overridable max_tokens per call site (causal/synthesis/extraction/ner/evolve)
  • strip_thinking_tags() in json_parse.py to remove `` blocks
  • reasoning_model: bool auto-scaling flag (doubles budgets, >=180s timeout)
  • Env var + config.yaml support for all new fields
  • Regression tests at tests/test_max_tokens_budgets.py

Issue #72 — Bulk ingest bottleneck

  • bulk=True path on yara/ingest and sigma/ingest (sync=False + flush())
  • Cached Plyara singleton in yara/parser.py (grammar compiled once)
  • MemoryManager.flush() method for batch enrichment drain

Issue #73 — CCCS regex hardening (SEC-6/SEC-7)

  • _regex_match uses fullmatch() to prevent multiline injection
  • Author regex tightened to safe printable ASCII
  • Negative test cases in tests/test_cccs_metadata.py

Testing

…x hardening

Issue #125 — Harden reasoning-model LLM budget plumbing:
- Config-overridable max_tokens per call site
  (max_tokens_causal/synthesis/extraction/ner/evolve)
- strip_thinking_tags() in json_parse.py to remove  thinking...  blocks
- reasoning_model: bool auto-scaling flag
- Env var + config.yaml support for all new fields
- Regression tests at tests/test_max_tokens_budgets.py

Issue #72 — remember(sync=True) bulk ingest bottleneck:
- bulk=True path on yara/ingest and sigma/ingest (sync=False + flush())
- Cached Plyara singleton in yara/parser.py (grammar compiled once)
- MemoryManager.flush() method for batch enrichment drain

Issue #73 — CCCS regex hardening (SEC-6/SEC-7):
- _regex_match uses fullmatch() to prevent multiline injection
- Author regex tightened to safe printable ASCII
- Negative test cases in tests/test_cccs_metadata.py

All existing tests pass (125+).
Closes #125, closes #72, closes #73.
Copilot AI review requested due to automatic review settings April 28, 2026 04:44
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b8cbc50d78

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/zettelforge/memory_manager.py
Comment thread src/zettelforge/yara/parser.py
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements the v2.7.0 P0 follow-ups across LLM budget configuration, bulk ingest performance, and CCCS metadata regex hardening.

Changes:

  • Adds per-call-site max_tokens_* config knobs plus reasoning_model auto-scaling and <think>/<thinking> stripping during JSON extraction.
  • Speeds up bulk YARA/Sigma ingest via bulk=True (async enrichment + a final MemoryManager.flush()), and caches Plyara to avoid repeated grammar compilation.
  • Hardens CCCS metadata validation by tightening the author regex and switching regex checks to fullmatch() to prevent multiline-injection bypasses.

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 16 comments.

Show a summary per file
File Description
tests/test_vector_memory.py Adds VectorMemory tests (currently inconsistent with the implementation’s API).
tests/test_max_tokens_budgets.py Adds regression tests for budget defaults/env overrides and think-tag stripping.
tests/test_cccs_metadata.py Adds negative/positive tests for CCCS regex hardening.
src/zettelforge/yara/parser.py Introduces cached Plyara instance and clears per parse.
src/zettelforge/yara/ingest.py Adds bulk option to defer enrichment and flush once at end.
src/zettelforge/yara/cccs_metadata.py Tightens _AUTHOR_REGEX and uses fullmatch() in _regex_match.
src/zettelforge/synthesis_generator.py Plumbs cfg.llm.max_tokens_synthesis into generate().
src/zettelforge/sigma/ingest.py Adds sync plumb-through and bulk mode with optional flush.
src/zettelforge/note_constructor.py Plumbs cfg.llm.max_tokens_causal into generate().
src/zettelforge/memory_manager.py Adds flush() for synchronous enrichment draining after async ingest.
src/zettelforge/memory_evolver.py Plumbs cfg.llm.max_tokens_evolve into generate() (incl. retry).
src/zettelforge/json_parse.py Adds strip_thinking_tags() and applies it before fence stripping.
src/zettelforge/fact_extractor.py Plumbs cfg.llm.max_tokens_extraction into generate().
src/zettelforge/entity_indexer.py Plumbs cfg.llm.max_tokens_ner into generate() (incl. retry).
src/zettelforge/config.py Adds new LLM budget fields, env overrides, and reasoning-model scaling.
config.default.yaml Documents new env vars and YAML config keys for budgets/scaling.
src/zettelforge/scripts/init.py Part of the PR file set (no diff shown in provided context).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_vector_memory.py
Comment thread tests/test_vector_memory.py
Comment thread tests/test_cccs_metadata.py
Comment thread src/zettelforge/yara/parser.py
Comment thread tests/test_vector_memory.py Outdated
Comment on lines +19 to +52
def test_note_constructor_causal_budget(self):
"""note_constructor.extract_causal_triples: max_tokens >= 8000."""
cfg = get_config()
assert cfg.llm.max_tokens_causal >= 8000, (
f"Expected max_tokens_causal >= 8000, got {cfg.llm.max_tokens_causal}"
)

def test_synthesis_generator_budget(self):
"""synthesis_generator._generate_synthesis: max_tokens >= 2500."""
cfg = get_config()
assert cfg.llm.max_tokens_synthesis >= 2500, (
f"Expected max_tokens_synthesis >= 2500, got {cfg.llm.max_tokens_synthesis}"
)

def test_fact_extractor_budget(self):
"""fact_extractor.extract: max_tokens >= 2500."""
cfg = get_config()
assert cfg.llm.max_tokens_extraction >= 2500, (
f"Expected max_tokens_extraction >= 2500, got {cfg.llm.max_tokens_extraction}"
)

def test_entity_indexer_ner_budget(self):
"""entity_indexer.extract_llm: max_tokens >= 2500."""
cfg = get_config()
assert cfg.llm.max_tokens_ner >= 2500, (
f"Expected max_tokens_ner >= 2500, got {cfg.llm.max_tokens_ner}"
)

def test_memory_evolver_budget(self):
"""memory_evolver.evaluate_evolution: max_tokens >= 2500."""
cfg = get_config()
assert cfg.llm.max_tokens_evolve >= 2500, (
f"Expected max_tokens_evolve >= 2500, got {cfg.llm.max_tokens_evolve}"
)
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests only assert config defaults/overrides, but they don't verify that each call site actually passes the configured max_tokens_* value into llm_client.generate(). To catch regressions in the “plumbing”, consider mocking generate in each module and asserting the max_tokens argument matches the corresponding cfg.llm.max_tokens_* field.

Copilot uses AI. Check for mistakes.
Comment thread src/zettelforge/json_parse.py Outdated
Comment on lines +17 to +31
def strip_thinking_tags(text: str) -> str:
"""Strip ******** ******** thinking tags from LLM output.

Reasoning models wrap internal reasoning in ******** ... ********
tags. These must be removed before JSON extraction, otherwise the
regex searches match the ******** tag text instead of the JSON
payload.

Args:
text: Raw LLM output that may contain ******** ... ******** tags.

Returns:
Cleaned text with **thinking**/** and <thinking>/** blocks removed.
"""
return re.sub(r"(?:\*\*thinking\*\*.*?\*\*|<think(?:ing)?>.*?</think(?:ing)?>)", "", text, flags=re.DOTALL)
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring/comments for strip_thinking_tags() contain placeholder ******** text (and the Returns description is malformed). Replace this with the actual tag forms being stripped (e.g. <think>...</think>, <thinking>...</thinking>, and/or any markdown variants) so future readers understand what the function does.

Copilot uses AI. Check for mistakes.
Comment on lines +4 to +5
import time
from unittest.mock import patch, MagicMock
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused imports time and MagicMock will trigger Ruff F401 in tests (repo enables F rules). Remove them or use them in the tests.

Suggested change
import time
from unittest.mock import patch, MagicMock
from unittest.mock import patch

Copilot uses AI. Check for mistakes.
Comment on lines +236 to +239
# Search for it
results = vm.search("special chars", k=5)
assert len(results) >= 1
assert special_text in results[0]["text"]
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VectorMemory.search() takes top_k (not k). Update this call to use top_k to avoid TypeError.

Copilot uses AI. Check for mistakes.
Comment on lines +254 to +257
# Search should still work
results = vm.search("memory number 5", k=5)
assert len(results) >= 1
assert "memory number 5" in results[0]["text"]
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VectorMemory.search() takes top_k (not k). Update this call to use top_k to avoid TypeError.

Copilot uses AI. Check for mistakes.
…g newlines

Python's $ anchor matches before an optional trailing \n, allowing
injection like '1.0\n'. \Z matches strictly at end of string.
Fixes test_cccs_metadata on 3.12/3.13 CI runners.
Copy link
Copy Markdown
Owner Author

@rolandpg rolandpg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review summary

Scope is clean and matches the description: per-call-site max_tokens plumbing through LLMConfig, MemoryManager.flush() for bulk ingest, Plyara singleton caching, CCCS regex hardening (SEC-6/SEC-7), and the <think>-tag stripper. Tests for all three areas are included. I'm comfortable with this landing — the items below are non-blocking for the merge but worth a follow-up.

Blocking? No. Worth fixing in a follow-up:

1. strip_thinking_tags regex (json_parse.py)

r"(?:\*\*thinking\*\*.*?\*\*|<think(?:ing)?>.*?</think(?:ing)?>)"

The first alternative (**thinking**...**) doesn't match any reasoning model I'm aware of — DeepSeek-R1 / QwQ / Qwen3.5-thinking all emit <think>...</think> or <thinking>...</thinking>. The **...** close also isn't anchored, so it could greedily chew through legitimate markdown bold inside a JSON string value. Tests only cover the <think>/<thinking> paths, so the dead branch is also untested. Recommend either dropping the **thinking** alternative or adding a passing test that proves it's needed.

2. Plyara singleton caching is not thread-safe (yara/parser.py)
The comment says "parse_string() is thread-safe — it creates fresh parser objects per call and does not mutate shared instance state — so a single cached instance is safe to reuse", then on the very next call site the code does parser.clear() to flush accumulated state from a prior call. Two concurrent callers will race: thread A's clear() can wipe state mid-parse from thread B's parse_string(). Bulk ingest is currently single-threaded so this isn't observable today, but the comment is misleading and the contract is fragile. Suggest either (a) a threading.Lock around _get_plyara() + clear() + parse_string(), or (b) threading.local() for the cached instance.

3. CCCS author regex tightening is a behavior change (yara/cccs_metadata.py)
The old _AUTHOR_REGEX = ...|^[A-Z\s._\-]+$|^.*$ accepted multi-word authors with spaces ("CCCS Group") via the second alternative; the new ^[A-Za-z0-9_.@+-]+$ rejects them. The third alternative ^.*$ made the old regex a no-op so SEC-7 is the right call, but operators with existing rules whose author = "Some Org Name" will now fail strict-tier validation. Consider:

  • A short note in the v2.7.0 release notes calling this out, and
  • Either widening to ^[A-Za-z0-9_. @+-]+$ (allow space) or documenting that authors must be single tokens.

4. Anchoring inconsistency
You changed _VERSION_REGEX, _DATE_REGEX, _UUID_REGEX, _FINGERPRINT_REGEX, _MITRE_ATT_REGEX from $ to \Z (good — $ accepts trailing \n in default mode), but _AUTHOR_REGEX still ends with $. So "jdoe@CCCS\n" will pass author validation while "1.0\n" fails version validation. Suggest making author \Z too for consistency.

5. MemoryManager.flush() queue accounting (memory_manager.py)
The BackendClosedError branch returns without task_done(), which will leave _enrichment_queue.unfinished_tasks permanently > 0 and break any downstream queue.join(). Either decrement or move the task_done() to a try/finally.

6. _apply_reasoning_model_scaling doubles already-tuned values
If an operator already set max_tokens_causal: 16000 in YAML and toggles reasoning_model: true, the scaler bumps to 32000 — likely unintended. Suggest scaling against the defaults rather than the current value, or only bumping if current < reasoning_floor.

7. Scope creep: tests/test_vector_memory.py
261-line test file unrelated to LLM budgets / bulk ingest / CCCS regex. It exercises existing behavior without touching VectorMemory itself, so it's a backfill, not a regression test for this PR. Fine to keep, but the PR description should mention it.

8. Docstring de-Sphinxification
A bunch of docstrings lost :class:, :meth:, and double-backtick markup (sigma/ingest.py, others). MkDocs / docs build will render these as plain text. Not blocking but worth a single doc-formatting follow-up.

Verified clean

  • Env-var plumbing for all 6 new fields is correct and consistent (int() parse with warning on bad value).
  • LLMConfig.__repr__ is updated to include the new fields without leaking secrets.
  • All 5 production call-site swaps (note_constructor, synthesis_generator, fact_extractor, entity_indexer, memory_evolver) read from cfg.llm.max_tokens_* rather than literals.
  • SEC-6 fix (fullmatch instead of search) is correct and the test cases cover the multiline-injection vector.
  • New tests/* ruff ignore (S101, RUF012) is the standard pattern.

LGTM to merge after a green CI run; the items above can land in a follow-up.


Generated by Claude Code

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Patrick Roland <48327651+rolandpg@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants