fix(yara): tighten CCCS metadata regexes (SEC-6 / SEC-7)#145
Conversation
SEC-6: _regex_match now uses p.fullmatch() instead of p.search() to prevent multiline injection bypass in sharing/category/status regexes. TLP:WHITE\nid: hostile_id no longer matches. SEC-7: Author regex tightened to safe printable ASCII range. Removed permissive ^.*$ catch-all fallback. Added targeted negative test cases for both findings. Closes #73.
There was a problem hiding this comment.
Pull request overview
This PR tightens CCCS YARA metadata validation in zettelforge.yara.cccs_metadata to address security review findings SEC-6 (multiline/substring bypass) and SEC-7 (overly permissive author pattern), and adds targeted regression tests.
Changes:
- Switch
_regex_matchfromre.search()tore.fullmatch()to prevent multiline/substring bypasses (SEC-6). - Tighten
_AUTHOR_REGEXto a restricted safe-character set and remove the permissive catch-all fallback (SEC-7). - Add
tests/test_cccs_metadata.pywith negative test cases covering both findings and end-to-end tier behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/zettelforge/yara/cccs_metadata.py |
Tightens regex matching semantics (fullmatch) and restricts author metadata format. |
tests/test_cccs_metadata.py |
Adds regression tests for multiline injection and author regex tightening across tiers. |
Comments suppressed due to low confidence (1)
src/zettelforge/yara/cccs_metadata.py:90
- Several single-value validators use
re.Pattern.match()with^...$-anchored regexes (_AUTHOR_REGEX/_VERSION_REGEX/_DATE_REGEX/_UUID_REGEX/_FINGERPRINT_REGEX/_MITRE_ATT_REGEX). In Python,$can match before a terminal newline, so inputs like"1.0\n"or"CCCS\n"can still pass.match(). To fully close the newline edge case (and to align with SEC-6’s full-match intent), switch these checks to.fullmatch()(or change the patterns to use\Zinstead of$).
# SEC-7: Tightened from ^[a-zA-Z]+\@[A-Z]+$|^[A-Z\s._\-]+$|^.*$ to
# only permit safe printable ASCII characters.
_AUTHOR_REGEX = re.compile(r"^[A-Za-z0-9_.@+-]+$")
_VERSION_REGEX = re.compile(r"^\d+\.\d+$")
_DATE_REGEX = re.compile(r"^\d{4}-\d{2}-\d{2}$")
_UUID_REGEX = re.compile(r"^[0-9A-Za-z]{16,}$") # base62 UUID, generous lower bound.
_FINGERPRINT_REGEX = re.compile(r"^[a-fA-F0-9]{40,64}$") # SHA-1 / SHA-256-ish.
_MITRE_ATT_REGEX = re.compile(r"^(TA|T|M|G|S)\d{4}(\.\d{3})?$")
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # From CCCS_YARA.yml: author's own regexExpression. | ||
| _AUTHOR_REGEX = re.compile(r"^[a-zA-Z]+\@[A-Z]+$|^[A-Z\s._\-]+$|^.*$") | ||
| # SEC-7: Tightened from ^[a-zA-Z]+\@[A-Z]+$|^[A-Z\s._\-]+$|^.*$ to | ||
| # only permit safe printable ASCII characters. |
There was a problem hiding this comment.
The comment says the author regex permits “safe printable ASCII characters”, but the actual character class is a much narrower subset ([A-Za-z0-9_.@+-]). Consider updating the wording to reflect the exact allowed set (or to say “safe ASCII subset”) to avoid over-claiming what’s permitted.
| # only permit safe printable ASCII characters. | |
| # only permit a safe ASCII subset: letters, digits, and _ . @ + -. |
| _MALWARE_TYPE_REGEXES, | ||
| _ACTOR_TYPE_REGEXES, | ||
| _HASH_REGEXES, |
There was a problem hiding this comment.
This import block includes several names that are never referenced in this test module (_MALWARE_TYPE_REGEXES, _ACTOR_TYPE_REGEXES, _HASH_REGEXES). With Ruff's F401 enabled for tests, this will fail lint; either remove these imports or add assertions that exercise them.
| _MALWARE_TYPE_REGEXES, | |
| _ACTOR_TYPE_REGEXES, | |
| _HASH_REGEXES, |
| NEGATIVE = ["", "1", "1.0.0", "1.", ".1", "a.b", "1.0\n"] | ||
|
|
||
| def test_positive(self) -> None: | ||
| for v in self.POSITIVE: | ||
| assert _VERSION_REGEX.match(v), f"Version should accept: {v!r}" |
There was a problem hiding this comment.
These tests use regex.match(...) with patterns ending in $ and expect values with a trailing newline (e.g. "1.0\n") to be rejected. In Python re, $ can match before a final newline, so .match() will still succeed here. Use .fullmatch() in these tests (and ideally in the validator) or change the regex to end with \Z so trailing newlines can’t slip through.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b7d4c07e9f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| def test_negative(self) -> None: | ||
| for v in self.NEGATIVE: | ||
| assert not _VERSION_REGEX.match(v), f"Version should reject: {v!r}" |
There was a problem hiding this comment.
Use fullmatch for newline-rejection regex tests
These negative cases are asserting that strings like "1.0\n" do not match, but the test uses re.Pattern.match() and Python's $ anchor matches before a trailing newline, so ^...$ still passes for newline-terminated inputs. This makes the newly added SEC-6/SEC-7 test suite fail deterministically (same issue repeats in Date/UUID/Fingerprint/MITRE classes), so the commit cannot validate as written unless these checks switch to fullmatch() (or the newline cases are removed).
Useful? React with 👍 / 👎.
|
Closing in favor of #146, which is the superset: same SEC-6/SEC-7 CCCS regex hardening, plus the LLM token-budget plumbing (closes #125) and the bulk-ingest path (closes #72). Closes #73 there as well. If #146 has to be split for any reason, we can re-open this — but right now keeping both open just means whichever lands second will conflict on Generated by Claude Code |
Tightens CCCS metadata validation per security review findings SEC-6 and SEC-7.
_regex_matchnow usesp.fullmatch()instead ofp.search()to prevent multiline injection bypass^.*$catch-all fallbacktests/test_cccs_metadata.pywith negative test cases for both findingsCloses #73.