RFC-0001 OSINT Layer — scaffold (Phase 1–5) #147
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 830156f979
Pull request overview
Adds a new zettelforge.osint subpackage that scaffolds the RFC-0001 OSINT Layer: an OSINT ontology (entities/edges), a self-registering async collector registry, canonical entity key normalization, and an in-memory “Investigation” case-scoping API, plus tests and a short RFC doc.
Changes:
- Introduces OSINT ontology declarations (Phase 1–5) and an async collector registry (TRANSFORM_REGISTRY) with multiple collector stubs.
- Adds canonicalization utilities (IPv4/domain/phone/ASN/netblock) and an in-memory investigation store with JSON export.
- Adds initial test coverage for ontology shapes, registry presence, and canonicalization behavior; adds RFC documentation.
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 35 comments.
| File | Description |
|---|---|
| tests/test_osint_entities.py | Tests OSINT entity/edge schema structure + canonicalization expectations. |
| tests/test_osint_collectors.py | Tests transform registry population, collector async signatures, and investigation API. |
| src/zettelforge/osint/transform_registry.py | Defines TransformRegistry + TransformEntry and global singleton registry. |
| src/zettelforge/osint/ontology.py | Declares Phase 1–5 OSINT entity/edge types and combined ONTOLOGY. |
| src/zettelforge/osint/investigation.py | Adds in-memory Investigation model + CRUD-like helpers + export. |
| src/zettelforge/osint/entity_resolver.py | Adds canonicalization functions + alias indexing + KG insertion helper. |
| src/zettelforge/osint/collectors/__init__.py | Collectors package entrypoint (re-exports registry). |
| src/zettelforge/osint/collectors/infrastructure/__init__.py | Infrastructure collectors package marker (re-exports registry). |
| src/zettelforge/osint/collectors/infrastructure/dns_collector.py | DNS A/AAAA/NS/MX/TXT/PTR collectors (dnspython-based). |
| src/zettelforge/osint/collectors/infrastructure/whois_collector.py | WHOIS collectors (domain + IP). |
| src/zettelforge/osint/collectors/infrastructure/cert_collector.py | crt.sh certificate transparency collector. |
| src/zettelforge/osint/collectors/infrastructure/bgp_collector.py | BGPView-based ASN/IP lookup collectors. |
| src/zettelforge/osint/collectors/infrastructure/port_scanner.py | nmap-based active port scan collector stub. |
| src/zettelforge/osint/collectors/people/__init__.py | People collectors package marker (re-exports registry). |
| src/zettelforge/osint/collectors/people/hunter_collector.py | Hunter.io collectors for domain/email enrichment. |
| src/zettelforge/osint/collectors/people/holehe_collector.py | holehe-based email-to-account enumeration collector. |
| src/zettelforge/osint/collectors/people/namechk_collector.py | namechk.com username availability collector. |
| src/zettelforge/osint/collectors/tech/__init__.py | Tech collectors package marker (re-exports registry). |
| src/zettelforge/osint/collectors/tech/wappalyzer_collector.py | Wappalyzer API / header-fingerprint collector stub. |
| src/zettelforge/osint/collectors/tech/builtwith_collector.py | BuiltWith API collector stub. |
| src/zettelforge/osint/collectors/social/__init__.py | Social collectors package marker (re-exports registry). |
| src/zettelforge/osint/collectors/social/twitter_collector.py | Twitter/X recent search collector stub. |
| src/zettelforge/osint/collectors/social/hashtag_tracker.py | Hashtag tracking collector stub. |
| src/zettelforge/osint/collectors/breach/__init__.py | Breach collectors package marker (re-exports registry). |
| src/zettelforge/osint/collectors/breach/hibp_collector.py | HaveIBeenPwned breach lookup collector stub. |
| src/zettelforge/osint/collectors/breach/breach_directory.py | Paste-site monitoring collector stub. |
| src/zettelforge/osint/__init__.py | OSINT package entrypoint exporting ONTOLOGY. |
| src/zettelforge/__init__.py | Adds OSINT ontology re-export (currently broken due to __all__ ordering). |
| docs/rfc-0001-osint-layer.md | RFC overview doc for the OSINT layer scaffold. |
Replaces the original Phase 1-5 scaffold commit (830156f) with an implementation that integrates with the existing ontology validator and ships working Phase 1 collectors with mocked tests.

What changed vs the original scaffold:
- ontology.py: Phase 1-5 entity / edge declarations, all using the from_types/to_types/cardinality shape that OntologyValidator actually validates against. The scaffold's "from": "*", "to": "A|B" syntax was invisible to the validator and silently waved every relation through.
- ontology.py: canonicalization helpers (asn, cidr, domain, ipv6, mx, port, url, web_title) used by collectors and tests.
- ontology.py: merge_into_global_ontology(), an idempotent merge into core ENTITY_TYPES / RELATION_TYPES at import time, no schema migration.
- transform_registry.py: function-based register(metadata, fn) API, CollectorTuple NamedTuple, idempotent re-registration. Scaffold used a decorator pattern with no idempotency.
- collectors are sync (matches existing yara/sigma pipelines; the codebase has zero asyncio usage today). Scaffold used async def.
- Phase 1 collectors are functional with mocked tests:
  - dns_collector: A/AAAA/NS/MX with NXDOMAIN/Timeout absorbed
  - whois_collector: domain via python-whois, IP via ipwhois RDAP (graceful ImportError handling per AGENTS.OE Override 4: surface failures, no silent retry)
  - cert_collector: crt.sh SAN enumeration, dedup, HTTP error → []
- Phase 1.5 collectors (bgp_collector, port_scanner) and Phase 2-5 stubs (Hunter, Holehe, Namechk, Wappalyzer, BuiltWith, Twitter, Hashtag, HIBP, Breach Directory) register their metadata so discovery works, and return [] until their integration ships. port_scanner is gated behind ZETTELFORGE_OSINT_ACTIVE_SCAN=1.
- IPv6Address added to core ENTITY_TYPES (parity with IPv4Address).
- Top-level __init__.py: side-effect import of osint subpackage. The original scaffold's __init__ called __all__ += [...] before __all__ was defined, which would have raised NameError on import.
- Tests: 67 mocked tests in tests/test_osint_entities.py + tests/test_osint_collectors.py covering entity validation, edge validation, canonicalization, collector shape, registry dispatch.
- docs/rfcs/RFC-016-osint-layer.md: canonical RFC with Status block documenting the three deviations from the literal RFC text (single kg_nodes table, no CLI in this PR, sync collectors).
- SCOPING_DOC.md: Phase 1 planning notes.
- docs/rfc-0001-osint-layer.md (scaffold) removed; docs/rfcs/RFC-016 is the canonical doc.

Verification:
- 67/67 osint tests pass (0.12s)
- 121/121 focused subset (basic + kg_edge_schema + extensions + consolidation + osint) pass (3.71s)
- ruff check src/zettelforge: clean
- ruff format --check src/zettelforge: clean
- Smoke import: 14 collectors registered, all entity/edge types merged.

Out of scope (deferred):
- Real BGPView / Wappalyzer / Hunter / etc. integrations (Phase 1.5+)
- Investigation workflow engine + state machine (Phase 4)
- Top-level zettelforge CLI (Phase 4 brings it alongside the workflow engine)
- Container packaging (RFC §10 already defers to vNext per docs/03)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
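For reviewers unfamiliar with the function-based registry described in the commit above, the following is a minimal sketch of how an idempotent register(metadata, fn) call could behave. Only the names TransformRegistry, TransformMetadata, CollectorTuple and register(metadata, fn) come from this PR; the metadata fields, the for_input_type helper, the "EmailAddress" input type, and the phase value are assumptions made for illustration, not the merged code.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple

CollectorFn = Callable[[str, str], List[dict]]


class TransformMetadata(NamedTuple):
    # Hypothetical fields; the real TransformMetadata may carry more.
    name: str
    input_types: Tuple[str, ...]
    phase: str


class CollectorTuple(NamedTuple):
    metadata: TransformMetadata
    fn: CollectorFn


class TransformRegistry:
    def __init__(self) -> None:
        self._collectors: Dict[str, CollectorTuple] = {}

    def register(self, metadata: TransformMetadata, fn: CollectorFn) -> None:
        # Keyed by collector name: registering twice overwrites the entry
        # instead of appending a duplicate, so repeated imports are harmless.
        self._collectors[metadata.name] = CollectorTuple(metadata, fn)

    def for_input_type(self, input_type: str) -> List[CollectorTuple]:
        # Hypothetical discovery helper: collectors that accept this entity type.
        return [
            c for c in self._collectors.values()
            if input_type in c.metadata.input_types
        ]

    def __len__(self) -> int:
        return len(self._collectors)


def hibp_stub(input_type: str, value: str) -> List[dict]:
    # Sync stub: the metadata is discoverable, results arrive in a later phase.
    return []


registry = TransformRegistry()
meta = TransformMetadata(name="hibp_collector", input_types=("EmailAddress",), phase="5")
registry.register(meta, hibp_stub)
registry.register(meta, hibp_stub)  # second call replaces the first entry
```

Keying the store by name is one simple way to get idempotency; the actual registry may detect duplicates differently.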
Force-pushed from 830156f to ecb3f12.
Addresses the two CI failures from ecb3f12:
- 3 dns_collector tests failed with ModuleNotFoundError: No module named 'dns'. The tests use dns.resolver.NXDOMAIN to construct realistic mocks; that requires dnspython to actually be installed even though the collector calls themselves are mocked. Adds a new [osint] optional dependency with dnspython, python-whois, and ipwhois (the runtime Phase 1 collectors need them anyway), and pulls it into [dev] so CI's pip install -e .[dev] resolves them.
- Total coverage was 66.74% vs 67% required (GOV-007). The OSINT layer added ~720 statements with most stub branches uncovered. Adds three parametrized tests over every registered collector:
  1. unsupported input returns a list (covers early-return)
  2. metadata is well-formed (covers TransformMetadata fields)
  3. every declared input_type is callable without raising
  That is 14 collectors x 3 tests = 42 new tests, all using mocked or short-circuited paths (no network, no API keys).

Local results:
- 109/109 osint tests pass (was 67; +42 from parametrized smoke tests)
- ruff check src/zettelforge/osint: clean
- OSINT package coverage: 70% (was untested for stubs)

No force-push needed; this is a follow-up commit on rfc/osint-layer-scaffold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
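The three per-collector smoke checks listed in the commit above can be sketched roughly as follows. This is a dependency-free illustration that loops over collectors in plain Python rather than using pytest parametrization; the Collector tuple, both collector functions, and the "NoSuchType" probe value are hypothetical stand-ins, not the tests in tests/test_osint_collectors.py.

```python
from typing import Callable, List, NamedTuple, Tuple


class Collector(NamedTuple):
    # Hypothetical flattened view of a registry entry.
    name: str
    input_types: Tuple[str, ...]
    fn: Callable[[str, str], List[dict]]


def stub_collector(input_type: str, value: str) -> List[dict]:
    # Stubs return an empty list until their integration ships.
    return []


def domain_collector(input_type: str, value: str) -> List[dict]:
    # Early return for input types this collector does not declare.
    if input_type != "DomainName":
        return []
    return [{"entity_type": "DomainName", "value": value.lower()}]


COLLECTORS = [
    Collector("stub_collector", ("EmailAddress",), stub_collector),
    Collector("domain_collector", ("DomainName",), domain_collector),
]


def smoke_check(collector: Collector) -> int:
    """Run the three checks for one collector; return how many ran."""
    # 1. Unsupported input returns a list (exercises the early-return path).
    assert isinstance(collector.fn("NoSuchType", "x"), list)
    # 2. Metadata is well-formed.
    assert collector.name and collector.input_types
    # 3. Every declared input type is callable without raising.
    for input_type in collector.input_types:
        assert isinstance(collector.fn(input_type, "Example.COM"), list)
    return 3


total_checks = sum(smoke_check(c) for c in COLLECTORS)
```

With pytest, each of the three checks would typically be its own test function parametrized over the registered collectors, which is how 14 collectors yield 42 test cases.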
The new universal smoke test test_every_collector_handles_each_declared_input_type[whois_collector] exercised whois_collector with the canonical RFC 5737 documentation IP 192.0.2.1, which ipwhois treats as a poisoned input and raises IPDefinedError before any network call. The collector wasn't catching that exception, so the bare-bones probe call propagated it and CI failed again on test (3.12) and test (3.13).

Real callers will hit the same exception any time an analyst feeds in a private (10/8, 192.168/16), loopback (127/8), or documentation-net IP. The right behaviour is fail-closed: log and return None / [] so the KG doesn't pretend it has data it doesn't.

This commit catches IPDefinedError and BaseIpwhoisException in _lookup_ip, logging the reserved-IP case at debug and the broader failure case at warning. Per AGENTS.OE Override 4, the failure is surfaced (not silently retried); it just doesn't bubble as an exception.

Local: 109/109 osint tests pass; ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
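The fail-closed handling described in the commit above might look roughly like the sketch below. To stay runnable without ipwhois installed, it defines local stand-in exception classes that mirror the library's names and injects the lookup as a callable; lookup_ip, fake_rdap_lookup, and the logger name are hypothetical, and the real _lookup_ip may be structured differently.

```python
import ipaddress
import logging
from typing import Callable, Optional

logger = logging.getLogger("osint.whois_sketch")  # hypothetical logger name


class BaseIpwhoisException(Exception):
    """Local stand-in for the ipwhois base exception."""


class IPDefinedError(BaseIpwhoisException):
    """Local stand-in: raised for private, loopback, or documentation IPs."""


def fake_rdap_lookup(ip: str) -> dict:
    # Simulates the library refusing special-purpose addresses
    # before any network call is made.
    if not ipaddress.ip_address(ip).is_global:
        raise IPDefinedError(f"{ip} is a special-purpose address")
    return {"query": ip}


def lookup_ip(ip: str, rdap_lookup: Callable[[str], dict]) -> Optional[dict]:
    """Return lookup data, or None when the lookup cannot produce any."""
    try:
        return rdap_lookup(ip)
    except IPDefinedError as exc:
        # Expected for reserved ranges: log quietly, report no data.
        logger.debug("skipping reserved IP %s: %s", ip, exc)
    except BaseIpwhoisException as exc:
        # Broader library failure: surfaced in the log, not retried.
        logger.warning("lookup failed for %s: %s", ip, exc)
    return None
```

The narrower exception is caught first so the reserved-IP case can be logged at a lower level than genuine lookup failures.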
Addresses reviewer concerns on PR #147 that survived the amend (the rest were against scaffold code my amend already replaced).

bgp_collector edge direction inverted:
The ontology declares part_of_as as Netblock -> ASNumber (with IP families also valid as the from side). The collector was emitting from_entity_type="ASNumber", to_entity_type="Netblock", which inverts the declared direction. validate_relation would have rejected those edges. Swap the from/to fields; the entity emitted is still the Netblock (semantically: "given an ASN, here are netblocks that are part_of_as that ASN"). Caught by Copilot review of the original scaffold; the bug carried over into my amend.

entity_resolver canonicalise_ipv4 leading zeros:
ipaddress.IPv4Address("001.002.003.004") raises AddressValueError on Python 3.10+. Replace with explicit octet parsing (split on '.', int() each, validate 0-255). Raises ValueError on malformed input instead of silently corrupting it.

entity_resolver canonicalise_asn hex strip:
Old behavior stripped non-digits, so "0x3039" became "AS03039" rather than the expected hex-decoded value. Delegate to zettelforge.osint.ontology.canonicalize_asn, which uses int() and raises ValueError on non-decimal input (fail-loud).

entity_resolver._canonical_key used "Domain" (typo, scaffold relic):
The core ontology uses DomainName. Fix to match. Also added IPv6Address case (parity with IPv4Address).

entity_resolver late `from datetime import datetime` at module bottom:
Moved to module-top imports.

Same module: phone canonicalization preserved as-is (E.164 best-effort heuristic; documented now). Netblock canonicalization upgraded to use canonicalize_cidr so host bits get trimmed.

Local: 109/109 osint tests pass; ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
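The three canonicalization fixes described in the commit above can be illustrated with a stdlib-only sketch. The function names match those mentioned in the commit, but the bodies are assumptions about how such helpers could be written, not the code in entity_resolver.py or ontology.py.

```python
import ipaddress


def _parse_decimal(text: str) -> int:
    # Accept plain ASCII digits only, so "0x3039", "+5", or "1_000" fail loudly.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a decimal number: {text!r}")
    return int(text)


def canonicalise_ipv4(value: str) -> str:
    """Normalise dotted-quad IPv4 text, tolerating leading zeros."""
    parts = value.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 octets: {value!r}")
    octets = [_parse_decimal(part) for part in parts]
    if any(octet > 255 for octet in octets):
        raise ValueError(f"octet out of range: {value!r}")
    return ".".join(str(octet) for octet in octets)


def canonicalize_asn(value: str) -> str:
    """Normalise 'as12345' or '12345' to 'AS12345'; reject non-decimal input."""
    text = value.strip().upper()
    if text.startswith("AS"):
        text = text[2:]
    return f"AS{_parse_decimal(text)}"


def canonicalize_cidr(value: str) -> str:
    """Trim host bits, e.g. 10.1.2.3/8 becomes 10.0.0.0/8."""
    # strict=False tells ipaddress to mask host bits instead of raising.
    return str(ipaddress.ip_network(value.strip(), strict=False))
```

All three helpers raise ValueError on malformed input, which matches the fail-loud intent stated in the commit.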
Reviewer notes: addressing the bot reviews
The 32 unresolved threads were filed against scaffold commit
Live concerns now fixed:
No-longer-applicable concerns (scaffold code replaced wholesale by
I'll resolve the stale threads next. Re-reviews welcome on the live HEAD ( |
Summary
Implements the scaffold for RFC-0001: ZettelForge OSINT Layer.
29 files touched — all new; no existing logic modified.
What changed
New entity types (Phase 1–5)
New edge types
New files
Modified files
Reviewer notes