feat(capture): add S1 app parser registry with golden fixture tests #15

Open
heming-gmh wants to merge 2 commits into Einsia:main from heming-gmh:feat/s1-parser-registry-golden-tests

@heming-gmh commented Apr 26, 2026

Summary

Introduce a lightweight, baseline-plus-patches parser registry for the S1 capture enrichment layer, so the project can grow from one monolithic module (s1_parser.py) into a family of small, independently-testable app parsers — without changing a single byte of downstream behaviour.

Why

The current s1_parser.enrich(capture) merges three responsibilities in one module: generic AX→S1 extraction, browser URL extraction, and (on another branch) editor/terminal metadata. This works for a small surface area, but it does not scale as the project adds parsers for Slack, Linear, Office, Figma, Notion, terminals, and other high-value contexts.

This PR splits infrastructure from implementation.

The Baseline + Patches Model

  1. enrich() always computes a generic baseline (focused element, visible text, url=None).
  2. Registered AppParser implementations then run in priority order.
  3. Each parser may return an S1Patch that selectively overrides fields.
  4. Later parsers can compose on top of earlier ones — e.g. a Linear parser can match linear.app in the URL that the browser parser already extracted.

This is intentionally not a first-match-wins dispatch. Composition lets an app that is both a browser AND a special-purpose tool (e.g. a browser-based IDE) get both URL extraction AND app-specific metadata.

Benefits

  • Reviewable parsers: golden fixture tests (input.json → expected.json) let maintainers inspect one raw AX input and one expected S1 output for each parser PR.
  • Fault isolation: a broken parser never takes down the whole capture pipeline — exceptions are logged and the baseline fields still ship.
  • Zero downstream churn: focused_element, visible_text, and url are unchanged; timeline, FTS indexing, and dedup hashing all work without modification.
  • Low ceremony for contributors: a future parser author only needs to implement AppParser (4 members), register it, add a fixture directory, and open a PR.
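
To make the golden fixture idea concrete, here is a small sketch of a runner that walks fixture directories and compares enrich output against expected.json. It is not the runner in tests/test_s1_golden.py: the helper names, the toy enrich function, and the demo fixture names and contents are all invented for illustration, and the real runner is parametrised rather than a plain loop.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List


def load_json(path: Path) -> Any:
    # json.loads accepts bytes and detects UTF-8/16/32 itself.
    return json.loads(path.read_bytes())


def run_golden_fixtures(
    root: Path, enrich: Callable[[Dict[str, Any]], Dict[str, Any]]
) -> List[str]:
    """Return the names of fixture directories whose output mismatches."""
    failures = []
    for case_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        raw = load_json(case_dir / "input.json")
        expected = load_json(case_dir / "expected.json")
        if enrich(raw) != expected:
            failures.append(case_dir.name)
    return failures


def toy_enrich(raw: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for the real enrich(); the keys here are hypothetical.
    return {
        "visible_text": raw.get("text", "").strip(),
        "url": raw.get("address_bar"),
    }


def write_case(root: Path, name: str, raw: Dict[str, Any], expected: Dict[str, Any]) -> None:
    case_dir = root / name
    case_dir.mkdir()
    (case_dir / "input.json").write_text(json.dumps(raw), encoding="utf-8")
    (case_dir / "expected.json").write_text(json.dumps(expected), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_case(
        root,
        "demo_browser",
        {"text": " hello ", "address_bar": "https://example.com"},
        {"visible_text": "hello", "url": "https://example.com"},
    )
    write_case(
        root,
        "demo_plain",
        {"text": "notes"},
        {"visible_text": "notes", "url": None},
    )
    passing = run_golden_fixtures(root, toy_enrich)
    failing = run_golden_fixtures(root, lambda raw: {})
```

With this layout, a reviewer of a parser PR only needs to read the two JSON files in the new fixture directory to see what the parser is expected to do.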

What Changed

New files

| File | Purpose |
| --- | --- |
| app_parsers/__init__.py | Parser registry: register(), apply_parsers(), builtin auto-registration |
| app_parsers/base.py | FocusedElement, ParseContext, S1Fields, S1Patch, AppParser Protocol |
| app_parsers/browser.py | BrowserParser, migrated from s1_parser._extract_url; behaviour identical |
| tests/test_s1_registry.py | 5 unit tests: priority ordering, patch composition, exception resilience, app_context write semantics |
| tests/test_s1_golden.py | Parametrised golden fixture runner (4 fixtures) |
| tests/test_s1_integration.py | 4 integration tests: scheduler+enrich+FTS pipeline, timeline _format_events rendering |
| tests/fixtures/s1/*/ | 4 fixture directories (chrome_url, safari_bare_domain, generic_cursor_textarea, non_browser_url_textfield) |

Modified files

| File | Change |
| --- | --- |
| s1_parser.py | Remove _BROWSER_BUNDLES, _extract_url; rewrite enrich() as baseline + apply_parsers(); re-export FocusedElement from base.py |
| tests/conftest.py | Add autouse fixture to reset parser registry before each test |

Not in this PR

  • No Slack, Linear, Office, editor, or terminal parser behaviour.
  • No dynamic plugin loading or entry-point discovery.
  • No schema_version change.
  • No AppleScript, browser automation, OCR, or new permissions.

Test Plan

  • 9 existing test_s1_parser.py tests pass unchanged
  • 5 registry unit tests cover ordering, composition, error isolation, app_context semantics
  • 4 golden fixture tests cover browser + non-browser URL extraction scenarios
  • 4 integration tests cover scheduler→enrich→FTS and timeline _format_events contracts
  • Full test suite: 82/82 pass

Backwards Compatibility

enrich(capture) produces exactly the same focused_element, visible_text, and url values as before. Downstream consumers (timeline aggregator, FTS indexer, content-dedup hasher) require zero changes.

Introduce a lightweight, baseline-plus-patches parser registry for the
S1 capture enrichment layer. App-parser logic is now isolated behind
a simple protocol (AppParser) and composed in priority order, so
future app-specific parsers can be contributed one at a time without
touching the core enrichment code.

The generic baseline always produces focused_element, visible_text,
and url=None.  Registered parsers then run in priority order and may
selectively override fields via S1Patch.  This lets parsers compose —
for example a future Linear parser can match linear.app in the URL
that the browser parser already extracted.

Key changes:
- Move FocusedElement and parser base types to app_parsers/base.py
- Move browser URL extraction into BrowserParser (app_parsers/browser.py)
- Wire enrich() to run baseline + apply_parsers() from registry
- Add 4 golden fixture tests (chrome_url, safari_bare_domain,
  generic_cursor_textarea, non_browser_url_textfield)
- Add 5 registry unit tests (priority, composition, exception
  resilience, app_context write semantics)
- Add 4 integration tests (scheduler + FTS, timeline _format_events)
- Add autouse conftest fixture to reset registry between tests

Zero behaviour change to enrich() output.  All 82 existing + new tests
pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a modular app parser registry for the S1 enrichment pipeline, refactoring the existing logic into a more extensible architecture. It migrates browser-specific URL extraction to a dedicated parser and adds a comprehensive suite of unit, integration, and golden fixture tests. Feedback suggests implementing a deep merge for "app_context" to support better parser composition and switching to "read_bytes()" for JSON loading in tests to ensure robust encoding handling.
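
The two review suggestions can be illustrated with a short sketch. The deep_merge helper and the sample app_context dictionaries below are hypothetical and only show one way a recursive merge might behave; the PR's actual merge logic is not shown on this page. The JSON part relies on the standard-library fact that json.loads accepts bytes and detects the encoding itself, which is why reading fixtures with read_bytes() avoids depending on the platform default text encoding.

```python
import json
from typing import Any, Dict, Mapping


def deep_merge(base: Mapping[str, Any], patch: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a new dict where nested mappings are merged key by key.

    Non-mapping values in patch replace the value in base. The inputs
    are not mutated; values that are not merged are shared, not copied.
    """
    merged = dict(base)
    for key, value in patch.items():
        existing = merged.get(key)
        if isinstance(existing, Mapping) and isinstance(value, Mapping):
            merged[key] = deep_merge(existing, value)
        else:
            merged[key] = value
    return merged


# Hypothetical app_context written by an earlier parser...
earlier = {"browser": {"url": "https://linear.app/x", "tab": 1}}
# ...and a later parser's patch that touches the same top-level key.
later = {"browser": {"tab": 2}, "linear": {"issue": "ABC-1"}}

shallow = {**earlier, **later}       # later "browser" dict replaces the earlier one
merged = deep_merge(earlier, later)  # later "browser" keys are layered on top

# json.loads on bytes: no explicit decode step or encoding argument needed.
payload = '{"title": "caf\u00e9"}'.encode("utf-8")
decoded = json.loads(payload)
```

With a shallow merge the earlier parser's url key is lost, while the recursive version keeps it, which is the composition concern the review raises.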

Comment thread src/openchronicle/capture/app_parsers/__init__.py
Comment thread tests/test_s1_golden.py Outdated
Comment thread tests/test_s1_integration.py Outdated
Comment thread tests/test_s1_integration.py Outdated
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>