feat(capture): add S1 app parser registry with golden fixture tests#15
Open
heming-gmh wants to merge 2 commits intoEinsia:mainfrom
Open
feat(capture): add S1 app parser registry with golden fixture tests#15heming-gmh wants to merge 2 commits intoEinsia:mainfrom
heming-gmh wants to merge 2 commits intoEinsia:mainfrom
Conversation
Introduce a lightweight, baseline-plus-patches parser registry for the S1 capture enrichment layer. app-parser logic is now isolated behind a simple protocol (AppParser) and composed in priority order, so future app-specific parsers can be contributed one at a time without touching the core enrichment code. The generic baseline always produces focused_element, visible_text, and url=None. Registered parsers then run in priority order and may selectively override fields via S1Patch. This lets parsers compose — for example a future Linear parser can match linear.app in the URL that the browser parser already extracted. Key changes: - Move FocusedElement and parser base types to app_parsers/base.py - Move browser URL extraction into BrowserParser (app_parsers/browser.py) - Wire enrich() to run baseline + apply_parsers() from registry - Add 4 golden fixture tests (chrome_url, safari_bare_domain, generic_cursor_textarea, non_browser_url_textfield) - Add 5 registry unit tests (priority, composition, exception resilience, app_context write semantics) - Add 4 integration tests (scheduler + FTS, timeline _format_events) - autouse conftest fixture to reset registry between tests Zero behaviour change to enrich() output. All 82 existing + new tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a modular app parser registry for the S1 enrichment pipeline, refactoring the existing logic into a more extensible architecture. It migrates browser-specific URL extraction to a dedicated parser and adds a comprehensive suite of unit, integration, and golden fixture tests. Feedback suggests implementing a deep merge for "app_context" to support better parser composition and switching to "read_bytes()" for JSON loading in tests to ensure robust encoding handling.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
7 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Introduce a lightweight, baseline-plus-patches parser registry for the S1 capture enrichment layer, so the project can grow from one monolithic module (
s1_parser.py) into a family of small, independently-testable app parsers — without changing a single byte of downstream behaviour.Why
The current
s1_parser.enrich(capture)merges three responsibilities in one module: generic AX→S1 extraction, browser URL extraction, and (on another branch) editor/terminal metadata. This works for a small surface area, but it does not scale as the project adds parsers for Slack, Linear, Office, Figma, Notion, terminals, and other high-value contexts.This PR splits infrastructure from implementation:
The Baseline + Patches Model
enrich()always computes a generic baseline (focused element, visible text,url=None).AppParserimplementations then run in priority order.S1Patchthat selectively overrides fields.linear.appin the URL that the browser parser already extracted.This is intentionally not a first-match-wins dispatch. Composition lets an app that is both a browser AND a special-purpose tool (e.g. a browser-based IDE) get both URL extraction AND app-specific metadata.
Benefits
input.json→expected.json) let maintainers inspect one raw AX input and one expected S1 output for each parser PR.focused_element,visible_text, andurlare unchanged; timeline, FTS indexing, and dedup hashing all work without modification.AppParser(4 members), register it, add a fixture directory, and open a PR.What Changed
New files
app_parsers/__init__.pyregister(),apply_parsers(), builtin auto-registrationapp_parsers/base.pyFocusedElement,ParseContext,S1Fields,S1Patch,AppParserProtocolapp_parsers/browser.pyBrowserParser— migrated froms1_parser._extract_url, behaviour identicaltests/test_s1_registry.pyapp_contextwrite semanticstests/test_s1_golden.pytests/test_s1_integration.py_format_eventsrenderingtests/fixtures/s1/*/chrome_url,safari_bare_domain,generic_cursor_textarea,non_browser_url_textfield)Modified files
s1_parser.py_BROWSER_BUNDLES,_extract_url; rewriteenrich()as baseline +apply_parsers(); re-exportFocusedElementfrombase.pytests/conftest.pyautousefixture to reset parser registry before each testNot in this PR
schema_versionchange.Test Plan
test_s1_parser.pytests pass unchangedapp_contextsemantics_format_eventscontractsBackwards Compatibility
enrich(capture)produces exactly the samefocused_element,visible_text, andurlvalues as before. Downstream consumers (timeline aggregator, FTS indexer, content-dedup hasher) require zero changes.