Skip to content

Add JSONL source set helper#752

Closed
mariusvniekerk wants to merge 1 commit into
provider-facade-corefrom
provider-jsonl-source-set
Closed

Add JSONL source set helper#752
mariusvniekerk wants to merge 1 commit into
provider-facade-corefrom
provider-jsonl-source-set

Conversation

@mariusvniekerk

Copy link
Copy Markdown
Collaborator

Adds a reusable JSONL source-set helper for the provider migration. The helper discovers stable sorted source refs, produces watch plans, maps changed paths back to sources, resolves persisted file-path hints or raw filename-stem IDs, and fingerprints source files with optional content hashes.

This helper is intentionally not a Provider. Migrated providers can compose it as a named field and forward only the source methods they support, keeping source behavior explicit while avoiding repeated JSONL discovery and lookup implementations.

@roborev-ci

roborev-ci Bot commented Jun 19, 2026

Copy link
Copy Markdown

roborev: Combined Review (7a61f4d)

Summary verdict: one medium-severity correctness issue should be addressed before merge.

Medium

  • internal/parser/jsonl_source_set.go:125FindSource ignores req.FingerprintKey even though SourceRef.FingerprintKey is the documented persisted lookup key and the helper exposes a FingerprintKey option. Providers using virtual display paths, or file paths that are not sufficient for lookup, can fail to resolve an existing source despite receiving the exact persisted fingerprint key.
    • Fix: try req.FingerprintKey as a path when applicable, and otherwise discover sources and compare against SourceRef.FingerprintKey before falling back to raw session ID matching.

Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 2m55s), codex_security (codex/security, done, 42s) | Total: 3m46s

@mariusvniekerk mariusvniekerk force-pushed the provider-jsonl-source-set branch from 7a61f4d to 52c9e73 Compare June 19, 2026 18:45
@roborev-ci

roborev-ci Bot commented Jun 19, 2026

Copy link
Copy Markdown

roborev: Combined Review (52c9e73)

No Medium, High, or Critical findings to report.

Both review outputs are clean at the requested severity threshold; the only findings reported were Low severity and have been omitted.


Panel: ci_default_security | Synthesis: codex, 5s | Members: codex_default (codex/default, done, 3m38s), codex_security (codex/security, done, 38s) | Total: 4m21s

@roborev-ci

roborev-ci Bot commented Jun 20, 2026

Copy link
Copy Markdown

roborev: Review Unavailable (fa9cef2)

The review agent repeatedly failed to run (likely an agent or configuration error). roborev will try again on the next commit.

Last error: agent: claude-code failed stream: stream errors: You've hit your session limit · resets 5:50am (UTC): exit status 1

@mariusvniekerk mariusvniekerk force-pushed the provider-jsonl-source-set branch from fa9cef2 to 611917e Compare June 21, 2026 00:40
@roborev-ci

roborev-ci Bot commented Jun 21, 2026

Copy link
Copy Markdown

roborev: Combined Review (611917e)

No issues found.


Panel: ci_default_security | Synthesis: codex | Members: codex_default (codex/default, done, 5m47s), codex_security (codex/security, done, 1m7s) | Total: 6m54s

@mariusvniekerk

mariusvniekerk commented Jun 21, 2026

Copy link
Copy Markdown
Collaborator Author

@roborev-ci

roborev-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

roborev: Combined Review (216e363)

Synthesis unavailable. Showing individual review outputs.

claude-code — default (done)

I have verified the referenced facade types (SourceRef, SourceFingerprint, WatchRoot, ChangedPathRequest, FindSourceRequest, AgentType), the agent constants (AgentCodex, AgentGptme), and the helpers (isDirOrSymlink, IsValidSessionID) all exist with the expected semantics. The control flow (changed-path fallback classification, root scoping, dedup, fingerprinting) is correct and well-covered. A few minor notes follow.

Review Findings

Severity: Low
Location: internal/parser/jsonl_source_set.goshouldDescend / JSONLSourceSetOptions.FollowSymlinkDirs
Problem: The FollowSymlinkDirs option introduces a non-trivial code path (symlink resolution via isDirOrSymlink, then recursive descent into the symlinked directory). None of the tests exercise this option — every test leaves FollowSymlinkDirs false, so the symlink-descent branch and its interaction with pathIsUnderRoot/sourceForPath (where the watcher may report a target outside the configured root, as the doc comment warns) are uncovered.
Fix: Add a test that creates a symlinked directory under a root, enables FollowSymlinkDirs, and asserts that Discover finds the .jsonl files reached through the symlink (and, ideally, that SourcesForChangedPath behaves as documented for a path under the symlink).


Severity: Low
Location: internal/parser/jsonl_source_set.gosourceForPath
Problem: sourceForPath calls s.pathIncluded(root, path) and then immediately calls s.sourceRef(root, path, info), which itself calls s.pathIncluded(root, path) again for the same (root, path). The first check is redundant (unlike sourceForMissingPath, which legitimately needs its own pathIncluded call because it bypasses sourceRef).
Fix: Drop the standalone pathIncluded call in sourceForPath and let sourceRef perform the check, or add a brief comment if the early-out is intentional.

Summary

Adds a well-structured, value-immutable JSONLSourceSet helper for JSONL transcript discovery/watch/lookup/fingerprinting with correct root-scoping and changed-path fallback logic; only minor test-coverage and redundancy nits noted.


claude-code — security (done)

I've reviewed the new jsonl_source_set.go helper and verified the supporting types and validation functions it relies on.

Summary

This commit adds a single new file (internal/parser/jsonl_source_set.go) plus tests — a reusable helper that discovers, watches, locates, and fingerprints JSONL transcript files under configured roots. I focused on the security-relevant surfaces: path handling (traversal/symlink), the session-ID lookup path, and the file-read/hash operations.

The path handling is consistently defensive, and every filesystem operation is gated by a trust-boundary-appropriate check:

  • Path traversal is contained. Both sourceForPath and sourceForMissingPath filepath.Clean the input and require it to pass pathAllowedByRootpathIsUnderRoot, which uses filepath.Rel and rejects any ..-escaping relative path before any os.Stat/os.Open. A crafted change-event path like root/../../etc/passwd is cleaned and rejected.
  • The RawSessionID lookup cannot be used for traversal. IsValidSessionID (confirmed in discovery.go:831) restricts IDs to [A-Za-z0-9_-], so ../session is rejected; and even a valid ID is only string-compared against discovered session IDs, never used to construct a path.
  • Fingerprint's direct use of JSONLSource.Path (without re-validating in pathFromSource) is safe because Opaque is an in-memory, provider-owned any set only by sourceRefFromPath with an already-validated path; it is not deserialized from persisted/untrusted data (the DB persists the string keys, which are re-resolved through the validated sourceForPath).
  • Symlink directory following (FollowSymlinkDirsisDirOrSymlink, which os.Stats through the link) can ingest JSONL files outside a root, but only via a symlink an actor placed inside the user's own session directory — same-user write access reading the user's own readable files, with the result displayed back to that same user. No trust boundary is crossed (guidelines 8 and 12, and same-user local access is not an attacker boundary). Regular-file sources use os.Lstat/entry.Info() and reject symlinked files, so this is limited to the documented, opt-in legacy directory behavior.
  • The unbounded io.Copy hash read operates on local, user-owned files in a single-user tool; no DoS boundary applies (guideline 7).

This is additive helper code operating entirely on local, user-owned session files in a single-user tool. No attacker-controlled input reaches a security-sensitive sink across a trust boundary, and no existing control is weakened.

No issues found.

@mariusvniekerk mariusvniekerk force-pushed the provider-jsonl-source-set branch from 216e363 to 5e06aa5 Compare June 25, 2026 05:47
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (5e06aa5)

Summary verdict: One medium-severity correctness issue needs attention; no security issues were identified.

Medium

  • internal/parser/jsonl_source_set.go:185: FindSource uses the sorted Discover result for RawSessionID fallback, so duplicate session IDs across roots are resolved by lexicographic display path instead of configured root order. For roots like live sessions before archived_sessions, an archived or stale source can win.
    • Fix: Resolve raw session IDs using root/traversal order before sorting, or add an unsorted discovery path for lookup.

Panel: ci_default_security | Synthesis: codex, 6s | Members: codex_default (codex/default, done, 6m4s), codex_security (codex/security, done, 45s) | Total: 6m55s

Introduce the SourceSet bases (JSONL, directory JSONL, single-file,
multi-session container, sibling-metadata, SQLite fan-out), the
functional with*() option set, the generic SourceSet provider/factory
plumbing, and the virtual-path and source-identity helpers up front, so
every provider migration constructs its source set through options
instead of a struct literal.
@mariusvniekerk mariusvniekerk force-pushed the provider-jsonl-source-set branch from 5e06aa5 to 8a4b201 Compare June 25, 2026 17:27
@roborev-ci

roborev-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown

roborev: Combined Review (8a4b201)

Summary verdict: one Medium issue should be addressed; no High or Critical findings were reported.

Medium

  • Location: internal/parser/sqlite_fanout_source_set.go:116
  • Problem: SourcesForChangedPath returns immediately when ListMeta fails, before adding StoredSourcePaths. If the SQLite DB is deleted or missing, the helper drops the stored virtual sources needed for tombstone handling, so disappeared sessions may never be emitted for cleanup.
  • Fix: Treat missing DB metadata as an empty current set and still append matching StoredSourcePaths; only return hard metadata errors after preserving tombstone sources where possible.

Panel: ci_default_security | Synthesis: codex, 7s | Members: codex_default (codex/default, done, 6m54s), codex_security (codex/security, done, 1m5s) | Total: 8m6s

@mariusvniekerk

Copy link
Copy Markdown
Collaborator Author

Superseded by the family-grouped collapse of this stack into 10 green branches (#876#885), which preserves every commit while cutting the branch count for review. Closing in favor of that stack.

generated by a clanker

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant