Add optional AI-vision PDF extraction and harden BYO-key bridge end-to-end #14

Merged
DaScient merged 2 commits into main from copilot/harden-regex-extractors-ai-vision
May 1, 2026
Copilot AI commented May 1, 2026

Summary

Augments the deterministic regex/rule extractor with an optional, opt-in AI-vision pass that rasterizes PDF pages through the user's BYO-key LLM, and audits the BYO-key surface (storage, transport, redaction, validation, consent). The regex/rule path remains the source of truth at every stage — vision is strictly additive and degrades gracefully on failure.

Vision-augmented PDF extraction (additive, gated)

  • Extractors.rasterizePdfPages() — OffscreenCanvas/Canvas, ~1600 px long-edge cap, JPEG q 0.85, page-count + ~18 MB byte budgets, abort-aware. Logs only counts/dimensions/sha256 — never image bytes.
  • LLM.extractFactsFromImages() — strict-JSON system prompt, tolerant parser (code fences, trailing prose, trailing commas), schema-clamped fact map with confidence + evidence per field.
  • Extractors.mergeVisionFacts() — agree (5% tolerance) → regex+vision; disagree → keep regex + record conflict; miss+hit → vision; miss+miss → unchanged. Provenance carries runId, model, page, evidence.
  • Pipeline extract step runs vision after regex when gated by LLM.isConfigured() && isVisionCapable() && consent && useVision && PDF. Per-page progress events, AbortSignal-aware. On failure: warn, set visionStatus: 'failed', continue with regex-only. Each conflict becomes a REVIEW finding.
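
The merge rules listed for Extractors.mergeVisionFacts() can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the function name mergeFacts, the flat numeric fact maps, and the source labels are assumptions made for the example, and provenance fields such as runId, model, page, and evidence are left out.

```javascript
// Illustrative sketch only: names and data shapes are assumptions, not the PR's code.
// Facts are assumed to be flat maps of field name -> number (or null/undefined on a miss).
function mergeFacts(regexFacts, visionFacts, tolerance) {
  tolerance = tolerance == null ? 0.05 : tolerance; // 5% relative tolerance
  var merged = {};
  var conflicts = [];
  var keys = Object.keys(Object.assign({}, regexFacts, visionFacts));
  keys.forEach(function (key) {
    var r = regexFacts[key];
    var v = visionFacts[key];
    var hasR = r != null;
    var hasV = v != null;
    if (hasR && hasV) {
      var scale = Math.max(Math.abs(r), Math.abs(v));
      var agree = scale === 0 || Math.abs(r - v) / scale <= tolerance;
      // Regex stays the source of truth; vision only corroborates or conflicts.
      merged[key] = { value: r, source: agree ? 'regex+vision' : 'regex' };
      if (!agree) conflicts.push({ field: key, regex: r, vision: v });
    } else if (hasR) {
      merged[key] = { value: r, source: 'regex' };
    } else if (hasV) {
      merged[key] = { value: v, source: 'vision' }; // miss + hit: vision fills the gap
    }
    // miss + miss: the field stays absent (unchanged)
  });
  return { facts: merged, conflicts: conflicts };
}
```

In a pipeline like the one described, each entry in conflicts would presumably map to one REVIEW finding.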

BYO-key hardening

  • PE.LLM rewrite: schema-versioned config; localStorage or sessionStorage (Session-only); VISION_MODELS capability table → visionCapability() returns vision/text/unknown.
  • Network hygiene: every outbound LLM call routes through one chokepoint with referrerPolicy: 'no-referrer', cache: 'no-store', credentials: 'omit'. Per-request timeouts (60 s text / 120 s vision). Bounded retry+jitter on 429/5xx; no retry on 401/403. Errors normalized to { status, code, message, retryable }.
  • Header isolation: each provider receives only its expected auth header — no Authorization to Anthropic/Azure/Ollama, no x-api-key to OpenAI, no api-key outside Azure (regression-tested).
  • Base URL validation: https:// required for cloud providers; loopback-only for Ollama (opt-in for remote); query strings, fragments, and .. segments rejected.
  • Multimodal send() with per-provider image adapters (OpenAI/Azure: image_url; Anthropic: base64 blocks; Ollama: separate images: [base64,…]). ping() powers the Test connection button.
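
As a rough illustration of the retry and error-normalization policy in the Network hygiene bullet, the sketch below retries only on 429/5xx, never on 401/403, and uses capped exponential backoff with full jitter. The helper names, the code strings ('auth', 'rate_limited', and so on), and the option names are hypothetical; the PR only states the { status, code, message, retryable } shape.

```javascript
// Hypothetical helpers; only the { status, code, message, retryable } shape comes from the PR text.
function normalizeError(status, message) {
  var retryable = status === 429 || (status >= 500 && status <= 599);
  var code = status === 401 || status === 403 ? 'auth'
    : status === 429 ? 'rate_limited'
    : retryable ? 'server'
    : 'request';
  return { status: status, code: code, message: String(message || ''), retryable: retryable };
}

// Full jitter: a random delay in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelayMs(attempt, baseMs, capMs, rand) {
  var ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor((rand || Math.random)() * ceiling);
}

// doRequest returns a fetch-like response ({ ok, status, statusText }).
// opts.sleep is an injected function returning a promise, which keeps the loop testable.
async function withRetry(doRequest, opts) {
  for (var attempt = 0; ; attempt++) {
    var res = await doRequest();
    if (res.ok) return res;
    var err = normalizeError(res.status, res.statusText);
    // Bounded: stop on non-retryable errors or once attempts are exhausted.
    if (!err.retryable || attempt + 1 >= opts.maxAttempts) throw err;
    await opts.sleep(backoffDelayMs(attempt, opts.baseMs, opts.capMs, opts.rand));
  }
}
```

Injecting sleep and rand is a design choice for this sketch so the policy can be unit-tested without real timers.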

Logger redactor (defense in depth)

  • PE.Log strips any field whose name matches apiKey/api-key/Authorization/x-api-key/Bearer/token/secret/password/cookie/session, plus inline ****** / Basic … strings. Redaction is applied to both the buffer and the downloaded .log.
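
A key-name and inline-value redactor along these lines might look like the sketch below. It is an assumption-laden illustration rather than the PE.Log code: the function name redact, the '[REDACTED]' placeholder, and the exact inline pattern (Bearer/Basic followed by a token) are all chosen for the example.

```javascript
// Illustrative redactor; not the project's PE.Log implementation.
// Matches apiKey, api-key, x-api-key, Authorization, Bearer, token, secret, password, cookie, session.
var SENSITIVE_KEY = /api[-_]?key|authorization|bearer|token|secret|password|cookie|session/i;

function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value && typeof value === 'object') {
    var out = {};
    Object.keys(value).forEach(function (k) {
      // Redact by field name first; otherwise recurse into the value.
      out[k] = SENSITIVE_KEY.test(k) ? '[REDACTED]' : redact(value[k]);
    });
    return out;
  }
  if (typeof value === 'string') {
    // Inline credentials embedded in free text (assumed pattern).
    return value.replace(/\b(Bearer|Basic)\s+[A-Za-z0-9._~+\/=-]+/g, '$1 [REDACTED]');
  }
  return value;
}
```

Substring matching on key names errs toward over-redaction (for example, any field containing "session" is masked), which seems an acceptable trade for a defense-in-depth layer. The sketch does not guard against circular references.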

UI

  • AI Settings: vision toggle, max-pages input, capability badge (✅/⚠/❔), show/hide-key eye, Session-only toggle, Test connection, base-URL/key-shape validation, outbound-endpoints preview, Allow remote Ollama row, Revoke vision consent.
  • One-time Vision Consent dialog before the first vision run; revocable from AI Settings.
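
The consent and toggle controls above feed the gate described in the summary (LLM.isConfigured() && isVisionCapable() && consent && useVision && PDF). A minimal sketch of that gate, with a reason code for whichever condition blocks the run, could look like this; the ctx field names and reason strings are hypothetical.

```javascript
// Hypothetical gate; field names and reason strings are assumptions for illustration.
function visionGate(ctx) {
  if (!ctx.llmConfigured) return { run: false, reason: 'no-key' };
  if (ctx.capability !== 'vision') return { run: false, reason: 'not-vision-capable' };
  if (!ctx.consentGranted) return { run: false, reason: 'no-consent' };
  if (!ctx.useVision) return { run: false, reason: 'toggle-off' };
  if (ctx.fileType !== 'pdf') return { run: false, reason: 'not-pdf' };
  return { run: true, reason: null };
}
```

Returning a reason rather than a bare boolean makes it easier to explain in the UI or logs why a vision pass was skipped.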

Docs

  • README.md — text-only vs vision modes, capability table, caps/budgets, sent-vs-not-sent table, revoke path.
  • SECURITY.md — threat model with per-row mitigations (XSS/localStorage, log leakage, header leakage, path traversal, hung tabs, image-byte leakage).
  • TEST_PLANS.md — vision walkthrough on a public sample PDF including failure-path and revoke verification.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • New rule pack (new jurisdiction or code edition)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Accessibility improvement

Rule Pack Changes (if applicable)

N/A — no rule pack changes.

Testing

  • Tested manually in Chrome/Firefox/Safari
  • Tested with a sample PDF plan
  • Tested with a sample DXF plan
  • Tested with a sample DOCX plan
  • Rule pack JSON validates (run python3 -c "import json; json.load(open('assets/data/rules/your-pack.json'))")

Automated coverage extended from 27 → 79 tests, all passing under node --test. New suites:

  • tests/log.redaction.test.js — buffer never contains common credential keys/values.
  • tests/llm-bridge.test.js — capability detection, URL validation, error normalization, vision-JSON parsing, schema clamping, header isolation, send() capability gating, key-shape warnings, consent persistence.
  • tests/extractors.vision.test.js — merge agree/disagree/miss/miss, provenance, confidence threshold, malformed-vision safety.
  • tests/pipeline.vision.test.js — gating (no key / no consent / text-only / toggle off), successful merge, conflict→REVIEW, graceful degradation on vision failure.
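
For context on the vision-JSON parsing that tests/llm-bridge.test.js covers, a tolerant parser that handles code fences, trailing prose, and trailing commas might be shaped like the sketch below. The function name parseLooseJson is an assumption, and the approach is deliberately simple: the trailing-comma cleanup is a regex and could alter a string value that happens to contain ",}" or ",]".

```javascript
// Illustrative tolerant parser; not the PR's implementation.
function parseLooseJson(text) {
  var s = String(text);
  // Unwrap a Markdown code fence if present (\x60 is the backtick character).
  var fence = s.match(/\x60{3}(?:json)?\s*([\s\S]*?)\x60{3}/i);
  if (fence) s = fence[1];
  // Keep only the outermost {...} span, dropping leading and trailing prose.
  var start = s.indexOf('{');
  var end = s.lastIndexOf('}');
  if (start === -1 || end <= start) return null;
  s = s.slice(start, end + 1);
  // Drop trailing commas before a closing brace or bracket.
  s = s.replace(/,\s*([}\]])/g, '$1');
  try { return JSON.parse(s); } catch (e) { return null; }
}
```

Returning null instead of throwing matches the "malformed-vision safety" idea: the caller can treat a null result as a vision miss and continue with regex-only facts.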

Manual browser verification per the new walkthrough in TEST_PLANS.md is recommended before merge.

Checklist

  • My code follows the existing code style
  • I have performed a self-review of my changes
  • No new API keys, secrets, or credentials are committed
  • Relevant documentation has been updated

Screenshots (if applicable)

N/A — UI text changes only; existing dark/glass styling preserved. The new AI Settings panel adds vision toggle, capability badge, Test-connection button, Session-only toggle, and outbound-endpoints preview within the existing modal frame.

Copilot AI and others added 2 commits May 1, 2026 20:11
…, consent), update README/SECURITY/TEST_PLANS

Agent-Logs-Url: https://github.com/DaScient-Intelligence/Plan-Examiner/sessions/12d85c28-ffe2-4719-9c89-2471b6fcd098

Co-authored-by: DaScient <25983786+DaScient@users.noreply.github.com>
@DaScient DaScient marked this pull request as ready for review May 1, 2026 20:27
Copilot AI review requested due to automatic review settings May 1, 2026 20:27
@DaScient DaScient merged commit 8990937 into main May 1, 2026
8 checks passed
Copilot AI left a comment

Pull request overview

Adds an opt-in AI-vision augmentation to the existing deterministic PDF extraction pipeline, plus hardening of the BYO-key LLM bridge (storage, transport, validation, redaction, consent) and associated UI/docs/tests.

Changes:

  • Introduces optional PDF page rasterization + vision fact extraction, merges vision facts additively, and surfaces regex↔vision conflicts as REVIEW findings.
  • Reworks PE.LLM into a single outbound-request chokepoint with schema-versioned config, session-only storage option, base-URL validation, capability detection, multimodal send(), retries/timeouts, and header isolation.
  • Adds extensive node --test coverage and updates documentation (README/SECURITY/TEST_PLANS) for vision mode and BYO-key hardening.
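
The header isolation mentioned above can be pictured as a single function that returns exactly one auth header per provider. This is a sketch under assumptions: the provider identifiers and the function name authHeaders are hypothetical, while the header names follow the PR description (Authorization for OpenAI, x-api-key for Anthropic, api-key for Azure, none for Ollama).

```javascript
// Hypothetical sketch of per-provider auth headers; provider ids are assumptions.
function authHeaders(provider, apiKey) {
  switch (provider) {
    case 'openai': return { 'Authorization': 'Bearer ' + apiKey };
    case 'anthropic': return { 'x-api-key': apiKey };
    case 'azure': return { 'api-key': apiKey };
    case 'ollama': return {}; // local model server: no auth header sent
    default: throw new Error('Unknown provider: ' + provider);
  }
}
```

Building the header set from scratch per request, rather than deleting unwanted headers from a shared object, makes it harder for a key to leak to the wrong provider.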

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/pipeline.vision.test.js | Verifies pipeline vision gating, merge behavior, conflict→REVIEW finding, and graceful degradation. |
| tests/log.redaction.test.js | Ensures logger redaction prevents credential leakage in buffer and formatted output. |
| tests/llm-bridge.test.js | Covers LLM capability detection, URL validation, error normalization, header isolation, and vision JSON parsing/clamping. |
| tests/extractors.vision.test.js | Unit-tests vision/regex merge logic, provenance, conflicts, and malformed/low-confidence handling. |
| index.html | Expands AI Settings UI and adds a one-time vision consent modal. |
| assets/js/utils/log.js | Adds defense-in-depth redaction for sensitive keys/values in log messages and data payloads. |
| assets/js/app.js | Updates AI settings UX (test connection, session-only, vision toggle/pages, consent flow) and nav badge. |
| assets/js/agent/pipeline.js | Adds vision sub-step after regex extraction, merges facts, records provenance, and emits REVIEW findings on conflicts. |
| assets/js/agent/llm-bridge.js | Major rewrite: storage options, base URL validation, capability detection, unified fetch w/ retries+timeouts, multimodal support, ping/endpoints, vision JSON parsing and schema clamping. |
| assets/js/agent/extractors.js | Adds PDF rasterization helper and deterministic merge routine for vision facts. |
| TEST_PLANS.md | Adds manual vision-mode verification walkthrough and references new test suites. |
| SECURITY.md | Expands threat model and mitigations for BYO-key + vision mode. |
| README.md | Documents text-only vs vision mode, capability/budgets, outbound endpoints, and consent revocation. |

catch (e) { return { ok: false, reason: 'Not a valid URL.' }; }

if (u.search || u.hash) return { ok: false, reason: 'Base URL must not contain query string or fragment.' };
if (/(?:^|\/)\.\.(?:\/|$)/.test(urlStr)) return { ok: false, reason: 'Base URL must not contain "..".' };
Copilot AI May 1, 2026

validateBaseUrl() claims to reject paths containing "..", but the check runs against the raw urlStr and only matches a literal "..". Percent-encoded traversal segments (e.g. %2e%2e) or other normalized forms can bypass this and allow the base URL to resolve outside the intended path when you later append endpoints. Consider validating against u.pathname after decoding/normalizing path segments (and rejecting any segment that equals "..").

Suggested change
- if (/(?:^|\/)\.\.(?:\/|$)/.test(urlStr)) return { ok: false, reason: 'Base URL must not contain "..".' };
+ var pathSegments = u.pathname.split('/');
+ for (var i = 0; i < pathSegments.length; i++) {
+   var decodedSegment;
+   try { decodedSegment = decodeURIComponent(pathSegments[i]); }
+   catch (e2) { return { ok: false, reason: 'Not a valid URL.' }; }
+   if (decodedSegment === '..') return { ok: false, reason: 'Base URL must not contain "..".' };
+ }
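
One caveat may be worth checking before adopting a pathname-based check. In WHATWG-compliant parsers, such as the URL class in Node.js and current browsers, new URL() resolves dot segments during parsing, including percent-encoded forms like %2e%2e, so u.pathname may never contain a ".." segment by the time it is inspected. A possible variant, sketched below with a hypothetical helper name, decodes the segments of the raw input instead. It assumes urlStr has the scheme://authority/path shape and is not a drop-in replacement for validateBaseUrl().

```javascript
// Hypothetical helper (not part of the PR): detect ".." segments in the RAW input,
// before the URL parser has a chance to collapse them.
function hasDotDotSegment(urlStr) {
  // Drop query/fragment, then the scheme and authority, leaving the raw path.
  var raw = String(urlStr)
    .split(/[?#]/)[0]
    .replace(/^[a-z][a-z0-9+.-]*:\/\/[^\/\\]*/i, '');
  // Split on both slash styles; URL parsers treat "\" like "/" for http(s) URLs.
  var segments = raw.split(/[\/\\]/);
  for (var i = 0; i < segments.length; i++) {
    var decoded;
    try { decoded = decodeURIComponent(segments[i]); }
    catch (e) { return true; } // malformed escape: treat as unsafe
    if (decoded === '..') return true;
  }
  return false;
}
```

For example, hasDotDotSegment('https://api.example.com/v1/%2e%2e/admin') is true, whereas the parsed pathname for the same input is already normalized to '/admin'.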

Comment on lines +319 to +325
function _normalizeMessages(messages) {
  // Accept either plain string content or already-multimodal arrays.
  return messages.map(function (m) {
    return { role: m.role, content: m.content };
  });
}

Copilot AI May 1, 2026

_normalizeMessages() is defined but never used. Keeping unused helpers increases maintenance surface (and can confuse future refactors around message normalization). Either remove it or wire it into send() so there’s a single canonical message-shape normalization step.

Suggested change
- function _normalizeMessages(messages) {
-   // Accept either plain string content or already-multimodal arrays.
-   return messages.map(function (m) {
-     return { role: m.role, content: m.content };
-   });
- }

});

if (totalBytes + bytes > byteBudget) {
_log('warn', 'vision byte budget reached; truncating at page ' + i, { totalBytes: totalBytes, budget: byteBudget });
Copilot AI May 1, 2026

The truncation warning logs 'page ' + i, but i is zero-based in this loop. This will report the wrong page number (e.g. "page 0") when the byte budget is reached. Log i + 1 (and/or the attempted page index) so the message matches the actual page where truncation occurred.

Suggested change
- _log('warn', 'vision byte budget reached; truncating at page ' + i, { totalBytes: totalBytes, budget: byteBudget });
+ _log('warn', 'vision byte budget reached; truncating at page ' + (i + 1), { totalBytes: totalBytes, budget: byteBudget });
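
To make the off-by-one concrete, here is a small self-contained sketch of a byte-budget loop that reports the truncation point as a 1-based page number. The function name and return shape are hypothetical; only the budget comparison mirrors the excerpt under review.

```javascript
// Illustrative budget loop; names and return shape are assumptions.
// pageByteSizes[i] is the encoded size of page i + 1.
function selectPagesWithinBudget(pageByteSizes, byteBudget) {
  var included = [];
  var totalBytes = 0;
  var truncatedAtPage = null; // 1-based page number, or null if nothing was cut
  for (var i = 0; i < pageByteSizes.length; i++) {
    var bytes = pageByteSizes[i];
    if (totalBytes + bytes > byteBudget) {
      truncatedAtPage = i + 1; // convert the zero-based index for reporting
      break;
    }
    included.push(i + 1);
    totalBytes += bytes;
  }
  return { included: included, totalBytes: totalBytes, truncatedAtPage: truncatedAtPage };
}
```

With page sizes [5, 6, 7, 8] and a budget of 18, pages 1 to 3 fit exactly and truncation is reported at page 4.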

Comment on lines +367 to +370
var imgs = [];
var flat = messages.map(function (m) {
if (typeof m.content === 'string') return m;
var text = '';
Copilot AI May 1, 2026

In _ollamaImagesFromMessages, imgs is declared outside the per-message mapping and then appended to for every message. That means images can accumulate across multiple chat turns and get attached to later messages that didn’t include image parts (and potentially to assistant messages), which can produce an invalid/overshared payload for Ollama multimodal requests. Build an images array per message (or attach images only to the message that actually contains them) instead of reusing a shared array.

Suggested change
- var imgs = [];
- var flat = messages.map(function (m) {
-   if (typeof m.content === 'string') return m;
-   var text = '';
+ var flat = messages.map(function (m) {
+   if (typeof m.content === 'string') return m;
+   var text = '';
+   var imgs = [];
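
A fuller sketch of the per-message approach is shown below. It assumes the internal multimodal shape is OpenAI-style content parts ({ type: 'text' } and { type: 'image_url' } with a data URL), which the PR excerpt does not confirm, and it targets the images field that Ollama's chat API accepts on individual messages. The function name is hypothetical.

```javascript
// Hypothetical adapter; the input part shape is an assumption, not taken from the PR.
function toOllamaMessages(messages) {
  return messages.map(function (m) {
    if (typeof m.content === 'string') return { role: m.role, content: m.content };
    var text = '';
    var imgs = []; // scoped per message, so images never leak into later turns
    m.content.forEach(function (part) {
      if (part.type === 'text') {
        text += part.text;
      } else if (part.type === 'image_url') {
        // Strip the "data:<mime>;base64," prefix, keeping only the base64 payload.
        imgs.push(String(part.image_url.url).replace(/^data:[^,]*,/, ''));
      }
    });
    var out = { role: m.role, content: text };
    if (imgs.length) out.images = imgs; // attach only where images were present
    return out;
  });
}
```

Because imgs is created inside the map callback, a text-only follow-up turn or an assistant message carries no images field at all.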
