fix: green the release gate — tesseract local-only, PP-OCRv5 availability, OCR blockId, reproducible models#3
Land the first of the three-layer post-conversion verification system
(rule diff + SSIM + OCR readback). Introduces public/core/verification/
with shared block fingerprints, field-level diffSemanticDocs, and a
runVerificationStage orchestrator wired after the Repair Engine cycle.
- block-fingerprint.js: lift blockFingerprint/modelFingerprint out of
repair-engine (byte-for-byte identical) + getBlockKey/extractBlockFields
+ ROUND_TRIP_FORMATS single source
- rule-diff.js: diffSemanticDocs -> { identical, blockCounts, changed/
added/removedBlocks, fidelity, overallScore } with minor/major severity
- verification-stage.js: gating + same-format readback + md<->html
cross-format loopback + RULE_DIFF_DRIFT / RULE_DIFF_READBACK_FAILED
- format-registry: write qualityReport.ruleDiff + .verification envelope;
repair-engine roundTripDelta contract unchanged
- scripts/rule-diff-test.js (10 assertions) wired into npm test (21 total)
- guard scripts + MULTI_MODEL_ARCHITECTURE + DEVELOPMENT_TASKS synced
- two specs: P9-C overall + P9-C.1 sub-stage
npm test (21 scripts), git diff --check, release:prepare all pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Land the second of the three-layer post-conversion verification system.
SSIM uses visual-loopback semantics: for visual-preserving inputs
(pdf/png), rasterize the input page and output page and compare
structural similarity, writing qualityReport.ssim.
- ssim.js: self-implemented, zero-dependency SSIM core (rgbaToGrayscale,
resampleGrayscale box resample, computeSSIM windowed mean, compareImages)
- page-image-source.js: pixel-source abstraction (Node throws
VERIFICATION_IMAGE_SOURCE_UNAVAILABLE, browser auto-loads canvas impl,
setPageImageSource for tests); RASTERIZABLE_FORMATS = {pdf, png}
- page-image-source-browser.js: vendor pdfjs + canvas getImageData
- verification-stage: runSsimLayer + async runVerificationStageAsync that
merges the sync rule-diff base with the async SSIM layer; sync
runVerificationStage unchanged (qualityReport.ssim stays null)
- format-registry: extract _runRepairCycle + _assembleQuality; convert()
stays sync (rule-diff), convertAsync() uses async wrap (rule-diff + ssim)
- scripts/ssim-verification-test.js (12 assertions) wired into npm test (22)
- guards + MULTI_MODEL_ARCHITECTURE + DEVELOPMENT_TASKS synced
Rendering is stub-only this round (Node has no canvas); real PDF/PNG
fixtures and browser end-to-end deferred. No new npm deps.
npm test (22 scripts), git diff --check, release:prepare all pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
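The SSIM core described above can be illustrated with a minimal sketch. It is not the project's ssim.js: the function bodies, the non-overlapping 8x8 window, and the BT.601 luma weights are assumptions chosen for illustration; only the names rgbaToGrayscale and computeSSIM come from the commit message.

```javascript
// Hypothetical sketch of a zero-dependency SSIM; not the project's ssim.js.
function rgbaToGrayscale(rgba, width, height) {
  const gray = new Float64Array(width * height);
  for (let i = 0; i < width * height; i++) {
    // ITU-R BT.601 luma weights (assumed; the real module may differ).
    gray[i] = 0.299 * rgba[4 * i] + 0.587 * rgba[4 * i + 1] + 0.114 * rgba[4 * i + 2];
  }
  return gray;
}

// Mean SSIM over non-overlapping win x win windows of two equal-size grayscale images.
function computeSSIM(a, b, width, height, win = 8) {
  const C1 = (0.01 * 255) ** 2; // standard stabilizers for 8-bit data
  const C2 = (0.03 * 255) ** 2;
  let total = 0;
  let count = 0;
  for (let y0 = 0; y0 + win <= height; y0 += win) {
    for (let x0 = 0; x0 + win <= width; x0 += win) {
      let sa = 0, sb = 0, saa = 0, sbb = 0, sab = 0;
      for (let y = y0; y < y0 + win; y++) {
        for (let x = x0; x < x0 + win; x++) {
          const p = a[y * width + x];
          const q = b[y * width + x];
          sa += p; sb += q; saa += p * p; sbb += q * q; sab += p * q;
        }
      }
      const n = win * win;
      const ma = sa / n;
      const mb = sb / n;
      const va = saa / n - ma * ma;   // variance of a in this window
      const vb = sbb / n - mb * mb;   // variance of b in this window
      const cov = sab / n - ma * mb;  // covariance of a and b
      total += ((2 * ma * mb + C1) * (2 * cov + C2)) /
               ((ma * ma + mb * mb + C1) * (va + vb + C2));
      count++;
    }
  }
  // No full window fits: this sketch reports 1 by convention.
  return count ? total / count : 1;
}
```

Two identical pages score 1; any structural change pulls the mean below 1, which is the signal a visual-loopback layer would threshold on.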
Land the third and final layer of the post-conversion verification system.
OCR readback rasterizes the output (PDF) and re-reads it with the OCR
engine, comparing recognized text against the original SemanticDoc text
via a character-level multiset similarity, writing qualityReport.ocrReadback.
- ocr-readback.js: compareText (char-level multiset recall/precision/f1,
robust for CJK + OCR noise) + normalizeText + extractModelText +
runOcrReadbackLayer; reuses registered ocr-text engine + OCR pdf rasterizer
- verification-stage: runVerificationStageAsync dynamic-imports the readback
layer (keeps OCR off the sync convert path) and merges it into the envelope
- format-registry: _assembleQuality adds ocrReadback (null on sync path)
- scripts/ocr-readback-test.js (13 assertions) wired into npm test (23 total)
- guards + MULTI_MODEL_ARCHITECTURE + DEVELOPMENT_TASKS synced
Three layers (rule-diff + ssim + ocr-readback) now write a unified
qualityReport.{ruleDiff,ssim,ocrReadback} + verification envelope.
Rendering/OCR are stub-only this round; no new npm deps.
npm test (23 scripts), git diff --check, release:prepare all pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
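The character-level multiset comparison can be sketched as follows. This is an illustrative reading of the commit message, not the code in ocr-readback.js: the normalization rules (NFKC, whitespace removal, lowercasing) and the return shape are assumptions.

```javascript
// Hypothetical sketch; normalization choices are assumptions.
function normalizeText(text) {
  return text.normalize("NFKC").replace(/\s+/g, "").toLowerCase();
}

function countChars(text) {
  const counts = new Map();
  for (const ch of text) counts.set(ch, (counts.get(ch) || 0) + 1); // iterates code points
  return counts;
}

// Order-insensitive similarity: how many characters the two texts share as multisets.
function compareText(expected, recognized) {
  const e = normalizeText(expected);
  const r = normalizeText(recognized);
  const ec = countChars(e);
  const rc = countChars(r);
  let overlap = 0;
  for (const [ch, n] of ec) overlap += Math.min(n, rc.get(ch) || 0);
  const eLen = [...e].length;
  const rLen = [...r].length;
  const recall = eLen ? overlap / eLen : 1;
  const precision = rLen ? overlap / rLen : 1;
  const f1 = precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;
  return { recall, precision, f1 };
}
```

Because the comparison ignores order and whitespace, it tolerates the line re-flow and spacing noise typical of OCR output, at the cost of not detecting reordered text.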
Add a programmatic sample-corpus generator to stress-test conversion, layout, and the three-layer verification across every supported format at varied sizes (large tier >= 3MB). Binaries stay git-ignored per the existing programmatic-fixture policy.
- scripts/lib/sample-content.js: deterministic complex content builders (md/html/json/xml/csv/txt) with headings, nested lists, task items, aligned tables, multi-lang code, nested quotes, footnotes, images, CJK/RTL/emoji/entities; SIZE_TIERS + buildToTargetBytes
- scripts/lib/png-encode.js: minimal node:zlib PNG encoder (no new deps)
- scripts/generate-samples.js: emits md/html/txt/json/xml/csv natively, docx/pptx/epub/pdf/xlsx via project writers, png via encoder; writes MANIFEST.json with coverageGaps for doc/ofd (no writer); --tiers/--out
- scripts/sample-corpus-test.js: fast in-memory round-trip gate (24th npm test), no 3MB writes
- samples:generate npm script; samples/generated/ git-ignored
- samples/fixtures/README + DEVELOPMENT_TASKS documented
Verified large tier: text formats >=3MB, pdf 19MB, docx 13.6MB, xlsx 16MB, epub 4.9MB.
npm test (24 scripts), git diff --check, release:prepare pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…9-C.4)
The P9-C.1/2/3 layers compute qualityReport.{ruleDiff,ssim,ocrReadback}
plus a verification envelope, but transformContent discarded result.quality
and the old bottom drawer was already removed — so the core "post-conversion
verification" differentiator was invisible. Surface it.
- index.html: new collapsible #verificationReportPanel inside the output
panel (auto-repair verdict + rule-diff / SSIM / OCR-readback rows + warning
counts + active-layer badge); does not revive the removed drawer ids
- app.js: renderVerificationReport(quality) (textContent only, no innerHTML),
captures result.quality into currentConversionQuality, renders on both
text and binary paths, clears on reset
- styles.css: .verification-report/.verification-row/.verification-badge,
data-state ok/drift/skip coloring via :has() left border
- browser-smoke-test: assert #verificationReportPanel present; removed-drawer
negative assertions still hold
Display-only; no change to conversion core / verification-stage. Skipped
layers show their reason (honest: not-triggered != failed). No new deps.
npm test (24 scripts), git diff --check, release:prepare all pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
landing-view.js imports getKnownInputFormats from browser-transformer.js, but it was never re-exported (only defined in format-registry.js). In the browser this is a module-load SyntaxError that takes down landing-view.js and app.js, leaving the landing page blank below the header. Node-side tests never caught it because browser-smoke-test only asserted static HTML strings, not that the module graph loads.
- browser-transformer.js: import + re-export getKnownInputFormats
- browser-smoke-test.js: load browser-transformer/router/landing-view module graphs and assert key re-exports exist, so a missing re-export fails CI instead of silently blanking the UI
npm test (24 scripts) passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Research showed the roadmap's named advanced-OCR targets (PaddleOCR-VL / MinerU) are not embeddable under the project's constraints (browser/Tauri local, no cloud, 30-80MB default bundle, no Python runtime): the VLM has no mature ONNX/WebGPU path (~500MB + 1-2GB VRAM or vLLM), MinerU is a Python/vLLM tool. Per user confirmation, the built-in advanced-OCR target is now PP-OCRv5 (ONNX Runtime + WebGPU, WASM fallback); VLMs are marked as far-term/external.
- specs: 2026-05-29-p9d-advanced-ocr-research.md (with sources) + 2026-05-29-p9d1-paddle-ocr-skeleton-design.md
- paddle-ocr-engine.js: paddleOcrEngine (id paddleocr-v5, ocr-text/ocr-layout) implements OCREngine; isAvailable() false in Node; recognize() three-stage rejection (vendor-not-ready / model-missing / runtime-not-wired); markPaddleOcrVendorReady. No onnxruntime, no real inference this round.
- paddle-ocr-bootstrap.js: registers engine (after tesseract) + PP-OCRv5 ONNX ModelManifest (engine paddleocr, int8, det/cls/rec) as not-downloaded
- browser-transformer: import bootstrap + export paddle API
- ocr-baseline-test: pickForTask fallback set + 35th block (paddle skeleton)
- guards: local-security ALLOWED/STRICT; direction-test PP-OCRv5/ONNX/WebGPU
- direction docs reframed: PP-OCRv5 built-in, VLM far-term/external (kept "default bundle excludes GB-scale models")
Skeleton-first (mirrors P9-A.2 tesseract). npm test (24), git diff --check, release:prepare all pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Wire the ONNX Runtime layer for PP-OCRv5 advanced OCR, mirroring the tesseract P9-A.2 vendor+skeleton stage. Real det/cls/rec inference pipeline and CTC decode are deferred to P9-D.2.b (needs real models + dictionaries).
- scripts/sync-onnxruntime-vendor.js: sync ort*.mjs + *.wasm from node_modules/onnxruntime-web/dist to public/vendor/onnxruntime/; exit 0 if the optionalDependency is absent (does not block install/release)
- paddle-ocr-runtime.js: loadOnnxRuntime (same-origin vendor dynamic import, sets ort.env.wasm.wasmPaths same-origin, throws OCR_VENDOR_LOAD_FAILED in Node), pickExecutionProviders (navigator.gpu -> [webgpu,wasm] else [wasm]), createOcrSession/disposeOcrSession skeleton, PADDLE_VENDOR_PATHS
- paddle-ocr-engine.recognize: third stage now loads the runtime; in a browser with vendor+models it reports pipeline-not-wired (P9-D.2.b)
- package.json: onnxruntime-web optionalDependency + vendor:onnx + release:prepare runs the onnx vendor sync
- browser-transformer exports the runtime API
- guards: local-security recognizes public/vendor/onnxruntime/ + STRICT; direction-test asserts onnxruntime-web; release-readiness expects new release:prepare/vendor:onnx scripts
- ocr-baseline-test 36th block (EP selection + Node vendor-load reject)
Tauri CSP already covers ORT (wasm-unsafe-eval + worker-src blob: + connect-src 'self'); no change.
npm test (24), git diff --check, release:prepare all pass. No forced install of onnxruntime-web.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… fix
Let users import PP-OCRv5 det/cls/rec ONNX models into the local cache from the security center, flipping paddleOcrEngine from model-missing toward ready. Reuses the tesseract tessdata local-import pattern (file picker + SHA-256 + IndexedDB); the project's no-network STRICT guard forbids remote fetch, so "on-demand download" means local user import.
- security-center.js: renderPaddleActions for engine=paddleocr rows (import det.onnx/cls.onnx/rec.onnx + clear); importPaddleModel (sha256 -> store at paddleocr/v5/<file> -> ensureProbe; only flips to AVAILABLE + sets vendor ready when all three present, else reports which are missing); clearPaddleModels; click delegation for data-import-paddle/data-clear-paddle
- Fix latent bug: paddleOcrEngine/tesseractOCREngine stored readiness on a frozen instance prop, so ensureProbe()'s assignment throws "Cannot assign to read only property" in ES-module strict mode, which made the security center import flow fail silently (caught as "import failed"). Readiness now lives in module-level state; the engine objects stay frozen.
- ocr-baseline-test: 37th block (availability flips with model presence + vendor flag) + ensureProbe no-throw assertion on the frozen tesseract engine
- browser-smoke: assert #modelCacheFileInput present
- docs/DEVELOPMENT_TASKS synced
npm test (24), git diff --check, release:prepare all pass. No remote fetch; det/cls/rec inference + CTC decode deferred to P9-D.2.b.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
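The frozen-instance bug and its fix can be reproduced in a few lines. This is a sketch of the pattern only; the object shapes and the helper names (markReadySketch, assignReadyStrict) are hypothetical and do not mirror the real engine modules.

```javascript
// Hypothetical sketch of the readiness-state fix; names are illustrative.

// Before: readiness stored on a frozen object. Assigning to it throws in strict mode.
const legacyEngineSketch = Object.freeze({ id: "legacy", ready: false });

function assignReadyStrict(engine) {
  "use strict"; // ES modules are always strict; the directive makes this explicit here
  engine.ready = true; // TypeError: Cannot assign to read only property 'ready'
}

// After: readiness lives in module-level state; the engine object stays frozen
// and only exposes a getter-style method that reads that state.
let paddleReady = false;

const paddleEngineSketch = Object.freeze({
  id: "paddleocr-v5",
  isAvailable: () => paddleReady,
});

function markReadySketch(value) {
  paddleReady = Boolean(value);
}
```

In sloppy mode the same assignment fails silently, which is why the problem only surfaced as a caught "import failed" once the code ran as a strict ES module.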
Implement the real PP-OCRv5 inference pipeline as pure, fully-testable
functions plus an orchestrator that takes injectable sessions — so it runs
end-to-end in Node with mock sessions + synthetic tensors, no real models.
- paddle-ocr-pipeline.js: parseCharDictionary, preprocessForDetection
(ImageNet norm + multiple-of-32 + limit_side_len), preprocessForRecognition
(height 48, [-1,1]), dbPostProcess (threshold + 4-connected components +
axis-aligned bbox + box-score filter + scale-back + reading order),
ctcGreedyDecode (argmax -> collapse repeats -> drop blank -> map dict),
cropImageData, resizeRgba; runPaddlePipeline({ ort, det/cls/recSession,
imageData, dictionary }) -> OCRResult
- paddle-ocr-engine.recognize: in-browser decode (Image+canvas, no fetch ->
respects the no-network guard) -> load det/cls/rec buffers + optional dict
from cache -> createOcrSession x3 -> runPaddlePipeline; Node still rejects
at loadOnnxRuntime before any browser-only path
- browser-transformer exports the pipeline API
- scripts/paddle-ocr-pipeline-test.js (9 blocks incl. mock-session e2e
decoding "HI") wired into npm test (25th script)
- guards: local-security ALLOWED/STRICT; direction-test runPaddlePipeline/
ctcGreedyDecode
- resource-budget: public/core 256KB -> 320KB (pure-JS algorithm code, no
model weights; models stay in model-cache) with rationale
cls angle correction / minAreaRect+unclip / multi-column deferred. Real-model
end-to-end is browser/manual. npm test (25), git diff --check,
release:prepare all pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
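The ctcGreedyDecode steps listed above (argmax -> collapse repeats -> drop blank -> map dict) can be sketched as below. The signature, the flat [timeSteps x numClasses] layout, and the convention that class 0 is the blank and class k maps to dictionary[k - 1] are assumptions made for this example, not confirmed details of paddle-ocr-pipeline.js.

```javascript
// Hypothetical sketch of CTC greedy decoding over a flat logits array.
function ctcGreedyDecode(logits, timeSteps, numClasses, dictionary, blankIndex = 0) {
  let text = "";
  let prev = -1;
  for (let t = 0; t < timeSteps; t++) {
    // 1. argmax over the classes at this time step
    let best = 0;
    let bestVal = -Infinity;
    for (let c = 0; c < numClasses; c++) {
      const v = logits[t * numClasses + c];
      if (v > bestVal) { bestVal = v; best = c; }
    }
    // 2. collapse repeats and 3. drop blanks
    if (best !== blankIndex && best !== prev) {
      // 4. map the class index to a character (assumed: class k -> dictionary[k - 1])
      text += dictionary[best - 1] ?? "";
    }
    prev = best;
  }
  return text;
}
```

A blank between two identical classes is what lets CTC emit a doubled letter: the sequence H, blank, H decodes to "HH", while H, H decodes to "H".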
…eract)
Capstone for the PP-OCRv5 local advanced-OCR chain: prefer paddle over
tesseract when both are available, so PNG / scanned-PDF OCR stages
automatically use the higher-accuracy engine.
- ocr-engine.js pickForTask: priority-aware — sort task candidates by
priority (desc, default 0) and pick the first available; fall back to the
last-registered candidate when none are available (unchanged)
- paddle-ocr-engine priority 20, tesseract-engine priority 10
(placeholder stays 0) -> PP-OCRv5 wins when available
- PNG / scanned-PDF stages need no change: enhanceWithOCR /
runScannedPdfOCRStage resolve via pickForTask("ocr-text")
- ocr-baseline-test 38th block: priority selection unit + default-registry
preference flip (both available -> paddleocr-v5; remove paddle models ->
tesseract-zh-en)
The PP-OCRv5 chain (contract -> runtime -> model import -> inference
pipeline -> route preference) is now complete; real-model end-to-end is a
browser/manual step. npm test (25), git diff --check, release:prepare pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
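The priority-aware selection rule can be sketched as a pure function. The engine object shape (tasks, priority, isAvailable) and passing the registry as an array are assumptions for illustration; the real pickForTask in ocr-engine.js may be structured differently.

```javascript
// Hypothetical sketch of priority-aware engine selection.
function pickForTask(engines, task) {
  const candidates = engines.filter((e) => e.tasks.includes(task));
  if (candidates.length === 0) return null;
  // Sort a copy by priority, descending; missing priority counts as 0.
  // Array.prototype.sort is stable, so equal priorities keep registration order.
  const byPriority = [...candidates].sort((a, b) => (b.priority ?? 0) - (a.priority ?? 0));
  // First available engine wins; otherwise fall back to the last-registered candidate.
  return byPriority.find((e) => e.isAvailable()) ?? candidates[candidates.length - 1];
}
```

With priorities 20 (paddle), 10 (tesseract), and 0 (placeholder), paddle wins whenever it is available, and removing its models drops the choice back to tesseract without touching the call sites.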
The PP-OCRv5 chain (P9-D.1..D.4) is fully covered in Node via mock sessions and pure-function tests; real ONNX inference can only run in the browser/Tauri (WebGPU/WASM). Document the manual steps: install+vendor onnxruntime-web, import det/cls/rec ONNX + dictionary via the security center, verify engine priority, real PNG/scanned-PDF recognition, the verification report, and the no-network guarantee.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Make advanced OCR actually usable (P9-D.1..D.4 were contract/runtime/pipeline
scaffolding with no real model). Install the real runtime + models and
auto-load them on launch — no manual import needed.
- onnxruntime-web installed (optionalDependency 1.26.0); sync-onnxruntime-vendor
trimmed to the minimal set (ort.min.mjs + ort-wasm-simd-threaded.jsep.{mjs,wasm},
JSEP build covers WebGPU+WASM, ~25MB; drops ~68MB of redundant variants)
- real PP-OCRv5 mobile ONNX (det 4.8MB / cls 0.58MB / rec 16.6MB) + dict
downloaded to public/vendor/paddleocr/ (from the OnnxOCR PP-OCRv5 ONNX repo)
- paddle-default-models.js: ensurePaddleDefaultModels() idempotently fetches
the same-origin /vendor/paddleocr/ models into defaultOCRStorage (IndexedDB),
marks the engine ready + probes -> advanced OCR works out of the box; missing
vendor is silently skipped (manual security-center import still works)
- app.js calls it fire-and-forget on init; browser-transformer exports it
- .gitignore excludes the heavy onnxruntime + paddleocr vendors (reproducible
via npm i + vendor script + local download; bundled into the build from disk)
- guards: paddle-default-models ALLOWED/STRICT; isLocalVendorAsset trusts the
onnxruntime vendor (its minified bundle has CDN strings; the no-network
guarantee comes from pinning ort.env.wasm.wasmPaths same-origin + Tauri CSP
connect-src 'self'); public/vendor budget 6MB -> 64MB with rationale
Real ONNX inference runs in the browser/WebGPU only; Node stays mock/pure-fn
covered (npm test 25 green). Browser e2e steps: docs/PP_OCRV5_BROWSER_VERIFICATION.md
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Render LaTeX math in document previews. The markdown inline tokenizer previously ate math (\frac -> rac via backslash escaping; _ -> <em>), so math is now a protected token.
- inline-tokens.js: recognize $$...$$ (display) and $...$ (inline) BEFORE escape handling; capture content verbatim (no recursion/escaping). Inline heuristic (no inner-edge whitespace) excludes currency like "$5 and $10"
- semantic-inlines.js: createInlineMath + math node rendered in plainText/markdown/html; HTML emits <span class="t2f-math" data-tex="raw" data-display=".."> with a no-JS fallback ($tex$); markdown round-trips
- vendored KaTeX v0.17.0 (css + js + 20 woff2 fonts, ~592KB) at public/vendor/katex/; katex-render.js renderMathIn() typesets .t2f-math spans via the global katex (same-origin, zero network; silent fallback if katex absent)
- index.html/preview.html load katex css + defer js; app.js (3 preview sites) + preview.js call renderMathIn after rendering
- local-security: trust the katex vendor (its only http strings are W3C MathML/SVG namespaces, not network)
- scripts/latex-math-test.js (7 blocks: tokenization with backslash+underscore preserved, currency exclusion, katex-targetable span, md round-trip, plain-text + factory, md->html no <em>) -> npm test (26th)
npm test (26) passes; existing conversion snapshots survive the new token.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
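The protected-token idea and the currency heuristic can be sketched with a small standalone tokenizer. This is an assumption-laden illustration, not inline-tokens.js: the function name tokenizeMath and the token shape are hypothetical, and the real tokenizer handles many other inline constructs.

```javascript
// Hypothetical sketch: split text into plain-text and math tokens before any escape handling.
function tokenizeMath(text) {
  const tokens = [];
  let buf = "";
  let i = 0;
  const flush = () => {
    if (buf) { tokens.push({ type: "text", value: buf }); buf = ""; }
  };
  while (i < text.length) {
    if (text.startsWith("$$", i)) {
      // Display math: $$...$$ with non-empty content, captured verbatim.
      const end = text.indexOf("$$", i + 2);
      if (end > i + 2) {
        flush();
        tokens.push({ type: "math", display: true, tex: text.slice(i + 2, end) });
        i = end + 2;
        continue;
      }
    } else if (text[i] === "$") {
      // Inline math: $...$ whose content has no whitespace at either inner edge.
      const end = text.indexOf("$", i + 1);
      const tex = end > i + 1 ? text.slice(i + 1, end) : "";
      if (tex && !/^\s|\s$/.test(tex)) {
        flush();
        tokens.push({ type: "math", display: false, tex });
        i = end + 1;
        continue;
      }
    }
    buf += text[i];
    i++;
  }
  flush();
  return tokens;
}
```

In "$5 and $10" the only candidate span is "5 and ", which ends in a space, so the whole string stays plain text; in "$\frac{1}{2}$" the backslash and any underscores reach the renderer untouched because no escape or emphasis pass ever sees them.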
OCR was unreachable in the running app: convertWithWorker sent conversions to a Web Worker that uses sync convertContent (no OCR stage), and OCR needs main-thread canvas decode + onnxruntime which a Worker can't provide. So image and scanned-PDF inputs never triggered OCR.
Route png/pdf inputs to convertContentAsync on the main thread (text PDFs still take the normal path; images and scanned PDFs now run OCR), and ensure the bundled PP-OCRv5 models are loaded first (idempotent ensurePaddleDefaultModels). OCR has no dedicated button; it runs automatically when converting an image or scanned PDF to a text format.
npm test (26) passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…urate
Validated the inference pipeline against the real PP-OCRv5 ONNX models with onnxruntime-node (dev-only). Findings:
- rec is correct: decodes the PaddleOCR word_10 crop to "PAIN", output classes C=18385 exactly match the dictionary (blank + 18383 + space). RGB channel order + (x/255-0.5)/0.5 normalization are right (no BGR needed)
- det is correct: produces a probability map; 16 text boxes on a real document
- root cause of garbled output: dbPostProcess emitted tight axis-aligned bboxes without PP-OCR's "unclip" expansion, clipping character strokes -> wrong recognition (avgConf 0.41, CJK gibberish). Adding unclip (distance = area*ratio/perimeter, unclipRatio default 1.6) yields coherent, correct text on a full product-label document (avgConf 0.978)
- dbPostProcess: add unclipRatio param + outward box expansion
- pipeline unit test: assert exact bbox with unclipRatio:0, plus an expansion assertion
- scripts/paddle-ocr-integration-test.js: runs the real rec model via onnxruntime-node on a committed fixture (samples/ocr/word-PAIN.png), asserts "PAIN" + C==dict; gracefully skips (exit 0) when the dev deps / models / fixture are absent, so npm test stays green in CI. Wired in (27th)
Real rec hits 0.991 confidence locally. npm test (27) passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
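The unclip formula quoted above (distance = area*ratio/perimeter) can be applied to an axis-aligned box as in this sketch. The box shape { x0, y0, x1, y1 } and the function name unclipBox are assumptions; the real dbPostProcess may also clamp to image bounds and scale back to source coordinates.

```javascript
// Hypothetical sketch of outward box expansion for an axis-aligned detection box.
function unclipBox(box, unclipRatio = 1.6) {
  const w = box.x1 - box.x0;
  const h = box.y1 - box.y0;
  const perimeter = 2 * (w + h);
  // Ratio 0 (or a degenerate box) leaves the box exactly as detected.
  if (unclipRatio <= 0 || perimeter <= 0) return { ...box };
  const distance = (w * h * unclipRatio) / perimeter;
  return {
    x0: box.x0 - distance,
    y0: box.y0 - distance,
    x1: box.x1 + distance,
    y1: box.y1 + distance,
  };
}
```

For a 40x10 box the distance is 400 * 1.6 / 100 = 6.4 pixels per side, which is enough to recover ascenders and descenders that the shrunken detection map tends to cut off.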
…ntrol
Improve hard-case recognition and add explicit quality assessment, validated against the real PP-OCRv5 models with onnxruntime-node.
- orientation: the cls model outputs [1,2] softmax over [0°,180°] (upright->[1,0], flipped->[0,1]). runPaddlePipeline now applies it via interpretClsOutput, rotating 180° crops before rec. Validated: a fully upside-down document is read correctly (all 16 lines flipped, avgConf 0.976)
- vertical/sideways: tall boxes (h/w > verticalAspect) additionally try 90° cw/ccw and keep the highest-confidence read (robust for sideways labels; cost limited to the rare tall boxes)
- rotateImageData180 / rotateImageData90(dir) / interpretClsOutput pure helpers (unit-tested: 180 is its own inverse, 90 swaps dims + maps corners, cls threshold branches)
- quality control: runPaddlePipeline returns a quality summary (lineCount, averageConfidence, minConfidence, lowConfidenceLines, rotatedLines, grade high/medium/low); enhanceWithOCR records result.quality into metadata.modelReview.ocrQuality, alongside the existing per-line confidence + detectOCRLowConfidence validator + P9-C OCR-readback verification
- pipeline test: rotation/cls/quality assertions (12 blocks total)
Still limited: strong italic / ornate artistic text (needs minAreaRect + perspective-warp polygon boxes and a stronger rec model).
npm test (27) passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
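The two orientation helpers can be sketched as pure functions. The image shape ({ width, height, data } with RGBA bytes), the 0.9 threshold, and the return convention of interpretClsOutput (an angle in degrees) are assumptions for this example; the commit only states that the cls output is a softmax over [0°, 180°].

```javascript
// Hypothetical sketch; image objects mimic ImageData ({ width, height, data }).
function rotateImageData180(im) {
  const { width, height, data } = im;
  const out = new Uint8ClampedArray(data.length);
  const n = width * height;
  for (let i = 0; i < n; i++) {
    const j = n - 1 - i; // pixel i lands at the mirrored position
    out[4 * j] = data[4 * i];
    out[4 * j + 1] = data[4 * i + 1];
    out[4 * j + 2] = data[4 * i + 2];
    out[4 * j + 3] = data[4 * i + 3];
  }
  return { width, height, data: out };
}

// probs = [p(0°), p(180°)]; rotate only when the flipped class is confident enough.
function interpretClsOutput(probs, threshold = 0.9) {
  return probs[1] > probs[0] && probs[1] >= threshold ? 180 : 0;
}
```

Requiring a confidence threshold, and not just the larger class, avoids flipping crops on which the classifier is close to undecided; the exact threshold value here is illustrative.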
…rescued)
Per the steer: only convert the text content (no artistic-style preservation), focus on denoising. Add image denoising that cleans noisy/artistic-background images without hurting clean ones.
- denoiseImageData: per-channel 3x3 median filter (edge-preserving, removes salt-and-pepper); estimateNoiseLevel: fraction of isolated pixels that jump far from their 3x3 median (separates speckle from text edges)
- runPaddlePipeline options.denoise = "auto"(default)|true|false. AUTO denoises only when estimateNoiseLevel > threshold (default 0.05), because median filtering softens clean text (measured 0.974 -> 0.903), so clean images are never denoised
- validated on real models: clean doc (noise 0.016) stays untouched at 0.974; 15% salt-and-pepper (noise 0.10) is denoised, recovering detection from 4 -> 16 lines and avgConf 0.692 -> 0.832 (heavy noise collapses detection; denoise rescues it)
- quality summary adds denoised + noiseLevel
- pipeline test: noise-estimate / median / auto-gating assertions (14 blocks)
Explicitly not doing artistic-style preservation (text content only) or minAreaRect perspective.
npm test (27) passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
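The auto-gated denoise can be sketched as below. Only the 3x3 median, the "far from median" noise estimate, and the 0.05 default threshold come from the commit message; the jump value of 64, the edge clamping, and the helper names medianAt and autoDenoise are assumptions.

```javascript
// Hypothetical sketch; image objects mimic ImageData ({ width, height, data }).
function medianAt(data, width, height, x, y, ch) {
  const vals = [];
  for (let dy = -1; dy <= 1; dy++) {
    for (let dx = -1; dx <= 1; dx++) {
      const xx = Math.min(width - 1, Math.max(0, x + dx)); // clamp at the borders
      const yy = Math.min(height - 1, Math.max(0, y + dy));
      vals.push(data[4 * (yy * width + xx) + ch]);
    }
  }
  vals.sort((p, q) => p - q);
  return vals[4]; // middle of 9 samples
}

// Per-channel 3x3 median filter; alpha is copied unchanged.
function denoiseImageData(im) {
  const { width, height, data } = im;
  const out = new Uint8ClampedArray(data);
  for (let y = 0; y < height; y++)
    for (let x = 0; x < width; x++)
      for (let ch = 0; ch < 3; ch++)
        out[4 * (y * width + x) + ch] = medianAt(data, width, height, x, y, ch);
  return { width, height, data: out };
}

// Fraction of pixels that differ from their 3x3 median by more than `jump` in any channel.
function estimateNoiseLevel(im, jump = 64) {
  const { width, height, data } = im;
  let noisy = 0;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      for (let ch = 0; ch < 3; ch++) {
        const v = data[4 * (y * width + x) + ch];
        if (Math.abs(v - medianAt(data, width, height, x, y, ch)) > jump) { noisy++; break; }
      }
    }
  }
  return noisy / (width * height);
}

// "auto" mode: filter only when the estimate exceeds the threshold.
function autoDenoise(im, threshold = 0.05) {
  const noiseLevel = estimateNoiseLevel(im);
  const denoised = noiseLevel > threshold;
  return { image: denoised ? denoiseImageData(im) : im, denoised, noiseLevel };
}
```

Gating on the estimate keeps clean inputs byte-identical, which matches the reported observation that median filtering lowers confidence on already clean text.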
Skew is the biggest remaining recognition gap: a +10° rotated document drops from 16 lines/0.974 to 8 lines/0.535 (detection collapses). Auto-deskew fixes it.
- rotateImageDataByAngle(im, deg): arbitrary-angle nearest-neighbor rotation with canvas expansion + white background
- estimateSkewAngle(probData, mapW, mapH): shear-projection histogram variance on the (binarized, downsampled) det probability map -> the angle that makes text rows most horizontal (the deskew rotation)
- runPaddlePipeline: extract a detect(image) helper; options.deskew = true(default)|false. Estimate skew from the first det; if |est| >= minSkew (default 3°), rotate the image upright and re-detect, then proceed. Upright images estimate ~0 and skip the second det (zero overhead); rec always runs once
- validated on real models: +10° doc recovers 8/0.535 -> 16 lines/0.970 (skewApplied=10), -8° recovers too, upright untouched (skewApplied=0, 0.974)
- quality summary adds skewApplied
- pipeline test: arbitrary rotation + skew estimation (slanted synthetic rows detected ~angle, flat rows ~0), 16 blocks
npm test (27) passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
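A shear-projection skew estimate can be sketched on a binary mask. This is an illustration under assumptions, not the project's estimateSkewAngle: it takes an already binarized mask, scans whole-degree candidates in [-15°, 15°], uses the sum of squared bin counts as a stand-in for histogram variance (equivalent for ranking, since the bin count and pixel total are fixed), and returns the slope angle of the rows in image coordinates (y down). How that sign maps to the deskew rotation is left open.

```javascript
// Hypothetical sketch of skew estimation by sheared row projections.
function estimateSkewAngle(mask, width, height, maxDeg = 15, stepDeg = 1) {
  // Rows can shift by at most width * tan(maxDeg); pad the histogram on both sides.
  const offset = Math.ceil(width * Math.tan((maxDeg * Math.PI) / 180));
  let bestDeg = 0;
  let bestScore = -Infinity;
  for (let deg = -maxDeg; deg <= maxDeg; deg += stepDeg) {
    const slope = Math.tan((deg * Math.PI) / 180);
    const hist = new Float64Array(height + 2 * offset + 1);
    for (let y = 0; y < height; y++) {
      for (let x = 0; x < width; x++) {
        // Shear each set pixel back along the candidate slope and bin by row.
        if (mask[y * width + x]) hist[Math.round(y - x * slope) + offset]++;
      }
    }
    // Sharply peaked histograms (text rows aligned with the bins) score highest.
    let score = 0;
    for (const n of hist) score += n * n;
    // On ties, prefer the smaller rotation.
    if (score > bestScore || (score === bestScore && Math.abs(deg) < Math.abs(bestDeg))) {
      bestScore = score;
      bestDeg = deg;
    }
  }
  return bestDeg;
}
```

At the correct angle every pixel of a text row falls into one bin, so the score peaks there; a small error smears each row across several bins and the score drops quickly, which is what makes the estimate usable for gating a second detection pass.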
Address "enhance recognition of the file's internal text format": group OCR lines (with bboxes) into a structured document instead of one flat paragraph.
- ocr-structure.js: deriveOcrStructure(lines) sorts by reading order (y,x), detects headings by relative font size (line height >= 1.35x median -> heading, level 1-3 by ratio), splits paragraphs on large vertical gaps (> 0.7x median), joins same-paragraph lines (CJK tight, latin spaced); falls back to a single paragraph when bbox geometry is missing. blocksFromOcrResult walks pages
- png-ocr enhanceWithOCR now emits structured heading/paragraph blocks (replacing the old "join all 16 lines into one paragraph"); removed the dead helper + unused import
- validated on the real product label: 16 lines -> 4 blocks (title becomes a heading, body grouped into paragraphs)
- browser-transformer exports deriveOcrStructure / blocksFromOcrResult
- scripts/ocr-structure-test.js (7 blocks) wired into npm test (28th)
npm test (28) passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
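The grouping rules can be sketched as a single pass over sorted lines. The 1.35x heading ratio and the 0.7x gap factor come from the commit message; the line shape ({ text, bbox: { x, y, w, h } }), the level cut-offs (2.0 and 1.6), measuring gaps against the median line height, and the CJK character range are assumptions.

```javascript
// Hypothetical sketch of OCR line grouping; thresholds for heading levels are assumed.
const CJK_CHAR = /[\u3040-\u30ff\u3400-\u9fff\uf900-\ufaff]/;

function medianOf(values) {
  const s = [...values].sort((p, q) => p - q);
  const mid = s.length >> 1;
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// CJK text joins without a space; Latin text joins with one.
function joinLines(a, b) {
  const tight = CJK_CHAR.test(a.slice(-1)) || CJK_CHAR.test(b.slice(0, 1));
  return tight ? a + b : a + " " + b;
}

function deriveOcrStructure(lines) {
  if (lines.length === 0) return [];
  // Without geometry, fall back to one flat paragraph.
  if (lines.some((l) => !l.bbox)) {
    return [{ type: "paragraph", text: lines.map((l) => l.text).reduce(joinLines) }];
  }
  const sorted = [...lines].sort((a, b) => a.bbox.y - b.bbox.y || a.bbox.x - b.bbox.x);
  const medH = medianOf(sorted.map((l) => l.bbox.h));
  const blocks = [];
  let current = null;
  let prevBottom = null;
  for (const line of sorted) {
    const ratio = line.bbox.h / medH;
    const gap = prevBottom === null ? 0 : line.bbox.y - prevBottom;
    prevBottom = line.bbox.y + line.bbox.h;
    if (ratio >= 1.35) {
      const level = ratio >= 2 ? 1 : ratio >= 1.6 ? 2 : 3;
      blocks.push({ type: "heading", level, text: line.text });
      current = null; // a heading always closes the running paragraph
    } else if (!current || gap > 0.7 * medH) {
      current = { type: "paragraph", text: line.text };
      blocks.push(current);
    } else {
      current.text = joinLines(current.text, line.text);
    }
  }
  return blocks;
}
```

Using the median line height as the reference makes both rules scale-free: the same thresholds apply to a 300 dpi scan and a small screenshot.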
Make the OCR quality data (engine, confidence, grade, low-conf lines, skew, rotation, denoise) visible; it was computed but never shown.
- fix: _runRepairCycle overwrote the OCR-stage modelReview with the repair engine's, dropping the ocr/ocrQuality sub-objects. Merge them back so the recognition quality reaches result.quality.modelReview
- index.html: add an "OCR 识别质量" row (#verificationOcrRecognitionRow, hidden unless OCR ran)
- app.js renderVerificationReport reads quality.modelReview.ocr + .ocrQuality and renders engine / lines / confidence / grade / low-confidence / skew / rotation / denoised, with grade-driven coloring
- browser-smoke asserts the row; ocr-baseline adds a regression assertion that modelReview.ocr survives the default repair path
npm test (28) passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ification
Bump version 2.2.0 -> 2.3.0 across package.json / tauri.conf.json / Cargo.toml / Cargo.lock (desktop-shell-test enforces the sync).
- README: refreshed capabilities (real PP-OCRv5 OCR with orientation/skew/denoise/structure/quality, LaTeX rendering, three-layer verification), roadmap, limitations, test/sample sections; badge -> 2.3.0
- CHANGELOG: [2.3.0] entry covering the OCR pipeline, LaTeX, verification, Repair Engine + model-cache, sample generator, the four latent-bug fixes, and the PP-OCRv5 direction pivot
- RELEASE_NOTES_v2.3.0.md added
- DEVELOPMENT_TASKS acceptance command -> trans2former-2.3.0
- gitignore the tesseract vendor too (reproducible, like onnxruntime/paddleocr)
npm test (28), git diff --check, release:prepare (release/trans2former-2.3.0, git-ignored) all pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lity, OCR blockId, reproducible models - sanitize tesseract vendor bundles (rewrite CDN defaults to same-origin, drop .map, post-assert) so local-security-test passes - security-center: mark PP-OCRv5 vendor-ready before ensureProbe so imported det+rec flips to available - make cls (direction classifier) optional across availability gates (runtime/pipeline already tolerated null cls) - assign stable ocr-block ids at append time + map OCR lines to blocks by trimmed-text containment so low-confidence repair targets resolve (png + scan-pdf); drop the never-incrementing scan-pdf loop - add reproducible 'npm run vendor:paddle' (pinned ppu-paddle-ocr-models + committed SHA-256 manifest), wired into release:prepare - parseCharDictionary: don't double-append space when dict already ends with one (fixes off-by-one vs ppu dict); never trim() so U+3000 token is preserved - resource-budget: correct public/vendor budget to account for tesseract (~30MB) it previously omitted - align README/.gitignore/PP_OCRV5_BROWSER_VERIFICATION docs to bundled+reproducible policy; update/extend tests Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 23cd73a68d
export const PADDLE_OCR_MODEL_FILES = Object.freeze(["det.onnx", "cls.onnx", "rec.onnx"]);
// Required set: det (DB detection) + rec (CTC recognition). cls (orientation classifier) is optional: the pipeline
// already tolerates its absence at runtime (180° correction is skipped when clsSession is null), so it is excluded from the availability gate.
export const PADDLE_OCR_REQUIRED_FILES = Object.freeze(["det.onnx", "rec.onnx"]);
Require the PP-OCR dictionary before advertising readiness
When PP-OCRv5 models are supplied through the Security Center, the UI only offers det.onnx, cls.onnx, and rec.onnx, and this readiness check marks the engine available after just det+rec. In that manual-import path, or if the bundled dict.txt fetch fails, recognize() falls back to dictionary = [], so ctcGreedyDecode cannot map any non-blank class to text and the OCR stage returns empty output despite the engine being selected as ready. Include dict.txt in the required set/import flow or fail readiness until it is present.
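// Set vendor-ready first (the user has opted into PP-OCRv5), then probe; otherwise ensureProbe always returns false
// while the vendor flag is unset and the state can never flip. Loading the real onnxruntime runtime is still gated
// in recognize(). This matches the ordering in paddle-default-models.js and the tesseract import flow.
markPaddleOcrVendorReady(true);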
Do not mark Paddle ready without the ONNX runtime vendor
In environments where the user imports det/rec but public/vendor/onnxruntime/ort.min.mjs is absent (the sync script explicitly skips when the optional dependency is missing, and the vendor dir is gitignored), this line still marks PP-OCR as vendor-ready; after ensureProbe() it outranks a working Tesseract engine, then recognize() fails with OCR_VENDOR_LOAD_FAILED and the OCR stage does not fall back. Keep Paddle unavailable unless the ONNX runtime bundle is actually loadable, or fall back to the next available engine on runtime-load failure.
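Background
The review report listed 4 issues (2×P1, 2×P2); each was reproduced in the code and fixed. npm test is now fully green (29 passing, EXIT=0).
Fixes
- sync-tesseract-vendor.js sanitizes the bundles after copying (cdn.jsdelivr.net → same-origin /vendor), no longer copies .map, and asserts after sanitizing that no remote protocol remains. The CDN defaults were dead code anyway (the runtime always overrides them with same-origin paths).
- security-center calls markPaddleOcrVendorReady(true) before ensureProbe(), matching the paddle-default-models pattern.
- ocr-block-<absolute index> ids, plus a new mapLinesToBlockIds (monotonic cursor + trimmed-text containment) that backfills them on png/scan-pdf; removed the scanned-PDF dead loop that never incremented.
- npm run vendor:paddle (pinned ppu-paddle-ocr-models + committed SHA-256 manifest), wired into release:prepare; the cls orientation classifier is now optional (the runtime already tolerated null). Docs (README/.gitignore/verification checklist) aligned.
Collateral fixes (masked by the P1 red failures, uncovered this time)
- parseCharDictionary off-by-one: no duplicate space is appended when the dictionary already ends with a space line; the full-width space U+3000 is preserved (must not trim()). The real-model integration test decodes "PAIN" (conf 0.991, C aligned).
- resource-budget accounting corrected: the public/vendor cap previously omitted tesseract (~30MB) and has been adjusted to the actual size.
Verification
- npm run vendor:tesseract regenerates the sanitized bundles → local-security-test passes.
- npm run vendor:paddle → downloads det/rec/dict and the SHA-256 check passes; skips gracefully when offline.
- npm test fully green (29/29).
- docs/PP_OCRV5_BROWSER_VERIFICATION.md: manual verification.
Notes
- ….gitignore), rebuilt by vendor:paddle from the pinned source + SHA-256, and packaged with release:prepare.
🤖 Generated with Claude Code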