fix: green the release gate — tesseract local-only, PP-OCRv5 availability, OCR blockId, reproducible models by Vantalens · Pull Request #3 · Vantalens/Trans2Former

Vantalens · 2026-05-31T05:38:53Z

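Background

The review report listed 4 issues (2×P1, 2×P2). Each one was reproduced in code and fixed; npm test is now fully green (29 passing, EXIT=0).

Fixes

P1, test gate red (Tesseract CDN): after copying, sync-tesseract-vendor.js now sanitizes the bundle (cdn.jsdelivr.net → same-origin /vendor), no longer copies .map files, and asserts after sanitizing that no remote protocol remains. The CDN default was dead code anyway (at runtime it is always overridden with the same-origin path).
P1, importing PP-OCRv5 did not flip the engine to "available": security-center now calls markPaddleOcrVendorReady(true) before ensureProbe(), matching the paddle-default-models pattern.
P2, OCR blockId did not resolve to a block: appended blocks are pre-assigned a stable ocr-block-<absolute index> id, and a new mapLinesToBlockIds (monotonic cursor + trimmed-text containment) backfills the ids in png/scan-pdf; the never-incrementing loop in the scanned-PDF path was removed.
P2 + Open Question, model policy: settled as "bundled with the package + reproducible script". Added npm run vendor:paddle (pinned ppu-paddle-ocr-models + committed SHA-256 manifest) and wired it into release:prepare; the cls direction classifier is now optional (the runtime already tolerated null). Docs (README/.gitignore/verification checklist) are aligned.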
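
The blockId backfill described in the P2 fix (monotonic cursor plus trimmed-text containment) might look roughly like the sketch below. It is an illustration only, not the code in this PR: the real mapLinesToBlockIds signature is not shown on this page, so the `{ text }` line shape, the `{ id, text }` block shape, and the `null` result for unmatched lines are all assumptions.

```javascript
// Hypothetical sketch: map each OCR line to the id of the block containing its text.
function mapLinesToBlockIds(lines, blocks) {
  const ids = [];
  let cursor = 0; // monotonic: the search never moves backwards
  for (const line of lines) {
    const needle = String(line.text ?? "").trim();
    let found = null;
    if (needle) {
      for (let i = cursor; i < blocks.length; i += 1) {
        if (String(blocks[i].text ?? "").includes(needle)) {
          found = blocks[i].id;
          cursor = i; // stay on this block; later lines may belong to it too
          break;
        }
      }
    }
    ids.push(found); // null when no block at or after the cursor contains the line
  }
  return ids;
}

const blocks = [
  { id: "ocr-block-3", text: "Product Label" },
  { id: "ocr-block-4", text: "Net weight 500 g Keep dry" },
];
const lines = [
  { text: " Product Label " },
  { text: "Net weight 500 g" },
  { text: "Keep dry" },
];
const mapped = mapLinesToBlockIds(lines, blocks);
```

Because the cursor only moves forward, a repeated line is matched against the next block that contains it rather than an earlier one, which is presumably why a monotonic cursor was chosen.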

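Collateral fixes (masked by the red P1, uncovered this time)

parseCharDictionary off-by-one: when the dictionary already ends with a space line, do not append another space; preserve the full-width space U+3000 (must not trim()). The real-model integration test decodes "PAIN" (conf 0.991, class count C matches the dictionary).
resource-budget accounting fix: the public/vendor cap previously omitted tesseract (~30MB); adjusted to match reality.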
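
The parseCharDictionary rule above (no duplicate trailing space, no trim()) can be sketched as follows. This is a minimal guess at the behaviour, not the project's implementation: placing the CTC blank at index 0 and the `"<blank>"` placeholder string are assumptions made for illustration.

```javascript
// Hypothetical sketch of a PP-OCR character dictionary parser.
function parseCharDictionary(text) {
  // Split on newlines only. Strip a trailing "\r" but never trim():
  // "\u3000".trim() is "" in JavaScript, which would destroy the U+3000 token.
  const tokens = text
    .split("\n")
    .map((line) => (line.endsWith("\r") ? line.slice(0, -1) : line));
  // Drop empty entries produced by the file's final newline(s).
  while (tokens.length > 0 && tokens[tokens.length - 1] === "") tokens.pop();
  // Append the space class only when the dictionary does not already end with it.
  if (tokens[tokens.length - 1] !== " ") tokens.push(" ");
  // Assumption: index 0 is reserved for the CTC blank.
  return ["<blank>", ...tokens];
}

const withSpace = parseCharDictionary("a\nb\n \n");
const withoutSpace = parseCharDictionary("a\nb\n");
const fullWidth = parseCharDictionary("a\n\u3000\n");
```

Both `withSpace` and `withoutSpace` end up with the same four classes, which is the alignment the off-by-one fix is after.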

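Verification

npm run vendor:tesseract regenerates the sanitized bundle → local-security-test passes.
npm run vendor:paddle → downloads det/rec/dict and the SHA-256 check passes; skips gracefully when offline.
Full npm test is green (29/29).
The full browser/Tauri (WebGPU) path for real-model recognition should still be verified manually per docs/PP_OCRV5_BROWSER_VERIFICATION.md.

Notes

Model weights are not committed (.gitignore); vendor:paddle rebuilds them from a pinned source + SHA-256, and release:prepare packages them.
A fresh clone does not include cls by default (optional); if 180° orientation correction is needed, import it in the Security Center.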

🤖 Generated with Claude Code

Land the first of the three-layer post-conversion verification system (rule diff + SSIM + OCR readback). Introduces public/core/verification/ with shared block fingerprints, field-level diffSemanticDocs, and a runVerificationStage orchestrator wired after the Repair Engine cycle.

- block-fingerprint.js: lift blockFingerprint/modelFingerprint out of repair-engine (byte-for-byte identical) + getBlockKey/extractBlockFields + ROUND_TRIP_FORMATS single source
- rule-diff.js: diffSemanticDocs -> { identical, blockCounts, changed/added/removedBlocks, fidelity, overallScore } with minor/major severity
- verification-stage.js: gating + same-format readback + md<->html cross-format loopback + RULE_DIFF_DRIFT / RULE_DIFF_READBACK_FAILED
- format-registry: write qualityReport.ruleDiff + .verification envelope; repair-engine roundTripDelta contract unchanged
- scripts/rule-diff-test.js (10 assertions) wired into npm test (21 total)
- guard scripts + MULTI_MODEL_ARCHITECTURE + DEVELOPMENT_TASKS synced
- two specs: P9-C overall + P9-C.1 sub-stage

npm test (21 scripts), git diff --check, release:prepare all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Land the second of the three-layer post-conversion verification system. SSIM uses visual-loopback semantics: for visual-preserving inputs (pdf/png), rasterize the input page and output page and compare structural similarity, writing qualityReport.ssim.

- ssim.js: self-implemented, zero-dependency SSIM core (rgbaToGrayscale, resampleGrayscale box resample, computeSSIM windowed mean, compareImages)
- page-image-source.js: pixel-source abstraction (Node throws VERIFICATION_IMAGE_SOURCE_UNAVAILABLE, browser auto-loads canvas impl, setPageImageSource for tests); RASTERIZABLE_FORMATS = {pdf, png}
- page-image-source-browser.js: vendor pdfjs + canvas getImageData
- verification-stage: runSsimLayer + async runVerificationStageAsync that merges the sync rule-diff base with the async SSIM layer; sync runVerificationStage unchanged (qualityReport.ssim stays null)
- format-registry: extract _runRepairCycle + _assembleQuality; convert() stays sync (rule-diff), convertAsync() uses async wrap (rule-diff + ssim)
- scripts/ssim-verification-test.js (12 assertions) wired into npm test (22)
- guards + MULTI_MODEL_ARCHITECTURE + DEVELOPMENT_TASKS synced

Rendering is stub-only this round (Node has no canvas); real PDF/PNG fixtures and browser end-to-end deferred. No new npm deps. npm test (22 scripts), git diff --check, release:prepare all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Land the third and final layer of the post-conversion verification system. OCR readback rasterizes the output (PDF) and re-reads it with the OCR engine, comparing recognized text against the original SemanticDoc text via a character-level multiset similarity, writing qualityReport.ocrReadback.

- ocr-readback.js: compareText (char-level multiset recall/precision/f1, robust for CJK + OCR noise) + normalizeText + extractModelText + runOcrReadbackLayer; reuses registered ocr-text engine + OCR pdf rasterizer
- verification-stage: runVerificationStageAsync dynamic-imports the readback layer (keeps OCR off the sync convert path) and merges it into the envelope
- format-registry: _assembleQuality adds ocrReadback (null on sync path)
- scripts/ocr-readback-test.js (13 assertions) wired into npm test (23 total)
- guards + MULTI_MODEL_ARCHITECTURE + DEVELOPMENT_TASKS synced

Three layers (rule-diff + ssim + ocr-readback) now write a unified qualityReport.{ruleDiff,ssim,ocrReadback} + verification envelope. Rendering/OCR are stub-only this round; no new npm deps. npm test (23 scripts), git diff --check, release:prepare all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
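
The character-level multiset similarity named in this commit could be sketched as below. This is a minimal illustration under stated assumptions, not the project's compareText: NFKC normalization and whitespace stripping stand in for the real normalizeText, whose rules are not shown here, and the return shape is a guess.

```javascript
// Count characters by code point, ignoring whitespace (assumed normalization).
function countChars(text) {
  const counts = new Map();
  for (const ch of text.normalize("NFKC")) {
    if (/\s/u.test(ch)) continue;
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
  }
  return counts;
}

// Multiset overlap: order-insensitive, so reading-order shuffles do not hurt the score.
function compareText(reference, recognized) {
  const ref = countChars(reference);
  const rec = countChars(recognized);
  let overlap = 0;
  let refTotal = 0;
  let recTotal = 0;
  for (const [ch, n] of ref) {
    refTotal += n;
    overlap += Math.min(n, rec.get(ch) ?? 0);
  }
  for (const n of rec.values()) recTotal += n;
  const recall = refTotal > 0 ? overlap / refTotal : 0;
  const precision = recTotal > 0 ? overlap / recTotal : 0;
  const f1 = precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;
  return { recall, precision, f1 };
}

const exact = compareText("转换 OK", "转换OK");
const partial = compareText("abcd", "abxx");
```

Counting per character rather than per word avoids any need for CJK word segmentation, which fits the "robust for CJK + OCR noise" claim in the commit.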

Add a programmatic sample-corpus generator to stress-test conversion, layout, and the three-layer verification across every supported format at varied sizes (large tier >= 3MB). Binaries stay git-ignored per the existing programmatic-fixture policy.

- scripts/lib/sample-content.js: deterministic complex content builders (md/html/json/xml/csv/txt) with headings, nested lists, task items, aligned tables, multi-lang code, nested quotes, footnotes, images, CJK/RTL/emoji/entities; SIZE_TIERS + buildToTargetBytes
- scripts/lib/png-encode.js: minimal node:zlib PNG encoder (no new deps)
- scripts/generate-samples.js: emits md/html/txt/json/xml/csv natively, docx/pptx/epub/pdf/xlsx via project writers, png via encoder; writes MANIFEST.json with coverageGaps for doc/ofd (no writer); --tiers/--out
- scripts/sample-corpus-test.js: fast in-memory round-trip gate (24th npm test), no 3MB writes
- samples:generate npm script; samples/generated/ git-ignored
- samples/fixtures/README + DEVELOPMENT_TASKS documented

Verified large tier: text formats >=3MB, pdf 19MB, docx 13.6MB, xlsx 16MB, epub 4.9MB. npm test (24 scripts), git diff --check, release:prepare pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…9-C.4)

The P9-C.1/2/3 layers compute qualityReport.{ruleDiff,ssim,ocrReadback} plus a verification envelope, but transformContent discarded result.quality and the old bottom drawer was already removed, so the core "post-conversion verification" differentiator was invisible. Surface it.

- index.html: new collapsible #verificationReportPanel inside the output panel (auto-repair verdict + rule-diff / SSIM / OCR-readback rows + warning counts + active-layer badge); does not revive the removed drawer ids
- app.js: renderVerificationReport(quality) (textContent only, no innerHTML), captures result.quality into currentConversionQuality, renders on both text and binary paths, clears on reset
- styles.css: .verification-report/.verification-row/.verification-badge, data-state ok/drift/skip coloring via :has() left border
- browser-smoke-test: assert #verificationReportPanel present; removed-drawer negative assertions still hold

Display-only; no change to conversion core / verification-stage. Skipped layers show their reason (honest: not-triggered != failed). No new deps. npm test (24 scripts), git diff --check, release:prepare all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

landing-view.js imports getKnownInputFormats from browser-transformer.js, but it was never re-exported (only defined in format-registry.js). In the browser this is a module-load SyntaxError that takes down landing-view.js and app.js, leaving the landing page blank below the header. Node-side tests never caught it because browser-smoke-test only asserted static HTML strings, not that the module graph loads.

- browser-transformer.js: import + re-export getKnownInputFormats
- browser-smoke-test.js: load browser-transformer/router/landing-view module graphs and assert key re-exports exist, so a missing re-export fails CI instead of silently blanking the UI

npm test (24 scripts) passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Research showed the roadmap's named advanced-OCR targets (PaddleOCR-VL / MinerU) are not embeddable under the project's constraints (browser/Tauri local, no cloud, 30-80MB default bundle, no Python runtime): the VLM has no mature ONNX/WebGPU path (~500MB + 1-2GB VRAM or vLLM), MinerU is a Python/vLLM tool. Per user confirmation, the built-in advanced-OCR target is now PP-OCRv5 (ONNX Runtime + WebGPU, WASM fallback); VLMs are marked as far-term/external.

- specs: 2026-05-29-p9d-advanced-ocr-research.md (with sources) + 2026-05-29-p9d1-paddle-ocr-skeleton-design.md
- paddle-ocr-engine.js: paddleOcrEngine (id paddleocr-v5, ocr-text/ocr-layout) implements OCREngine; isAvailable() false in Node; recognize() three-stage rejection (vendor-not-ready / model-missing / runtime-not-wired); markPaddleOcrVendorReady. No onnxruntime, no real inference this round.
- paddle-ocr-bootstrap.js: registers engine (after tesseract) + PP-OCRv5 ONNX ModelManifest (engine paddleocr, int8, det/cls/rec) as not-downloaded
- browser-transformer: import bootstrap + export paddle API
- ocr-baseline-test: pickForTask fallback set + 35th block (paddle skeleton)
- guards: local-security ALLOWED/STRICT; direction-test PP-OCRv5/ONNX/WebGPU
- direction docs reframed: PP-OCRv5 built-in, VLM far-term/external (kept "default bundle excludes GB-scale models")

Skeleton-first (mirrors P9-A.2 tesseract). npm test (24), git diff --check, release:prepare all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Wire the ONNX Runtime layer for PP-OCRv5 advanced OCR, mirroring the tesseract P9-A.2 vendor+skeleton stage. Real det/cls/rec inference pipeline and CTC decode are deferred to P9-D.2.b (needs real models + dictionaries).

- scripts/sync-onnxruntime-vendor.js: sync ort*.mjs + *.wasm from node_modules/onnxruntime-web/dist to public/vendor/onnxruntime/; exit 0 if the optionalDependency is absent (does not block install/release)
- paddle-ocr-runtime.js: loadOnnxRuntime (same-origin vendor dynamic import, sets ort.env.wasm.wasmPaths same-origin, throws OCR_VENDOR_LOAD_FAILED in Node), pickExecutionProviders (navigator.gpu -> [webgpu,wasm] else [wasm]), createOcrSession/disposeOcrSession skeleton, PADDLE_VENDOR_PATHS
- paddle-ocr-engine.recognize: third stage now loads the runtime; in a browser with vendor+models it reports pipeline-not-wired (P9-D.2.b)
- package.json: onnxruntime-web optionalDependency + vendor:onnx + release:prepare runs the onnx vendor sync
- browser-transformer exports the runtime API
- guards: local-security recognizes public/vendor/onnxruntime/ + STRICT; direction-test asserts onnxruntime-web; release-readiness expects new release:prepare/vendor:onnx scripts
- ocr-baseline-test 36th block (EP selection + Node vendor-load reject)

Tauri CSP already covers ORT (wasm-unsafe-eval + worker-src blob: + connect-src 'self'); no change. npm test (24), git diff --check, release:prepare all pass. No forced install of onnxruntime-web.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
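
The execution-provider rule in this commit (navigator.gpu -> [webgpu,wasm] else [wasm]) is small enough to show directly. The sketch below assumes an injectable navigator-like argument so it can be exercised outside a browser; the real pickExecutionProviders signature is not shown on this page.

```javascript
// Prefer WebGPU when the navigator exposes a gpu object; always keep WASM as fallback.
// The `nav` parameter is an assumption added for testability.
function pickExecutionProviders(nav = globalThis.navigator) {
  return nav && nav.gpu ? ["webgpu", "wasm"] : ["wasm"];
}

const withGpu = pickExecutionProviders({ gpu: {} });
const withoutGpu = pickExecutionProviders({});
```

Listing "wasm" second in the WebGPU case matches the "WASM fallback" wording used for the PP-OCRv5 direction.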

… fix

Let users import PP-OCRv5 det/cls/rec ONNX models into the local cache from the security center, flipping paddleOcrEngine from model-missing toward ready. Reuses the tesseract tessdata local-import pattern (file picker + SHA-256 + IndexedDB): the project's no-network STRICT guard forbids remote fetch, so "on-demand download" means local user import.

- security-center.js: renderPaddleActions for engine=paddleocr rows (import det.onnx/cls.onnx/rec.onnx + clear); importPaddleModel (sha256 -> store at paddleocr/v5/<file> -> ensureProbe; only flips to AVAILABLE + sets vendor ready when all three present, else reports which are missing); clearPaddleModels; click delegation for data-import-paddle/data-clear-paddle
- Fix latent bug: paddleOcrEngine/tesseractOCREngine stored readiness on a frozen instance prop, so ensureProbe()'s assignment throws "Cannot assign to read only property" in ES-module strict mode, which made the security center import flow fail silently (caught as "import failed"). Readiness now lives in module-level state; the engine objects stay frozen.
- ocr-baseline-test: 37th block (availability flips with model presence + vendor flag) + ensureProbe no-throw assertion on the frozen tesseract engine
- browser-smoke: assert #modelCacheFileInput present
- docs/DEVELOPMENT_TASKS synced

npm test (24), git diff --check, release:prepare all pass. No remote fetch; det/cls/rec inference + CTC decode deferred to P9-D.2.b.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
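
The latent bug described in this commit is easy to reproduce in isolation. The sketch below uses a hypothetical engine shape (the `ready` prop and the arrow-function `isAvailable` are illustrative); it shows why assigning to a frozen object's property throws in strict mode and how module-level state avoids the problem.

```javascript
// Module-level readiness state; the engine object itself stays frozen.
let paddleVendorReady = false;

function markPaddleOcrVendorReady(ready) {
  paddleVendorReady = Boolean(ready);
}

// Hypothetical engine shape, for illustration only.
const engine = Object.freeze({
  id: "paddleocr-v5",
  ready: false, // legacy instance prop: read-only once the object is frozen
  isAvailable: () => paddleVendorReady,
});

// Reproduces the bug: writing to a frozen prop throws a TypeError in strict mode
// (ES modules are always strict; the directive makes this hold in scripts too).
function assignOnFrozen() {
  "use strict";
  try {
    engine.ready = true;
    return null;
  } catch (error) {
    return error;
  }
}

const failure = assignOnFrozen();
markPaddleOcrVendorReady(true); // the module-level flag flips without touching the object
```

In sloppy mode the same assignment is silently ignored, which is one reason this class of bug tends to surface only once code moves into ES modules.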

Implement the real PP-OCRv5 inference pipeline as pure, fully-testable functions plus an orchestrator that takes injectable sessions, so it runs end-to-end in Node with mock sessions + synthetic tensors, no real models.

- paddle-ocr-pipeline.js: parseCharDictionary, preprocessForDetection (ImageNet norm + multiple-of-32 + limit_side_len), preprocessForRecognition (height 48, [-1,1]), dbPostProcess (threshold + 4-connected components + axis-aligned bbox + box-score filter + scale-back + reading order), ctcGreedyDecode (argmax -> collapse repeats -> drop blank -> map dict), cropImageData, resizeRgba; runPaddlePipeline({ ort, det/cls/recSession, imageData, dictionary }) -> OCRResult
- paddle-ocr-engine.recognize: in-browser decode (Image+canvas, no fetch -> respects the no-network guard) -> load det/cls/rec buffers + optional dict from cache -> createOcrSession x3 -> runPaddlePipeline; Node still rejects at loadOnnxRuntime before any browser-only path
- browser-transformer exports the pipeline API
- scripts/paddle-ocr-pipeline-test.js (9 blocks incl. mock-session e2e decoding "HI") wired into npm test (25th script)
- guards: local-security ALLOWED/STRICT; direction-test runPaddlePipeline/ctcGreedyDecode
- resource-budget: public/core 256KB -> 320KB (pure-JS algorithm code, no model weights; models stay in model-cache) with rationale

cls angle correction / minAreaRect+unclip / multi-column deferred. Real-model end-to-end is browser/manual. npm test (25), git diff --check, release:prepare all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
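
The CTC greedy decode steps listed in this commit (argmax -> collapse repeats -> drop blank -> map dict) can be illustrated with a small sketch. The flat row-major `[steps × classes]` layout, the blank at index 0, and the averaged confidence are assumptions for the example, not details confirmed by this PR.

```javascript
// Hypothetical sketch of CTC greedy decoding over a flat [steps * classes] score array.
function ctcGreedyDecode(scores, steps, classes, dictionary, blankIndex = 0) {
  let text = "";
  let confidenceSum = 0;
  let kept = 0;
  let previous = -1;
  for (let t = 0; t < steps; t += 1) {
    // 1) argmax over the class axis for this time step
    let best = 0;
    let bestScore = scores[t * classes];
    for (let c = 1; c < classes; c += 1) {
      const score = scores[t * classes + c];
      if (score > bestScore) {
        bestScore = score;
        best = c;
      }
    }
    // 2) collapse repeats and 3) drop blanks, then 4) map the class to a character
    if (best !== previous && best !== blankIndex) {
      text += dictionary[best] ?? "";
      confidenceSum += bestScore;
      kept += 1;
    }
    previous = best;
  }
  return { text, confidence: kept > 0 ? confidenceSum / kept : 0 };
}

const dictionary = ["<blank>", "H", "I"];
// Per-step argmax: H, H, blank, I, I
const scores = Float32Array.from([
  0.1, 0.8, 0.1,
  0.1, 0.8, 0.1,
  0.9, 0.05, 0.05,
  0.1, 0.1, 0.8,
  0.1, 0.1, 0.8,
]);
const decoded = ctcGreedyDecode(scores, 5, 3, dictionary);
```

Because `previous` is updated even on blank steps, a blank between two identical classes yields the character twice, which is how CTC represents doubled letters.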

…eract)

Capstone for the PP-OCRv5 local advanced-OCR chain: prefer paddle over tesseract when both are available, so PNG / scanned-PDF OCR stages automatically use the higher-accuracy engine.

- ocr-engine.js pickForTask: priority-aware. Sort task candidates by priority (desc, default 0) and pick the first available; fall back to the last-registered candidate when none are available (unchanged)
- paddle-ocr-engine priority 20, tesseract-engine priority 10 (placeholder stays 0) -> PP-OCRv5 wins when available
- PNG / scanned-PDF stages need no change: enhanceWithOCR / runScannedPdfOCRStage resolve via pickForTask("ocr-text")
- ocr-baseline-test 38th block: priority selection unit + default-registry preference flip (both available -> paddleocr-v5; remove paddle models -> tesseract-zh-en)

The PP-OCRv5 chain (contract -> runtime -> model import -> inference pipeline -> route preference) is now complete; real-model end-to-end is a browser/manual step. npm test (25), git diff --check, release:prepare pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
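
The priority-aware selection described in this commit can be sketched as follows. The engine shape (`id`, `tasks`, `priority`, `isAvailable`), the `"ocr-placeholder"` id, and passing the registry as an array are assumptions made for illustration; only the selection rule, the priorities, and the other two engine ids come from the commit text.

```javascript
// Hypothetical sketch: pick the highest-priority available engine for a task.
function pickForTask(engines, task) {
  const candidates = engines.filter((engine) => engine.tasks.includes(task));
  if (candidates.length === 0) return null;
  // Sort a copy by priority, descending; a missing priority counts as 0.
  const ranked = [...candidates].sort((a, b) => (b.priority ?? 0) - (a.priority ?? 0));
  // Fall back to the last-registered candidate when none are available.
  return ranked.find((engine) => engine.isAvailable()) ?? candidates[candidates.length - 1];
}

let paddleAvailable = true;
let tesseractAvailable = true;
const registry = [
  { id: "ocr-placeholder", tasks: ["ocr-text"], isAvailable: () => false },
  { id: "tesseract-zh-en", tasks: ["ocr-text"], priority: 10, isAvailable: () => tesseractAvailable },
  { id: "paddleocr-v5", tasks: ["ocr-text"], priority: 20, isAvailable: () => paddleAvailable },
];

const bothAvailable = pickForTask(registry, "ocr-text").id;
paddleAvailable = false;
const paddleMissing = pickForTask(registry, "ocr-text").id;
tesseractAvailable = false;
const noneAvailable = pickForTask(registry, "ocr-text").id;
```

Sorting a copy keeps registration order intact, so the "last-registered" fallback still means what it did before priorities were introduced.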

The PP-OCRv5 chain (P9-D.1..D.4) is fully covered in Node via mock sessions and pure-function tests; real ONNX inference can only run in the browser/Tauri (WebGPU/WASM). Document the manual steps: install+vendor onnxruntime-web, import det/cls/rec ONNX + dictionary via the security center, verify engine priority, real PNG/scanned-PDF recognition, the verification report, and the no-network guarantee. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Make advanced OCR actually usable (P9-D.1..D.4 were contract/runtime/pipeline scaffolding with no real model). Install the real runtime + models and auto-load them on launch; no manual import needed.

- onnxruntime-web installed (optionalDependency 1.26.0); sync-onnxruntime-vendor trimmed to the minimal set (ort.min.mjs + ort-wasm-simd-threaded.jsep.{mjs,wasm}, JSEP build covers WebGPU+WASM, ~25MB; drops ~68MB of redundant variants)
- real PP-OCRv5 mobile ONNX (det 4.8MB / cls 0.58MB / rec 16.6MB) + dict downloaded to public/vendor/paddleocr/ (from the OnnxOCR PP-OCRv5 ONNX repo)
- paddle-default-models.js: ensurePaddleDefaultModels() idempotently fetches the same-origin /vendor/paddleocr/ models into defaultOCRStorage (IndexedDB), marks the engine ready + probes -> advanced OCR works out of the box; missing vendor is silently skipped (manual security-center import still works)
- app.js calls it fire-and-forget on init; browser-transformer exports it
- .gitignore excludes the heavy onnxruntime + paddleocr vendors (reproducible via npm i + vendor script + local download; bundled into the build from disk)
- guards: paddle-default-models ALLOWED/STRICT; isLocalVendorAsset trusts the onnxruntime vendor (its minified bundle has CDN strings; the no-network guarantee comes from pinning ort.env.wasm.wasmPaths same-origin + Tauri CSP connect-src 'self'); public/vendor budget 6MB -> 64MB with rationale

Real ONNX inference runs in the browser/WebGPU only; Node stays mock/pure-fn covered (npm test 25 green). Browser e2e steps: docs/PP_OCRV5_BROWSER_VERIFICATION.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Render LaTeX math in document previews. The markdown inline tokenizer previously ate math (\frac -> rac via backslash escaping; _ -> <em>), so math is now a protected token.

- inline-tokens.js: recognize $$...$$ (display) and $...$ (inline) BEFORE escape handling; capture content verbatim (no recursion/escaping). Inline heuristic (no inner-edge whitespace) excludes currency like "$5 and $10"
- semantic-inlines.js: createInlineMath + math node rendered in plainText/markdown/html; HTML emits <span class="t2f-math" data-tex="raw" data-display=".."> with a no-JS fallback ($tex$); markdown round-trips
- vendored KaTeX v0.17.0 (css + js + 20 woff2 fonts, ~592KB) at public/vendor/katex/; katex-render.js renderMathIn() typesets .t2f-math spans via the global katex (same-origin, zero network; silent fallback if katex absent)
- index.html/preview.html load katex css + defer js; app.js (3 preview sites) + preview.js call renderMathIn after rendering
- local-security: trust the katex vendor (its only http strings are W3C MathML/SVG namespaces, not network)
- scripts/latex-math-test.js (7 blocks: tokenization with backslash+underscore preserved, currency exclusion, katex-targetable span, md round-trip, plain-text + factory, md->html no <em>) -> npm test (26th)

npm test (26) passes; existing conversion snapshots survive the new token.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
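
The inline-math heuristic in this commit (no whitespace on the inner edges of `$...$`) can be approximated with a single regular expression. The sketch below is an assumption about how such a rule might be expressed; the project's tokenizer is not shown on this page and may well scan character by character instead.

```javascript
// Hypothetical sketch: find $$display$$ and $inline$ math, capturing the TeX verbatim.
function findMathTokens(text) {
  // Alternative 1: $$...$$ (display). Alternative 2: $...$ (inline) where the
  // content neither starts nor ends with whitespace, which rules out "$5 and $10".
  const pattern = /\$\$([\s\S]+?)\$\$|\$(?!\s)([^$\n]+?)(?<!\s)\$/g;
  const tokens = [];
  for (const match of text.matchAll(pattern)) {
    if (match[1] !== undefined) {
      tokens.push({ display: true, tex: match[1] });
    } else {
      tokens.push({ display: false, tex: match[2] });
    }
  }
  return tokens;
}

const inline = findMathTokens("see $x_1 + \\frac{a}{b}$ here");
const currency = findMathTokens("price $5 and $10");
const mixed = findMathTokens("$$E=mc^2$$ and $a$");
```

Capturing the content verbatim is what keeps `\frac` and `_` intact: no escape or emphasis handling ever sees the characters inside the token.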

OCR was unreachable in the running app: convertWithWorker sent conversions to a Web Worker that uses sync convertContent (no OCR stage), and OCR needs main-thread canvas decode + onnxruntime which a Worker can't provide. So image and scanned-PDF inputs never triggered OCR. Route png/pdf inputs to convertContentAsync on the main thread (text PDFs still take the normal path; images and scanned PDFs now run OCR), and ensure the bundled PP-OCRv5 models are loaded first (idempotent ensurePaddleDefaultModels). OCR has no dedicated button — it runs automatically when converting an image or scanned PDF to a text format. npm test (26) passes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…urate

Validated the inference pipeline against the real PP-OCRv5 ONNX models with onnxruntime-node (dev-only). Findings:

- rec is correct: decodes the PaddleOCR word_10 crop to "PAIN", output classes C=18385 exactly match the dictionary (blank + 18383 + space). RGB channel order + (x/255-0.5)/0.5 normalization are right (no BGR needed)
- det is correct: produces a probability map; 16 text boxes on a real document
- root cause of garbled output: dbPostProcess emitted tight axis-aligned bboxes without PP-OCR's "unclip" expansion, clipping character strokes -> wrong recognition (avgConf 0.41, CJK gibberish). Adding unclip (distance = area*ratio/perimeter, unclipRatio default 1.6) yields coherent, correct text on a full product-label document (avgConf 0.978)

Changes:

- dbPostProcess: add unclipRatio param + outward box expansion
- pipeline unit test: assert exact bbox with unclipRatio:0, plus an expansion assertion
- scripts/paddle-ocr-integration-test.js: runs the real rec model via onnxruntime-node on a committed fixture (samples/ocr/word-PAIN.png), asserts "PAIN" + C==dict; gracefully skips (exit 0) when the dev deps / models / fixture are absent, so npm test stays green in CI. Wired in (27th)

Real rec hits 0.991 confidence locally. npm test (27) passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
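
The unclip expansion from this commit (distance = area*ratio/perimeter, default ratio 1.6) works out as in the sketch below for an axis-aligned box. The `{ x0, y0, x1, y1 }` box shape and the function name are assumptions, and clamping to the image bounds is left out for brevity.

```javascript
// Hypothetical sketch: expand an axis-aligned box outward by the unclip distance.
function unclipBox(box, unclipRatio = 1.6) {
  const width = box.x1 - box.x0;
  const height = box.y1 - box.y0;
  const perimeter = 2 * (width + height);
  // A ratio of 0 (or a degenerate box) leaves the tight box unchanged.
  if (perimeter <= 0 || unclipRatio <= 0) return { ...box };
  const distance = (width * height * unclipRatio) / perimeter;
  return {
    x0: box.x0 - distance,
    y0: box.y0 - distance,
    x1: box.x1 + distance,
    y1: box.y1 + distance,
  };
}

// A 60x20 text line: area 1200, perimeter 160, so distance = 1200 * 1.6 / 160 = 12.
const tight = { x0: 100, y0: 50, x1: 160, y1: 70 };
const expanded = unclipBox(tight);
const unchanged = unclipBox(tight, 0);
```

For long, thin text lines the distance is close to 0.8 times the line height at the default ratio, enough to recover stroke tips that a tight box clips.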

…ntrol

Improve hard-case recognition and add explicit quality assessment, validated against the real PP-OCRv5 models with onnxruntime-node.

- orientation: the cls model outputs [1,2] softmax over [0°,180°] (upright->[1,0], flipped->[0,1]). runPaddlePipeline now applies it via interpretClsOutput, rotating 180° crops before rec. Validated: a fully upside-down document is read correctly (all 16 lines flipped, avgConf 0.976)
- vertical/sideways: tall boxes (h/w > verticalAspect) additionally try 90° cw/ccw and keep the highest-confidence read (robust for sideways labels; cost limited to the rare tall boxes)
- rotateImageData180 / rotateImageData90(dir) / interpretClsOutput pure helpers (unit-tested: 180 is its own inverse, 90 swaps dims + maps corners, cls threshold branches)
- quality control: runPaddlePipeline returns a quality summary (lineCount, averageConfidence, minConfidence, lowConfidenceLines, rotatedLines, grade high/medium/low); enhanceWithOCR records result.quality into metadata.modelReview.ocrQuality, alongside the existing per-line confidence + detectOCRLowConfidence validator + P9-C OCR-readback verification
- pipeline test: rotation/cls/quality assertions (12 blocks total)

Still limited: strong italic / ornate artistic text (needs minAreaRect + perspective-warp polygon boxes and a stronger rec model). npm test (27) passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…rescued)

Per the steer: only convert the text content (no artistic-style preservation), focus on denoising. Add image denoising that cleans noisy/artistic-background images without hurting clean ones.

- denoiseImageData: per-channel 3x3 median filter (edge-preserving, removes salt-and-pepper); estimateNoiseLevel: fraction of isolated pixels that jump far from their 3x3 median (separates speckle from text edges)
- runPaddlePipeline options.denoise = "auto"(default)|true|false. AUTO denoises only when estimateNoiseLevel > threshold (default 0.05), because median filtering softens clean text (measured 0.974 -> 0.903), so clean images are never denoised
- validated on real models: clean doc (noise 0.016) stays untouched at 0.974; 15% salt-and-pepper (noise 0.10) is denoised, recovering detection from 4 -> 16 lines and avgConf 0.692 -> 0.832 (heavy noise collapses detection; denoise rescues it)
- quality summary adds denoised + noiseLevel
- pipeline test: noise-estimate / median / auto-gating assertions (14 blocks)

Explicitly not doing artistic-style preservation (text content only) or minAreaRect perspective. npm test (27) passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
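
The auto-gated denoise in this commit can be sketched on a single grayscale channel as below. This is a simplified illustration, not the project's code: the commit describes a per-channel filter on RGBA data, and the `jump = 96` outlier cutoff, the function names, and the return shape are assumptions. Only the 3x3 median, the "fraction of outlier pixels" estimate, and the 0.05 auto threshold come from the commit text.

```javascript
// Median of the 3x3 neighbourhood around (x, y); coordinates are clamped at the borders.
function median3x3(gray, width, height, x, y) {
  const values = [];
  for (let dy = -1; dy <= 1; dy += 1) {
    for (let dx = -1; dx <= 1; dx += 1) {
      const yy = Math.min(height - 1, Math.max(0, y + dy));
      const xx = Math.min(width - 1, Math.max(0, x + dx));
      values.push(gray[yy * width + xx]);
    }
  }
  values.sort((a, b) => a - b);
  return values[4];
}

// Fraction of pixels that differ from their local median by more than `jump` (assumed cutoff).
function estimateNoiseLevel(gray, width, height, jump = 96) {
  let outliers = 0;
  for (let y = 0; y < height; y += 1) {
    for (let x = 0; x < width; x += 1) {
      if (Math.abs(gray[y * width + x] - median3x3(gray, width, height, x, y)) > jump) outliers += 1;
    }
  }
  return outliers / (width * height);
}

function medianFilter(gray, width, height) {
  const out = new Uint8ClampedArray(gray.length);
  for (let y = 0; y < height; y += 1) {
    for (let x = 0; x < width; x += 1) out[y * width + x] = median3x3(gray, width, height, x, y);
  }
  return out;
}

// mode: "auto" filters only when the estimated noise exceeds the threshold.
function denoiseAuto(gray, width, height, mode = "auto", threshold = 0.05) {
  const noiseLevel = estimateNoiseLevel(gray, width, height);
  const denoised = mode === true || (mode === "auto" && noiseLevel > threshold);
  return { data: denoised ? medianFilter(gray, width, height) : gray, denoised, noiseLevel };
}

const clean = new Uint8ClampedArray(100).fill(255); // 10x10 white page
const speckled = Uint8ClampedArray.from(clean);
for (const y of [1, 4, 7]) for (const x of [1, 4, 7]) speckled[y * 10 + x] = 0; // 9 isolated specks
const cleanResult = denoiseAuto(clean, 10, 10);
const noisyResult = denoiseAuto(speckled, 10, 10);
```

Gating on the estimate means a clean page is returned untouched, which matters because the commit measured a confidence drop (0.974 -> 0.903) when clean text was filtered anyway.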

Skew is the biggest remaining recognition gap: a +10° rotated document drops from 16 lines/0.974 to 8 lines/0.535 (detection collapses). Auto-deskew fixes it.

- rotateImageDataByAngle(im, deg): arbitrary-angle nearest-neighbor rotation with canvas expansion + white background
- estimateSkewAngle(probData, mapW, mapH): shear-projection histogram variance on the (binarized, downsampled) det probability map -> the angle that makes text rows most horizontal (the deskew rotation)
- runPaddlePipeline: extract a detect(image) helper; options.deskew = true(default)|false. Estimate skew from the first det; if |est| >= minSkew (default 3°), rotate the image upright and re-detect, then proceed. Upright images estimate ~0 and skip the second det (zero overhead); rec always runs once
- validated on real models: +10° doc recovers 8/0.535 -> 16 lines/0.970 (skewApplied=10), -8° recovers too, upright untouched (skewApplied=0, 0.974)
- quality summary adds skewApplied
- pipeline test: arbitrary rotation + skew estimation (slanted synthetic rows detected ~angle, flat rows ~0); 16 blocks

npm test (27) passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Address "enhance recognition of the file's internal text format": group OCR lines (with bboxes) into a structured document instead of one flat paragraph.

- ocr-structure.js: deriveOcrStructure(lines) sorts by reading order (y,x), detects headings by relative font size (line height >= 1.35x median -> heading, level 1-3 by ratio), splits paragraphs on large vertical gaps (> 0.7x median), joins same-paragraph lines (CJK tight, latin spaced); falls back to a single paragraph when bbox geometry is missing. blocksFromOcrResult walks pages
- png-ocr enhanceWithOCR now emits structured heading/paragraph blocks (replacing the old "join all 16 lines into one paragraph"); removed the dead helper + unused import
- validated on the real product label: 16 lines -> 4 blocks (title becomes a heading, body grouped into paragraphs)
- browser-transformer exports deriveOcrStructure / blocksFromOcrResult
- scripts/ocr-structure-test.js (7 blocks) wired into npm test (28th)

npm test (28) passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Make the OCR quality data (engine, confidence, grade, low-conf lines, skew, rotation, denoise) visible; it was computed but never shown.

- fix: _runRepairCycle overwrote the OCR-stage modelReview with the repair engine's, dropping the ocr/ocrQuality sub-objects. Merge them back so the recognition quality reaches result.quality.modelReview
- index.html: add an "OCR 识别质量" row (#verificationOcrRecognitionRow, hidden unless OCR ran)
- app.js renderVerificationReport reads quality.modelReview.ocr + .ocrQuality and renders engine / lines / confidence / grade / low-confidence / skew / rotation / denoised, with grade-driven coloring
- browser-smoke asserts the row; ocr-baseline adds a regression assertion that modelReview.ocr survives the default repair path

npm test (28) passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ification

Bump version 2.2.0 -> 2.3.0 across package.json / tauri.conf.json / Cargo.toml / Cargo.lock (desktop-shell-test enforces the sync).

- README: refreshed capabilities (real PP-OCRv5 OCR with orientation/skew/denoise/structure/quality, LaTeX rendering, three-layer verification), roadmap, limitations, test/sample sections; badge -> 2.3.0
- CHANGELOG: [2.3.0] entry covering the OCR pipeline, LaTeX, verification, Repair Engine + model-cache, sample generator, the four latent-bug fixes, and the PP-OCRv5 direction pivot
- RELEASE_NOTES_v2.3.0.md added
- DEVELOPMENT_TASKS acceptance command -> trans2former-2.3.0
- gitignore the tesseract vendor too (reproducible, like onnxruntime/paddleocr)

npm test (28), git diff --check, release:prepare (release/trans2former-2.3.0, git-ignored) all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…lity, OCR blockId, reproducible models

- sanitize tesseract vendor bundles (rewrite CDN defaults to same-origin, drop .map, post-assert) so local-security-test passes
- security-center: mark PP-OCRv5 vendor-ready before ensureProbe so imported det+rec flips to available
- make cls (direction classifier) optional across availability gates (runtime/pipeline already tolerated null cls)
- assign stable ocr-block ids at append time + map OCR lines to blocks by trimmed-text containment so low-confidence repair targets resolve (png + scan-pdf); drop the never-incrementing scan-pdf loop
- add reproducible 'npm run vendor:paddle' (pinned ppu-paddle-ocr-models + committed SHA-256 manifest), wired into release:prepare
- parseCharDictionary: don't double-append space when dict already ends with one (fixes off-by-one vs ppu dict); never trim() so U+3000 token is preserved
- resource-budget: correct public/vendor budget to account for tesseract (~30MB) it previously omitted
- align README/.gitignore/PP_OCRV5_BROWSER_VERIFICATION docs to bundled+reproducible policy; update/extend tests

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 23cd73a68d


chatgpt-codex-connector · 2026-05-31T05:42:31Z

+export const PADDLE_OCR_MODEL_FILES = Object.freeze(["det.onnx", "cls.onnx", "rec.onnx"]);
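+// Required set: det (DB detection) + rec (CTC recognition). cls (direction classifier) is optional: the pipeline already
+// tolerates its absence at runtime (180° correction is skipped when clsSession is null), so it is not part of the availability gate.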
+export const PADDLE_OCR_REQUIRED_FILES = Object.freeze(["det.onnx", "rec.onnx"]);


Require the PP-OCR dictionary before advertising readiness

When PP-OCRv5 models are supplied through the Security Center, the UI only offers det.onnx, cls.onnx, and rec.onnx, and this readiness check marks the engine available after just det+rec. In that manual-import path, or if the bundled dict.txt fetch fails, recognize() falls back to dictionary = [], so ctcGreedyDecode cannot map any non-blank class to text and the OCR stage returns empty output despite the engine being selected as ready. Include dict.txt in the required set/import flow or fail readiness until it is present.
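
One way to read the reviewer's suggestion is sketched below: treat dict.txt as a required file alongside det and rec. This illustrates the proposal only, not code from the PR; the names `REQUIRED_WITH_DICTIONARY`, `missingRequiredFiles`, and `isPaddleReady` are hypothetical.

```javascript
// Reviewer's proposed required set: dict.txt joins det + rec (cls stays optional).
const REQUIRED_WITH_DICTIONARY = Object.freeze(["det.onnx", "rec.onnx", "dict.txt"]);

// Returns the required files that are not yet in the local cache.
function missingRequiredFiles(cachedFiles) {
  const present = new Set(cachedFiles);
  return REQUIRED_WITH_DICTIONARY.filter((name) => !present.has(name));
}

// The engine only advertises readiness once nothing required is missing.
function isPaddleReady(cachedFiles) {
  return missingRequiredFiles(cachedFiles).length === 0;
}

const manualImport = ["det.onnx", "rec.onnx"]; // what the Security Center import offers today
const withDictionary = [...manualImport, "dict.txt"];
```

Returning the list of missing files rather than a bare boolean would also let the UI say which file to import next, much as the existing import flow reports missing models.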


chatgpt-codex-connector · 2026-05-31T05:42:31Z

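+        // Set vendor-ready first (the user has opted into PP-OCRv5), then probe; otherwise ensureProbe always
+        // returns false while the vendor flag is unset and the state can never flip. Actual onnxruntime loading
+        // is still gated in recognize(). Matches the ordering in paddle-default-models.js / the tesseract import flow.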
+        markPaddleOcrVendorReady(true);


Do not mark Paddle ready without the ONNX runtime vendor

In environments where the user imports det/rec but public/vendor/onnxruntime/ort.min.mjs is absent (the sync script explicitly skips when the optional dependency is missing, and the vendor dir is gitignored), this line still marks PP-OCR as vendor-ready; after ensureProbe() it outranks a working Tesseract engine, then recognize() fails with OCR_VENDOR_LOAD_FAILED and the OCR stage does not fall back. Keep Paddle unavailable unless the ONNX runtime bundle is actually loadable, or fall back to the next available engine on runtime-load failure.


Vantalens and others added 23 commits May 29, 2026 11:36

Copilot AI review requested due to automatic review settings May 31, 2026 05:38

Copilot started reviewing on behalf of Vantalens May 31, 2026 05:39 View session

Copilot AI reviewed May 31, 2026

chatgpt-codex-connector Bot reviewed May 31, 2026


Vantalens merged commit fe05805 into main May 31, 2026
1 check failed

Vantalens deleted the codex/p8-routing-release-baseline branch May 31, 2026 05:52
