fix: green the release gate — tesseract local-only, PP-OCRv5 availability, OCR blockId, reproducible models#3

Merged: Vantalens merged 23 commits into main from codex/p8-routing-release-baseline on May 31, 2026
@Vantalens

Owner

Background

The review report listed 4 issues (2×P1, 2×P2). Each one was reproduced against the code and fixed; npm test is now fully green (29 items passing, EXIT=0).

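Fixes

  • P1 Test gate red (Tesseract CDN): after copying, sync-tesseract-vendor.js now sanitizes the bundles (cdn.jsdelivr.net → same-origin /vendor), no longer copies .map files, and asserts after sanitizing that no remote protocol strings remain. The CDN default was dead code anyway (at runtime it is always overridden with the same-origin path).
  • P1 PP-OCRv5 import does not flip to "available": security-center now calls markPaddleOcrVendorReady(true) before ensureProbe(), matching the paddle-default-models pattern.
  • P2 OCR blockId does not resolve to a block: appended blocks are pre-assigned a stable ocr-block-<absolute index> id, and a new mapLinesToBlockIds (monotonic cursor + trimmed-text containment) backfills the ids in png/scan-pdf; the scanned-PDF loop that never incremented is removed.
  • P2 + Open Question, model policy: settled on "bundled with the package + reproducible script". Added npm run vendor:paddle (pins ppu-paddle-ocr-models + commits a SHA-256 manifest) and wired it into release:prepare; the cls direction classifier is now optional (the runtime already tolerated null). Docs (README/.gitignore/verification checklist) are aligned.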
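
The blockId backfill in the P2 item above (monotonic cursor plus trimmed-text containment) could look roughly like the sketch below. The PR text does not show the body of mapLinesToBlockIds, so the signature, the `{ id, text }` block shape, and the `null` result for unmatched lines are assumptions made for illustration.

```javascript
// Hypothetical sketch, not the project's actual implementation.
// Maps each OCR line to the id of the first block (at or after the cursor)
// whose text contains the trimmed line text.
function mapLinesToBlockIds(lines, blocks) {
  const ids = [];
  let cursor = 0; // monotonic: never moves backwards, so matches follow reading order
  for (const line of lines) {
    const needle = String(line.text ?? "").trim();
    let found = null;
    if (needle) {
      for (let i = cursor; i < blocks.length; i++) {
        if (String(blocks[i].text ?? "").includes(needle)) {
          found = blocks[i].id;
          cursor = i; // stay on this block: later lines may belong to it too
          break;
        }
      }
    }
    ids.push(found); // null when no block contains the line
  }
  return ids;
}

const sampleBlocks = [
  { id: "ocr-block-0", text: "Product Label" },
  { id: "ocr-block-1", text: "Net weight 500 g Keep dry" },
];
const sampleLines = [
  { text: " Product Label " },
  { text: "Net weight 500 g" },
  { text: "Keep dry" },
  { text: "???" },
];
console.log(mapLinesToBlockIds(sampleLines, sampleBlocks));
```

Keeping the cursor on the matched block (rather than advancing past it) lets several consecutive lines resolve to the same paragraph block, which is what a low-confidence repair target needs.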

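Collateral fixes (previously masked by the red P1 gate, uncovered by this change)

  • parseCharDictionary off-by-one: when the dictionary already ends with a space line, a second space is no longer appended; the full-width space U+3000 is preserved (so trim() must not be used). The real-model integration test decodes "PAIN" (conf 0.991, class count C matches the dictionary).
  • resource-budget accounting fix: the public/vendor cap previously left out tesseract (~30MB); it is now adjusted to the real size.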
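
To make the off-by-one concrete, here is a minimal sketch of a dictionary parser that follows the two rules stated above: append the trailing space token only when it is missing, and never call trim(). The `"<blank>"` placeholder at index 0 and the exact line-splitting rules are assumptions; the real parseCharDictionary may differ.

```javascript
// Hypothetical sketch of the dictionary parsing rules described above.
function parseCharDictionary(text) {
  // Split on "\n" and strip only a trailing "\r". Do NOT trim(): String.prototype.trim
  // removes U+3000 (ideographic space), which is a legitimate dictionary token.
  const rows = String(text).split("\n").map((row) => row.replace(/\r$/, ""));
  // A final newline yields one empty trailing row; drop it.
  if (rows.length > 0 && rows[rows.length - 1] === "") rows.pop();
  const dictionary = ["<blank>", ...rows]; // index 0 is the CTC blank (assumed layout)
  // Append the space token only if the file did not already end with one.
  if (dictionary[dictionary.length - 1] !== " ") dictionary.push(" ");
  return dictionary;
}

console.log(parseCharDictionary("a\nb\n \n").length);   // dictionary already ends with a space
console.log(parseCharDictionary("a\n\u3000\nb").length); // space appended, U+3000 kept
```

With this layout the class count is `rows + 2` when the file has no trailing space line and `rows + 1` when it does, which is the difference that the off-by-one fix addresses.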

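Verification

  • npm run vendor:tesseract regenerates the sanitized bundle → local-security-test passes.
  • npm run vendor:paddle → downloads det/rec/dict and passes SHA-256 verification; skips gracefully when offline.
  • Full npm test is green (29/29).
  • The complete browser/Tauri (WebGPU) path for real model recognition should still be verified by hand following docs/PP_OCRV5_BROWSER_VERIFICATION.md.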

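Notes

  • Model weights are not committed (.gitignore); vendor:paddle rebuilds them from a pinned source + SHA-256, and release:prepare packages them.
  • A fresh clone does not include cls by default (optional); if 180° orientation correction is needed, it can be imported in the Security Center.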

🤖 Generated with Claude Code

Vantalens and others added 23 commits May 29, 2026 11:36
Land the first of the three-layer post-conversion verification system
(rule diff + SSIM + OCR readback). Introduces public/core/verification/
with shared block fingerprints, field-level diffSemanticDocs, and a
runVerificationStage orchestrator wired after the Repair Engine cycle.

- block-fingerprint.js: lift blockFingerprint/modelFingerprint out of
  repair-engine (byte-for-byte identical) + getBlockKey/extractBlockFields
  + ROUND_TRIP_FORMATS single source
- rule-diff.js: diffSemanticDocs -> { identical, blockCounts, changed/
  added/removedBlocks, fidelity, overallScore } with minor/major severity
- verification-stage.js: gating + same-format readback + md<->html
  cross-format loopback + RULE_DIFF_DRIFT / RULE_DIFF_READBACK_FAILED
- format-registry: write qualityReport.ruleDiff + .verification envelope;
  repair-engine roundTripDelta contract unchanged
- scripts/rule-diff-test.js (10 assertions) wired into npm test (21 total)
- guard scripts + MULTI_MODEL_ARCHITECTURE + DEVELOPMENT_TASKS synced
- two specs: P9-C overall + P9-C.1 sub-stage

npm test (21 scripts), git diff --check, release:prepare all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Land the second of the three-layer post-conversion verification system.
SSIM uses visual-loopback semantics: for visual-preserving inputs
(pdf/png), rasterize the input page and output page and compare
structural similarity, writing qualityReport.ssim.

- ssim.js: self-implemented, zero-dependency SSIM core (rgbaToGrayscale,
  resampleGrayscale box resample, computeSSIM windowed mean, compareImages)
- page-image-source.js: pixel-source abstraction (Node throws
  VERIFICATION_IMAGE_SOURCE_UNAVAILABLE, browser auto-loads canvas impl,
  setPageImageSource for tests); RASTERIZABLE_FORMATS = {pdf, png}
- page-image-source-browser.js: vendor pdfjs + canvas getImageData
- verification-stage: runSsimLayer + async runVerificationStageAsync that
  merges the sync rule-diff base with the async SSIM layer; sync
  runVerificationStage unchanged (qualityReport.ssim stays null)
- format-registry: extract _runRepairCycle + _assembleQuality; convert()
  stays sync (rule-diff), convertAsync() uses async wrap (rule-diff + ssim)
- scripts/ssim-verification-test.js (12 assertions) wired into npm test (22)
- guards + MULTI_MODEL_ARCHITECTURE + DEVELOPMENT_TASKS synced

Rendering is stub-only this round (Node has no canvas); real PDF/PNG
fixtures and browser end-to-end deferred. No new npm deps.

npm test (22 scripts), git diff --check, release:prepare all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
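
As an illustration of the zero-dependency SSIM core named in the commit above, the following sketch computes a mean SSIM over non-overlapping windows. The constants use the standard SSIM choice K1 = 0.01, K2 = 0.03 with dynamic range 255; the 8-pixel window, the Rec. 601 grayscale weights, and the `null` return for images smaller than one window are assumptions, and the project's ssim.js may be structured differently.

```javascript
// Convert RGBA bytes to one luminance value per pixel (Rec. 601 weights, assumed).
function rgbaToGrayscale(rgba, width, height) {
  const gray = new Float64Array(width * height);
  for (let i = 0; i < width * height; i++) {
    gray[i] = 0.299 * rgba[4 * i] + 0.587 * rgba[4 * i + 1] + 0.114 * rgba[4 * i + 2];
  }
  return gray;
}

// Mean SSIM over non-overlapping windows; both inputs must share width and height.
function computeSSIM(a, b, width, height, windowSize = 8) {
  const C1 = (0.01 * 255) ** 2;
  const C2 = (0.03 * 255) ** 2;
  const n = windowSize * windowSize;
  let total = 0;
  let windows = 0;
  for (let y0 = 0; y0 + windowSize <= height; y0 += windowSize) {
    for (let x0 = 0; x0 + windowSize <= width; x0 += windowSize) {
      let sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
      for (let y = y0; y < y0 + windowSize; y++) {
        for (let x = x0; x < x0 + windowSize; x++) {
          const pa = a[y * width + x];
          const pb = b[y * width + x];
          sumA += pa; sumB += pb;
          sumAA += pa * pa; sumBB += pb * pb; sumAB += pa * pb;
        }
      }
      const meanA = sumA / n, meanB = sumB / n;
      const varA = sumAA / n - meanA * meanA;
      const varB = sumBB / n - meanB * meanB;
      const cov = sumAB / n - meanA * meanB;
      total += ((2 * meanA * meanB + C1) * (2 * cov + C2)) /
        ((meanA * meanA + meanB * meanB + C1) * (varA + varB + C2));
      windows++;
    }
  }
  return windows === 0 ? null : total / windows; // null: image smaller than one window
}
```

Identical inputs score 1, and structurally inverted inputs score far lower, which is the property the visual-loopback check relies on.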
Land the third and final layer of the post-conversion verification system.
OCR readback rasterizes the output (PDF) and re-reads it with the OCR
engine, comparing recognized text against the original SemanticDoc text
via a character-level multiset similarity, writing qualityReport.ocrReadback.

- ocr-readback.js: compareText (char-level multiset recall/precision/f1,
  robust for CJK + OCR noise) + normalizeText + extractModelText +
  runOcrReadbackLayer; reuses registered ocr-text engine + OCR pdf rasterizer
- verification-stage: runVerificationStageAsync dynamic-imports the readback
  layer (keeps OCR off the sync convert path) and merges it into the envelope
- format-registry: _assembleQuality adds ocrReadback (null on sync path)
- scripts/ocr-readback-test.js (13 assertions) wired into npm test (23 total)
- guards + MULTI_MODEL_ARCHITECTURE + DEVELOPMENT_TASKS synced

Three layers (rule-diff + ssim + ocr-readback) now write a unified
qualityReport.{ruleDiff,ssim,ocrReadback} + verification envelope.
Rendering/OCR are stub-only this round; no new npm deps.

npm test (23 scripts), git diff --check, release:prepare all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
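
The character-level multiset similarity used by the OCR readback layer can be sketched as follows. The recall/precision/f1 definitions follow the commit text; the normalization steps (NFKC, whitespace removal, lowercasing) and the handling of empty inputs are assumptions for illustration only.

```javascript
// Count characters by code point so CJK and astral characters are handled correctly.
function countChars(text) {
  const counts = new Map();
  for (const ch of text) counts.set(ch, (counts.get(ch) || 0) + 1);
  return counts;
}

// Assumed normalization: compatibility-fold, drop whitespace, lowercase.
function normalizeText(text) {
  return String(text).normalize("NFKC").replace(/\s+/g, "").toLowerCase();
}

// Multiset overlap: order-insensitive, so reordered or re-wrapped OCR lines do not hurt the score.
function compareText(expected, recognized) {
  const exp = countChars(normalizeText(expected));
  const rec = countChars(normalizeText(recognized));
  let overlap = 0, expTotal = 0, recTotal = 0;
  for (const [ch, n] of exp) {
    expTotal += n;
    overlap += Math.min(n, rec.get(ch) || 0);
  }
  for (const n of rec.values()) recTotal += n;
  const recall = expTotal === 0 ? 1 : overlap / expTotal;
  const precision = recTotal === 0 ? 1 : overlap / recTotal;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { recall, precision, f1 };
}

console.log(compareText("Hello World", "hello wor1d"));
```

Because the comparison ignores character order, it tolerates line re-wrapping and reading-order differences, at the cost of not detecting pure transpositions.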
Add a programmatic sample-corpus generator to stress-test conversion,
layout, and the three-layer verification across every supported format
at varied sizes (large tier >= 3MB). Binaries stay git-ignored per the
existing programmatic-fixture policy.

- scripts/lib/sample-content.js: deterministic complex content builders
  (md/html/json/xml/csv/txt) with headings, nested lists, task items,
  aligned tables, multi-lang code, nested quotes, footnotes, images,
  CJK/RTL/emoji/entities; SIZE_TIERS + buildToTargetBytes
- scripts/lib/png-encode.js: minimal node:zlib PNG encoder (no new deps)
- scripts/generate-samples.js: emits md/html/txt/json/xml/csv natively,
  docx/pptx/epub/pdf/xlsx via project writers, png via encoder; writes
  MANIFEST.json with coverageGaps for doc/ofd (no writer); --tiers/--out
- scripts/sample-corpus-test.js: fast in-memory round-trip gate (24th
  npm test) — no 3MB writes
- samples:generate npm script; samples/generated/ git-ignored
- samples/fixtures/README + DEVELOPMENT_TASKS documented

Verified large tier: text formats >=3MB, pdf 19MB, docx 13.6MB, xlsx 16MB,
epub 4.9MB. npm test (24 scripts), git diff --check, release:prepare pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…9-C.4)

The P9-C.1/2/3 layers compute qualityReport.{ruleDiff,ssim,ocrReadback}
plus a verification envelope, but transformContent discarded result.quality
and the old bottom drawer was already removed — so the core "post-conversion
verification" differentiator was invisible. Surface it.

- index.html: new collapsible #verificationReportPanel inside the output
  panel (auto-repair verdict + rule-diff / SSIM / OCR-readback rows + warning
  counts + active-layer badge); does not revive the removed drawer ids
- app.js: renderVerificationReport(quality) (textContent only, no innerHTML),
  captures result.quality into currentConversionQuality, renders on both
  text and binary paths, clears on reset
- styles.css: .verification-report/.verification-row/.verification-badge,
  data-state ok/drift/skip coloring via :has() left border
- browser-smoke-test: assert #verificationReportPanel present; removed-drawer
  negative assertions still hold

Display-only; no change to conversion core / verification-stage. Skipped
layers show their reason (honest: not-triggered != failed). No new deps.

npm test (24 scripts), git diff --check, release:prepare all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
landing-view.js imports getKnownInputFormats from browser-transformer.js,
but it was never re-exported (only defined in format-registry.js). In the
browser this is a module-load SyntaxError that takes down landing-view.js
and app.js, leaving the landing page blank below the header. Node-side
tests never caught it because browser-smoke-test only asserted static HTML
strings, not that the module graph loads.

- browser-transformer.js: import + re-export getKnownInputFormats
- browser-smoke-test.js: load browser-transformer/router/landing-view module
  graphs and assert key re-exports exist, so a missing re-export fails CI
  instead of silently blanking the UI

npm test (24 scripts) passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Research showed the roadmap's named advanced-OCR targets (PaddleOCR-VL /
MinerU) are not embeddable under the project's constraints (browser/Tauri
local, no cloud, 30-80MB default bundle, no Python runtime): the VLM has no
mature ONNX/WebGPU path (~500MB + 1-2GB VRAM or vLLM), MinerU is a
Python/vLLM tool. Per user confirmation, the built-in advanced-OCR target
is now PP-OCRv5 (ONNX Runtime + WebGPU, WASM fallback); VLMs are marked as
far-term/external.

- specs: 2026-05-29-p9d-advanced-ocr-research.md (with sources) +
  2026-05-29-p9d1-paddle-ocr-skeleton-design.md
- paddle-ocr-engine.js: paddleOcrEngine (id paddleocr-v5, ocr-text/ocr-layout)
  implements OCREngine; isAvailable() false in Node; recognize() three-stage
  rejection (vendor-not-ready / model-missing / runtime-not-wired);
  markPaddleOcrVendorReady. No onnxruntime, no real inference this round.
- paddle-ocr-bootstrap.js: registers engine (after tesseract) + PP-OCRv5
  ONNX ModelManifest (engine paddleocr, int8, det/cls/rec) as not-downloaded
- browser-transformer: import bootstrap + export paddle API
- ocr-baseline-test: pickForTask fallback set + 35th block (paddle skeleton)
- guards: local-security ALLOWED/STRICT; direction-test PP-OCRv5/ONNX/WebGPU
- direction docs reframed: PP-OCRv5 built-in, VLM far-term/external
  (kept "default bundle excludes GB-scale models")

Skeleton-first (mirrors P9-A.2 tesseract). npm test (24), git diff --check,
release:prepare all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Wire the ONNX Runtime layer for PP-OCRv5 advanced OCR, mirroring the
tesseract P9-A.2 vendor+skeleton stage. Real det/cls/rec inference pipeline
and CTC decode are deferred to P9-D.2.b (needs real models + dictionaries).

- scripts/sync-onnxruntime-vendor.js: sync ort*.mjs + *.wasm from
  node_modules/onnxruntime-web/dist to public/vendor/onnxruntime/; exit 0
  if the optionalDependency is absent (does not block install/release)
- paddle-ocr-runtime.js: loadOnnxRuntime (same-origin vendor dynamic import,
  sets ort.env.wasm.wasmPaths same-origin, throws OCR_VENDOR_LOAD_FAILED in
  Node), pickExecutionProviders (navigator.gpu -> [webgpu,wasm] else [wasm]),
  createOcrSession/disposeOcrSession skeleton, PADDLE_VENDOR_PATHS
- paddle-ocr-engine.recognize: third stage now loads the runtime; in a browser
  with vendor+models it reports pipeline-not-wired (P9-D.2.b)
- package.json: onnxruntime-web optionalDependency + vendor:onnx +
  release:prepare runs the onnx vendor sync
- browser-transformer exports the runtime API
- guards: local-security recognizes public/vendor/onnxruntime/ + STRICT;
  direction-test asserts onnxruntime-web; release-readiness expects new
  release:prepare/vendor:onnx scripts
- ocr-baseline-test 36th block (EP selection + Node vendor-load reject)

Tauri CSP already covers ORT (wasm-unsafe-eval + worker-src blob: +
connect-src 'self'); no change. npm test (24), git diff --check,
release:prepare all pass. No forced install of onnxruntime-web.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
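
The execution-provider selection described above (navigator.gpu present → webgpu with wasm fallback, otherwise wasm only) is small enough to sketch in full. The injectable `nav` parameter is an assumption added here so the function can be exercised outside a browser.

```javascript
// Sketch: prefer WebGPU when the browser exposes navigator.gpu, always keep WASM as fallback.
// `nav` is injectable for tests; by default it reads the global navigator if one exists.
function pickExecutionProviders(nav = globalThis.navigator) {
  return nav && nav.gpu ? ["webgpu", "wasm"] : ["wasm"];
}

console.log(pickExecutionProviders({ gpu: {} }));
console.log(pickExecutionProviders({}));
```

The returned array is in the shape that onnxruntime-web accepts as the `executionProviders` session option, so the first provider that initializes successfully is used.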
… fix

Let users import PP-OCRv5 det/cls/rec ONNX models into the local cache from
the security center, flipping paddleOcrEngine from model-missing toward
ready. Reuses the tesseract tessdata local-import pattern (file picker +
SHA-256 + IndexedDB) — the project's no-network STRICT guard forbids remote
fetch, so "on-demand download" means local user import.

- security-center.js: renderPaddleActions for engine=paddleocr rows (import
  det.onnx/cls.onnx/rec.onnx + clear); importPaddleModel (sha256 -> store at
  paddleocr/v5/<file> -> ensureProbe; only flips to AVAILABLE + sets vendor
  ready when all three present, else reports which are missing);
  clearPaddleModels; click delegation for data-import-paddle/data-clear-paddle

- Fix latent bug: paddleOcrEngine/tesseractOCREngine stored readiness on a
  frozen instance prop, so ensureProbe()'s assignment throws "Cannot assign
  to read only property" in ES-module strict mode — which made the security
  center import flow fail silently (caught as "import failed"). Readiness now
  lives in module-level state; the engine objects stay frozen.

- ocr-baseline-test: 37th block (availability flips with model presence +
  vendor flag) + ensureProbe no-throw assertion on the frozen tesseract engine
- browser-smoke: assert #modelCacheFileInput present
- docs/DEVELOPMENT_TASKS synced

npm test (24), git diff --check, release:prepare all pass. No remote fetch;
det/cls/rec inference + CTC decode deferred to P9-D.2.b.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
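
The latent bug fixed in the commit above can be reproduced in a few lines. The sketch below contrasts assigning to a property of a frozen object (which throws in strict mode, and ES modules are always strict) with keeping readiness in module-level state. The object and function names are illustrative, not the project's.

```javascript
// Old pattern: readiness stored on the frozen engine object itself.
const legacyEngine = Object.freeze({ id: "legacy-engine", available: false });

function legacyProbe() {
  "use strict"; // reproduces ES-module semantics inside a plain script
  legacyEngine.available = true; // TypeError: cannot assign to read only property
}

// New pattern: readiness lives in module-level state; the engine stays frozen.
let engineReady = false;

const engine = Object.freeze({
  id: "fixed-engine",
  isAvailable() {
    return engineReady;
  },
});

function markEngineReady(value) {
  engineReady = Boolean(value);
}

try {
  legacyProbe();
} catch (error) {
  console.log(error instanceof TypeError); // the failure the import flow used to swallow
}
markEngineReady(true);
console.log(engine.isAvailable());
```

In sloppy mode the same assignment fails silently, which is one reason a bug like this can go unnoticed until the code runs as an ES module.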
Implement the real PP-OCRv5 inference pipeline as pure, fully-testable
functions plus an orchestrator that takes injectable sessions — so it runs
end-to-end in Node with mock sessions + synthetic tensors, no real models.

- paddle-ocr-pipeline.js: parseCharDictionary, preprocessForDetection
  (ImageNet norm + multiple-of-32 + limit_side_len), preprocessForRecognition
  (height 48, [-1,1]), dbPostProcess (threshold + 4-connected components +
  axis-aligned bbox + box-score filter + scale-back + reading order),
  ctcGreedyDecode (argmax -> collapse repeats -> drop blank -> map dict),
  cropImageData, resizeRgba; runPaddlePipeline({ ort, det/cls/recSession,
  imageData, dictionary }) -> OCRResult
- paddle-ocr-engine.recognize: in-browser decode (Image+canvas, no fetch ->
  respects the no-network guard) -> load det/cls/rec buffers + optional dict
  from cache -> createOcrSession x3 -> runPaddlePipeline; Node still rejects
  at loadOnnxRuntime before any browser-only path
- browser-transformer exports the pipeline API
- scripts/paddle-ocr-pipeline-test.js (9 blocks incl. mock-session e2e
  decoding "HI") wired into npm test (25th script)
- guards: local-security ALLOWED/STRICT; direction-test runPaddlePipeline/
  ctcGreedyDecode
- resource-budget: public/core 256KB -> 320KB (pure-JS algorithm code, no
  model weights; models stay in model-cache) with rationale

cls angle correction / minAreaRect+unclip / multi-column deferred. Real-model
end-to-end is browser/manual. npm test (25), git diff --check,
release:prepare all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
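
The CTC greedy decode steps listed above (argmax → collapse repeats → drop blank → map through the dictionary) can be sketched as below. The flat `[timesteps × classes]` layout and blank at index 0 are assumptions; the real ctcGreedyDecode signature is not shown in the commit.

```javascript
// Sketch of CTC greedy decoding over a flat row-major [timesteps x classes] score array.
function ctcGreedyDecode(scores, timesteps, classes, dictionary, blankIndex = 0) {
  let text = "";
  let previous = -1;
  for (let t = 0; t < timesteps; t++) {
    // 1. argmax over the class axis
    let best = 0;
    for (let c = 1; c < classes; c++) {
      if (scores[t * classes + c] > scores[t * classes + best]) best = c;
    }
    // 2. collapse repeats and 3. drop blanks; 4. map the survivor through the dictionary
    if (best !== blankIndex && best !== previous) {
      text += dictionary[best] ?? ""; // an empty dictionary yields no text at all
    }
    previous = best;
  }
  return text;
}

const dictionary = ["<blank>", "H", "I"];
const scores = Float32Array.from([
  0.1, 0.8, 0.1, // H
  0.1, 0.7, 0.2, // H (repeat, collapsed)
  0.9, 0.05, 0.05, // blank
  0.1, 0.1, 0.8, // I
  0.2, 0.1, 0.7, // I (repeat, collapsed)
]);
console.log(ctcGreedyDecode(scores, 5, 3, dictionary));
```

A blank between two identical classes keeps both characters, which is how CTC represents doubled letters.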
…eract)

Capstone for the PP-OCRv5 local advanced-OCR chain: prefer paddle over
tesseract when both are available, so PNG / scanned-PDF OCR stages
automatically use the higher-accuracy engine.

- ocr-engine.js pickForTask: priority-aware — sort task candidates by
  priority (desc, default 0) and pick the first available; fall back to the
  last-registered candidate when none are available (unchanged)
- paddle-ocr-engine priority 20, tesseract-engine priority 10
  (placeholder stays 0) -> PP-OCRv5 wins when available
- PNG / scanned-PDF stages need no change: enhanceWithOCR /
  runScannedPdfOCRStage resolve via pickForTask("ocr-text")
- ocr-baseline-test 38th block: priority selection unit + default-registry
  preference flip (both available -> paddleocr-v5; remove paddle models ->
  tesseract-zh-en)

The PP-OCRv5 chain (contract -> runtime -> model import -> inference
pipeline -> route preference) is now complete; real-model end-to-end is a
browser/manual step. npm test (25), git diff --check, release:prepare pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
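
A minimal sketch of the priority-aware selection described above follows. The engine shape (`tasks`, `priority`, `isAvailable`) and passing the registry as an array are assumptions; the behaviour (sort by priority descending with default 0, first available wins, otherwise last registered) follows the commit text.

```javascript
// Sketch of priority-aware engine selection; engines are listed in registration order.
function pickForTask(engines, task) {
  const candidates = engines.filter((engine) => engine.tasks.includes(task));
  if (candidates.length === 0) return null;
  // Copy before sorting so the registration order is preserved for the fallback.
  const byPriority = [...candidates].sort((a, b) => (b.priority ?? 0) - (a.priority ?? 0));
  return byPriority.find((engine) => engine.isAvailable()) ?? candidates[candidates.length - 1];
}

let paddleAvailable = true;
const registry = [
  { id: "placeholder", tasks: ["ocr-text"], isAvailable: () => false },
  { id: "tesseract-zh-en", tasks: ["ocr-text"], priority: 10, isAvailable: () => true },
  { id: "paddleocr-v5", tasks: ["ocr-text", "ocr-layout"], priority: 20, isAvailable: () => paddleAvailable },
];

console.log(pickForTask(registry, "ocr-text").id); // paddle wins while it is available
paddleAvailable = false;
console.log(pickForTask(registry, "ocr-text").id); // falls to tesseract
```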
The PP-OCRv5 chain (P9-D.1..D.4) is fully covered in Node via mock sessions
and pure-function tests; real ONNX inference can only run in the browser/Tauri
(WebGPU/WASM). Document the manual steps: install+vendor onnxruntime-web,
import det/cls/rec ONNX + dictionary via the security center, verify engine
priority, real PNG/scanned-PDF recognition, the verification report, and the
no-network guarantee.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Make advanced OCR actually usable (P9-D.1..D.4 were contract/runtime/pipeline
scaffolding with no real model). Install the real runtime + models and
auto-load them on launch — no manual import needed.

- onnxruntime-web installed (optionalDependency 1.26.0); sync-onnxruntime-vendor
  trimmed to the minimal set (ort.min.mjs + ort-wasm-simd-threaded.jsep.{mjs,wasm},
  JSEP build covers WebGPU+WASM, ~25MB; drops ~68MB of redundant variants)
- real PP-OCRv5 mobile ONNX (det 4.8MB / cls 0.58MB / rec 16.6MB) + dict
  downloaded to public/vendor/paddleocr/ (from the OnnxOCR PP-OCRv5 ONNX repo)
- paddle-default-models.js: ensurePaddleDefaultModels() idempotently fetches
  the same-origin /vendor/paddleocr/ models into defaultOCRStorage (IndexedDB),
  marks the engine ready + probes -> advanced OCR works out of the box; missing
  vendor is silently skipped (manual security-center import still works)
- app.js calls it fire-and-forget on init; browser-transformer exports it
- .gitignore excludes the heavy onnxruntime + paddleocr vendors (reproducible
  via npm i + vendor script + local download; bundled into the build from disk)
- guards: paddle-default-models ALLOWED/STRICT; isLocalVendorAsset trusts the
  onnxruntime vendor (its minified bundle has CDN strings; the no-network
  guarantee comes from pinning ort.env.wasm.wasmPaths same-origin + Tauri CSP
  connect-src 'self'); public/vendor budget 6MB -> 64MB with rationale

Real ONNX inference runs in the browser/WebGPU only; Node stays mock/pure-fn
covered (npm test 25 green). Browser e2e steps: docs/PP_OCRV5_BROWSER_VERIFICATION.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Render LaTeX math in document previews. The markdown inline tokenizer
previously ate math (\frac -> rac via backslash escaping; _ -> <em>), so
math is now a protected token.

- inline-tokens.js: recognize $$...$$ (display) and $...$ (inline) BEFORE
  escape handling; capture content verbatim (no recursion/escaping). Inline
  heuristic (no inner-edge whitespace) excludes currency like "$5 and $10"
- semantic-inlines.js: createInlineMath + math node rendered in
  plainText/markdown/html; HTML emits <span class="t2f-math" data-tex="raw"
  data-display=".."> with a no-JS fallback ($tex$); markdown round-trips
- vendored KaTeX v0.17.0 (css + js + 20 woff2 fonts, ~592KB) at
  public/vendor/katex/; katex-render.js renderMathIn() typesets .t2f-math
  spans via the global katex (same-origin, zero network; silent fallback if
  katex absent)
- index.html/preview.html load katex css + defer js; app.js (3 preview
  sites) + preview.js call renderMathIn after rendering
- local-security: trust the katex vendor (its only http strings are W3C
  MathML/SVG namespaces, not network)
- scripts/latex-math-test.js (7 blocks: tokenization with backslash+underscore
  preserved, currency exclusion, katex-targetable span, md round-trip,
  plain-text + factory, md->html no <em>) -> npm test (26th)

npm test (26) passes; existing conversion snapshots survive the new token.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
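
The tokenizer rule from the commit above (recognize `$$...$$` and `$...$` before escape handling, capture the content verbatim, and reject inline spans with whitespace at either inner edge) can be sketched as a small scanner. The token shapes and the function name are hypothetical; the real inline-tokens.js handles many more token types.

```javascript
// Hypothetical sketch: split text into plain-text and math tokens.
function tokenizeMath(text) {
  const tokens = [];
  let plain = "";
  let i = 0;
  const flush = () => {
    if (plain) tokens.push({ type: "text", value: plain });
    plain = "";
  };
  while (i < text.length) {
    if (text.startsWith("$$", i)) {
      const end = text.indexOf("$$", i + 2);
      if (end > i + 2) {
        flush();
        tokens.push({ type: "math", display: true, tex: text.slice(i + 2, end) });
        i = end + 2;
        continue;
      }
    } else if (text[i] === "$") {
      const end = text.indexOf("$", i + 1);
      const tex = end === -1 ? "" : text.slice(i + 1, end);
      // Inline heuristic: no whitespace at the inner edges, which excludes "$5 and $10".
      if (tex && !/^\s|\s$/.test(tex)) {
        flush();
        tokens.push({ type: "math", display: false, tex }); // captured verbatim, no escaping
        i = end + 1;
        continue;
      }
    }
    plain += text[i];
    i++;
  }
  flush();
  return tokens;
}

console.log(tokenizeMath("Euler: $e^{i\\pi}+1=0$ and $$\\frac{a_1}{b}$$"));
console.log(tokenizeMath("Cost $5 and $10 total"));
```

Because math is captured before backslash and emphasis handling run, `\frac` keeps its backslash and `_` never turns into `<em>`.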
OCR was unreachable in the running app: convertWithWorker sent conversions to
a Web Worker that uses sync convertContent (no OCR stage), and OCR needs
main-thread canvas decode + onnxruntime which a Worker can't provide. So image
and scanned-PDF inputs never triggered OCR.

Route png/pdf inputs to convertContentAsync on the main thread (text PDFs still
take the normal path; images and scanned PDFs now run OCR), and ensure the
bundled PP-OCRv5 models are loaded first (idempotent ensurePaddleDefaultModels).

OCR has no dedicated button — it runs automatically when converting an image
or scanned PDF to a text format.

npm test (26) passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…urate

Validated the inference pipeline against the real PP-OCRv5 ONNX models with
onnxruntime-node (dev-only). Findings:
- rec is correct: decodes the PaddleOCR word_10 crop to "PAIN", output classes
  C=18385 exactly match the dictionary (blank + 18383 + space). RGB channel
  order + (x/255-0.5)/0.5 normalization are right (no BGR needed)
- det is correct: produces a probability map; 16 text boxes on a real document
- root cause of garbled output: dbPostProcess emitted tight axis-aligned bboxes
  without PP-OCR's "unclip" expansion, clipping character strokes -> wrong
  recognition (avgConf 0.41, CJK gibberish). Adding unclip (distance =
  area*ratio/perimeter, unclipRatio default 1.6) yields coherent, correct text
  on a full product-label document (avgConf 0.978)

- dbPostProcess: add unclipRatio param + outward box expansion
- pipeline unit test: assert exact bbox with unclipRatio:0, plus an expansion
  assertion
- scripts/paddle-ocr-integration-test.js: runs the real rec model via
  onnxruntime-node on a committed fixture (samples/ocr/word-PAIN.png),
  asserts "PAIN" + C==dict; gracefully skips (exit 0) when the dev deps /
  models / fixture are absent, so npm test stays green in CI. Wired in (27th)

Real rec hits 0.991 confidence locally. npm test (27) passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
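
The unclip step from the findings above uses distance = area × ratio / perimeter. For an axis-aligned box that reduces to a uniform outward expansion, sketched below. The `{ x, y, width, height }` box shape and the optional clamping to image bounds are assumptions; only the distance formula and the 1.6 default come from the commit.

```javascript
// Sketch: expand an axis-aligned box outward by the PP-OCR style unclip distance.
function unclipBox(box, unclipRatio = 1.6, bounds = null) {
  const { x, y, width, height } = box;
  if (!(unclipRatio > 0) || width <= 0 || height <= 0) return { x, y, width, height };
  const area = width * height;
  const perimeter = 2 * (width + height);
  const distance = (area * unclipRatio) / perimeter;
  let left = x - distance;
  let top = y - distance;
  let right = x + width + distance;
  let bottom = y + height + distance;
  if (bounds) {
    // Keep the expanded box inside the image (assumed behaviour).
    left = Math.max(0, left);
    top = Math.max(0, top);
    right = Math.min(bounds.width, right);
    bottom = Math.min(bounds.height, bottom);
  }
  return { x: left, y: top, width: right - left, height: bottom - top };
}

// A 30x10 text box: area 300, perimeter 80, so the box grows by 300 * 1.6 / 80 = 6 px per side.
console.log(unclipBox({ x: 10, y: 10, width: 30, height: 10 }));
```

Passing `unclipRatio: 0` returns the tight box unchanged, which matches how the pipeline unit test asserts the exact bbox.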
…ntrol

Improve hard-case recognition and add explicit quality assessment, validated
against the real PP-OCRv5 models with onnxruntime-node.

- orientation: the cls model outputs [1,2] softmax over [0°,180°]
  (upright->[1,0], flipped->[0,1]). runPaddlePipeline now applies it via
  interpretClsOutput, rotating 180° crops before rec. Validated: a fully
  upside-down document is read correctly (all 16 lines flipped, avgConf 0.976)
- vertical/sideways: tall boxes (h/w > verticalAspect) additionally try 90°
  cw/ccw and keep the highest-confidence read (robust for sideways labels;
  cost limited to the rare tall boxes)
- rotateImageData180 / rotateImageData90(dir) / interpretClsOutput pure helpers
  (unit-tested: 180 is its own inverse, 90 swaps dims + maps corners, cls
  threshold branches)
- quality control: runPaddlePipeline returns a quality summary (lineCount,
  averageConfidence, minConfidence, lowConfidenceLines, rotatedLines, grade
  high/medium/low); enhanceWithOCR records result.quality into
  metadata.modelReview.ocrQuality, alongside the existing per-line confidence +
  detectOCRLowConfidence validator + P9-C OCR-readback verification
- pipeline test: rotation/cls/quality assertions (12 blocks total)

Still limited: strong italic / ornate artistic text (needs minAreaRect +
perspective-warp polygon boxes and a stronger rec model). npm test (27) passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
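
The orientation helpers named above can be sketched as pure functions. The `[p0°, p180°]` output layout comes from the commit; the 0.9 threshold default and the returned object shape are assumptions, and the image is modelled as a plain `{ width, height, data }` object with RGBA bytes.

```javascript
// Decide whether a crop should be rotated, given the cls softmax over [0°, 180°].
// The threshold default is an assumption; the commit only says threshold branches are tested.
function interpretClsOutput(probs, threshold = 0.9) {
  const flippedScore = probs[1];
  return {
    rotate180: flippedScore > probs[0] && flippedScore >= threshold,
    confidence: Math.max(probs[0], probs[1]),
  };
}

// Rotate RGBA pixels by 180°: pixel i moves to position (count - 1 - i).
function rotateImageData180(image) {
  const { width, height, data } = image;
  const out = new Uint8ClampedArray(data.length);
  const count = width * height;
  for (let i = 0; i < count; i++) {
    const j = count - 1 - i;
    out[4 * j] = data[4 * i];
    out[4 * j + 1] = data[4 * i + 1];
    out[4 * j + 2] = data[4 * i + 2];
    out[4 * j + 3] = data[4 * i + 3];
  }
  return { width, height, data: out };
}

console.log(interpretClsOutput([0.02, 0.98]));
console.log(interpretClsOutput([0.4, 0.6]));
```

A high threshold keeps borderline crops upright, since a wrong 180° flip is far more damaging to recognition than a missed one.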
…rescued)

Per the steer — only convert the text content (no artistic-style preservation),
focus on denoising. Add image denoising that cleans noisy/artistic-background
images without hurting clean ones.

- denoiseImageData: per-channel 3x3 median filter (edge-preserving, removes
  salt-and-pepper); estimateNoiseLevel: fraction of isolated pixels that jump
  far from their 3x3 median (separates speckle from text edges)
- runPaddlePipeline options.denoise = "auto"(default)|true|false. AUTO denoises
  only when estimateNoiseLevel > threshold (default 0.05), because median
  filtering softens clean text (measured 0.974 -> 0.903), so clean images are
  never denoised
- validated on real models: clean doc (noise 0.016) stays untouched at 0.974;
  15% salt-and-pepper (noise 0.10) is denoised, recovering detection from 4 ->
  16 lines and avgConf 0.692 -> 0.832 (heavy noise collapses detection;
  denoise rescues it)
- quality summary adds denoised + noiseLevel
- pipeline test: noise-estimate / median / auto-gating assertions (14 blocks)

Explicitly not doing artistic-style preservation (text content only) or
minAreaRect perspective. npm test (27) passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
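
The auto-gated denoise described above can be sketched with a 3x3 median filter and an isolated-pixel noise estimate. The 0.05 gate comes from the commit; the `jump` cutoff of 64, the border replication, the grayscale average, and the `maybeDenoise` wrapper name are assumptions made for this sketch.

```javascript
// Median of the 3x3 neighbourhood around (x, y); out-of-range coordinates replicate the border.
function median3x3(read, x, y, width, height) {
  const values = [];
  for (let dy = -1; dy <= 1; dy++) {
    for (let dx = -1; dx <= 1; dx++) {
      const xx = Math.min(width - 1, Math.max(0, x + dx));
      const yy = Math.min(height - 1, Math.max(0, y + dy));
      values.push(read(xx, yy));
    }
  }
  values.sort((a, b) => a - b);
  return values[4];
}

// Per-channel median filter on R, G, B; alpha is copied unchanged.
function denoiseImageData(image) {
  const { width, height, data } = image;
  const out = new Uint8ClampedArray(data);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      for (let c = 0; c < 3; c++) {
        const read = (xx, yy) => data[(yy * width + xx) * 4 + c];
        out[(y * width + x) * 4 + c] = median3x3(read, x, y, width, height);
      }
    }
  }
  return { width, height, data: out };
}

// Fraction of pixels that differ from their 3x3 median by more than `jump` (assumed cutoff).
function estimateNoiseLevel(image, jump = 64) {
  const { width, height, data } = image;
  const gray = (x, y) => {
    const i = (y * width + x) * 4;
    return (data[i] + data[i + 1] + data[i + 2]) / 3;
  };
  let noisy = 0;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      if (Math.abs(gray(x, y) - median3x3(gray, x, y, width, height)) > jump) noisy++;
    }
  }
  return noisy / (width * height);
}

// mode: "auto" denoises only above the threshold, true always, false never.
function maybeDenoise(image, mode = "auto", threshold = 0.05) {
  const noiseLevel = estimateNoiseLevel(image);
  const denoised = mode === true || (mode === "auto" && noiseLevel > threshold);
  return { image: denoised ? denoiseImageData(image) : image, denoised, noiseLevel };
}
```

Pixels on a straight text edge share their value with most of their neighbourhood, so they match the median and are not counted as noise; only isolated speckles are.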
Skew is the biggest remaining recognition gap: a +10° rotated document drops
from 16 lines/0.974 to 8 lines/0.535 (detection collapses). Auto-deskew fixes
it.

- rotateImageDataByAngle(im, deg): arbitrary-angle nearest-neighbor rotation
  with canvas expansion + white background
- estimateSkewAngle(probData, mapW, mapH): shear-projection histogram variance
  on the (binarized, downsampled) det probability map -> the angle that makes
  text rows most horizontal (the deskew rotation)
- runPaddlePipeline: extract a detect(image) helper; options.deskew =
  true(default)|false. Estimate skew from the first det; if |est| >= minSkew
  (default 3°), rotate the image upright and re-detect, then proceed. Upright
  images estimate ~0 and skip the second det (zero overhead); rec always runs
  once
- validated on real models: +10° doc recovers 8/0.535 -> 16 lines/0.970
  (skewApplied=10), -8° recovers too, upright untouched (skewApplied=0, 0.974)
- quality summary adds skewApplied
- pipeline test: arbitrary rotation + skew estimation (slanted synthetic rows
  detected ~angle, flat rows ~0) — 16 blocks

npm test (27) passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Address "enhance recognition of the file's internal text format": group OCR
lines (with bboxes) into a structured document instead of one flat paragraph.

- ocr-structure.js: deriveOcrStructure(lines) sorts by reading order (y,x),
  detects headings by relative font size (line height >= 1.35x median -> heading,
  level 1-3 by ratio), splits paragraphs on large vertical gaps (> 0.7x median),
  joins same-paragraph lines (CJK tight, latin spaced); falls back to a single
  paragraph when bbox geometry is missing. blocksFromOcrResult walks pages
- png-ocr enhanceWithOCR now emits structured heading/paragraph blocks
  (replacing the old "join all 16 lines into one paragraph"); removed the dead
  helper + unused import
- validated on the real product label: 16 lines -> 4 blocks (title becomes a
  heading, body grouped into paragraphs)
- browser-transformer exports deriveOcrStructure / blocksFromOcrResult
- scripts/ocr-structure-test.js (7 blocks) wired into npm test (28th)

npm test (28) passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
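
The grouping rules above (reading-order sort, heading when line height ≥ 1.35× median, paragraph break when the vertical gap exceeds 0.7× median, CJK joined tight and Latin joined with a space) can be sketched as follows. The line shape `{ text, bbox: { x, y, width, height } }`, the use of median line height as the gap reference, and the CJK character ranges are assumptions; heading levels are omitted.

```javascript
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = sorted.length >> 1;
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

const CJK = /[\u3040-\u30ff\u3400-\u9fff\uf900-\ufaff]/; // rough range, for illustration

// CJK text joins without a space; anything touching Latin text gets one.
function joinLines(a, b) {
  const tight = CJK.test(a.slice(-1)) && CJK.test(b.charAt(0));
  return tight ? a + b : a + " " + b;
}

function deriveOcrStructure(lines) {
  if (lines.length === 0) return [];
  // Without bbox geometry, fall back to a single paragraph.
  if (lines.some((line) => !line.bbox)) {
    return [{ type: "paragraph", text: lines.map((line) => line.text).reduce(joinLines) }];
  }
  const sorted = [...lines].sort((a, b) => a.bbox.y - b.bbox.y || a.bbox.x - b.bbox.x);
  const medianHeight = median(sorted.map((line) => line.bbox.height));
  const blocks = [];
  let paragraph = null;
  let previous = null;
  for (const line of sorted) {
    const gap = previous ? line.bbox.y - (previous.bbox.y + previous.bbox.height) : 0;
    if (line.bbox.height >= 1.35 * medianHeight) {
      blocks.push({ type: "heading", text: line.text });
      paragraph = null; // a heading always closes the current paragraph
    } else if (paragraph && gap <= 0.7 * medianHeight) {
      paragraph.text = joinLines(paragraph.text, line.text);
    } else {
      paragraph = { type: "paragraph", text: line.text };
      blocks.push(paragraph);
    }
    previous = line;
  }
  return blocks;
}
```

For example, one 40 px title line followed by three 20 px body lines, with a large gap before the last one, yields a heading and two paragraphs.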
Make the OCR quality data (engine, confidence, grade, low-conf lines, skew,
rotation, denoise) visible — it was computed but never shown.

- fix: _runRepairCycle overwrote the OCR-stage modelReview with the repair
  engine's, dropping the ocr/ocrQuality sub-objects. Merge them back so the
  recognition quality reaches result.quality.modelReview
- index.html: add an "OCR 识别质量" row (#verificationOcrRecognitionRow, hidden
  unless OCR ran)
- app.js renderVerificationReport reads quality.modelReview.ocr + .ocrQuality
  and renders engine / lines / confidence / grade / low-confidence / skew /
  rotation / denoised, with grade-driven coloring
- browser-smoke asserts the row; ocr-baseline adds a regression assertion that
  modelReview.ocr survives the default repair path

npm test (28) passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ification

Bump version 2.2.0 -> 2.3.0 across package.json / tauri.conf.json / Cargo.toml
/ Cargo.lock (desktop-shell-test enforces the sync).

- README: refreshed capabilities (real PP-OCRv5 OCR with orientation/skew/
  denoise/structure/quality, LaTeX rendering, three-layer verification),
  roadmap, limitations, test/sample sections; badge -> 2.3.0
- CHANGELOG: [2.3.0] entry covering the OCR pipeline, LaTeX, verification,
  Repair Engine + model-cache, sample generator, the four latent-bug fixes,
  and the PP-OCRv5 direction pivot
- RELEASE_NOTES_v2.3.0.md added
- DEVELOPMENT_TASKS acceptance command -> trans2former-2.3.0
- gitignore the tesseract vendor too (reproducible, like onnxruntime/paddleocr)

npm test (28), git diff --check, release:prepare (release/trans2former-2.3.0,
git-ignored) all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lity, OCR blockId, reproducible models

- sanitize tesseract vendor bundles (rewrite CDN defaults to same-origin, drop .map, post-assert) so local-security-test passes
- security-center: mark PP-OCRv5 vendor-ready before ensureProbe so imported det+rec flips to available
- make cls (direction classifier) optional across availability gates (runtime/pipeline already tolerated null cls)
- assign stable ocr-block ids at append time + map OCR lines to blocks by trimmed-text containment so low-confidence repair targets resolve (png + scan-pdf); drop the never-incrementing scan-pdf loop
- add reproducible 'npm run vendor:paddle' (pinned ppu-paddle-ocr-models + committed SHA-256 manifest), wired into release:prepare
- parseCharDictionary: don't double-append space when dict already ends with one (fixes off-by-one vs ppu dict); never trim() so U+3000 token is preserved
- resource-budget: correct public/vendor budget to account for tesseract (~30MB) it previously omitted
- align README/.gitignore/PP_OCRV5_BROWSER_VERIFICATION docs to bundled+reproducible policy; update/extend tests

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 31, 2026 05:38

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 23cd73a68d

export const PADDLE_OCR_MODEL_FILES = Object.freeze(["det.onnx", "cls.onnx", "rec.onnx"]);
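// Required set: det (DB detection) + rec (CTC recognition). cls (direction classifier) is optional:
// the pipeline already tolerates its absence at runtime (180° correction is skipped when
// clsSession is null), so it is not part of the availability gate.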
export const PADDLE_OCR_REQUIRED_FILES = Object.freeze(["det.onnx", "rec.onnx"]);


P1: Require the PP-OCR dictionary before advertising readiness

When PP-OCRv5 models are supplied through the Security Center, the UI only offers det.onnx, cls.onnx, and rec.onnx, and this readiness check marks the engine available after just det+rec. In that manual-import path, or if the bundled dict.txt fetch fails, recognize() falls back to dictionary = [], so ctcGreedyDecode cannot map any non-blank class to text and the OCR stage returns empty output despite the engine being selected as ready. Include dict.txt in the required set/import flow or fail readiness until it is present.
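
One way to act on this suggestion is to add the dictionary to the readiness gate, as in the sketch below. The constant and function names are hypothetical and the cache is modelled as a plain list of file names; the project's actual availability check is only partly visible in the snippet above.

```javascript
// Hypothetical sketch: treat dict.txt as required alongside det/rec; cls stays optional.
const REQUIRED_WITH_DICT = Object.freeze(["det.onnx", "rec.onnx", "dict.txt"]);

// Returns the required files that are not yet in the local model cache.
function missingPaddleFiles(cachedFileNames) {
  const present = new Set(cachedFileNames);
  return REQUIRED_WITH_DICT.filter((file) => !present.has(file));
}

// Only advertise readiness when nothing required is missing.
function isPaddleReady(cachedFileNames) {
  return missingPaddleFiles(cachedFileNames).length === 0;
}

console.log(missingPaddleFiles(["det.onnx", "rec.onnx"])); // the dictionary is reported as missing
console.log(isPaddleReady(["det.onnx", "rec.onnx", "dict.txt"]));
```

Returning the list of missing files, rather than a bare boolean, also lets the import UI tell the user exactly which file to add next.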


Comment thread public/security-center.js
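// Set vendor-ready first (the user has opted into PP-OCRv5), then probe; otherwise ensureProbe
// always returns false while the vendor flag is unset and the state can never flip. The actual
// onnxruntime load is still gated in recognize(). This matches the ordering used by
// paddle-default-models.js and the tesseract import flow.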
markPaddleOcrVendorReady(true);


P2: Do not mark Paddle ready without the ONNX runtime vendor

In environments where the user imports det/rec but public/vendor/onnxruntime/ort.min.mjs is absent (the sync script explicitly skips when the optional dependency is missing, and the vendor dir is gitignored), this line still marks PP-OCR as vendor-ready; after ensureProbe() it outranks a working Tesseract engine, then recognize() fails with OCR_VENDOR_LOAD_FAILED and the OCR stage does not fall back. Keep Paddle unavailable unless the ONNX runtime bundle is actually loadable, or fall back to the next available engine on runtime-load failure.
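
The second option in this comment, falling back to the next available engine when the runtime fails to load, might be sketched like this. The engine shape and the assumption that the failure surfaces as an error with `code === "OCR_VENDOR_LOAD_FAILED"` are illustrative; the comment names the error but not how it is represented.

```javascript
// Hypothetical sketch: try available engines in priority order and fall through
// only on a vendor-load failure; any other error is rethrown unchanged.
async function recognizeWithFallback(engines, image) {
  const ordered = engines
    .filter((engine) => engine.isAvailable())
    .sort((a, b) => (b.priority ?? 0) - (a.priority ?? 0));
  let lastError = null;
  for (const engine of ordered) {
    try {
      return { engineId: engine.id, result: await engine.recognize(image) };
    } catch (error) {
      if (error.code !== "OCR_VENDOR_LOAD_FAILED") throw error;
      lastError = error; // runtime bundle missing: try the next engine
    }
  }
  throw lastError ?? new Error("no OCR engine available");
}
```

Limiting the fallback to the vendor-load error keeps genuine recognition failures visible instead of silently masking them with a lower-accuracy engine.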


@Vantalens Vantalens merged commit fe05805 into main May 31, 2026
1 check failed
@Vantalens Vantalens deleted the codex/p8-routing-release-baseline branch May 31, 2026 05:52