diff --git a/DEVELOPMENT_TASKS.md b/DEVELOPMENT_TASKS.md index d47c53a..ebd9c3a 100644 --- a/DEVELOPMENT_TASKS.md +++ b/DEVELOPMENT_TASKS.md @@ -1,6 +1,6 @@ # Trans2Former Development Tasks -最后更新:2026-05-27 +最后更新:2026-05-28 维护规则: @@ -16,10 +16,12 @@ > Trans2Former Desktop:基于 Tauri + Web-GUI 的专业级、本地优先、零上传、多格式、高质量桌面格式转换工作台。 - 当前 Web 应用继续作为转换核心和 GUI 验证底座,最终面向桌面体验。 -- 桌面形态采用 Tauri,不依赖 Office、LibreOffice、Pandoc、云端转换或 OCR/AI。 +- 桌面形态采用 Tauri,不依赖 Office、LibreOffice、Pandoc、云端转换或云端 OCR/AI;OCR、版面、表格和质量审核能力的代码核心内置,模型资源不进入默认安装包,OCR 模型按需下载到本地 model-cache。 +- 默认安装包目标体积 30–80 MB;默认包不含 GB 级模型(PaddleOCR-VL / Qwen-VL / MinerU 等高级 OCR 资源完全独立、按需获取)。 - 转换核心围绕 `input -> canonical model -> executed mapper route -> QualityReport / Warnings -> output`,避免 N×N 私有路径;兼容期保留 `DocumentModel` 外壳,但不得用它掩盖专属模型的实际损失。 -- 热门基础格式必须免下载可用;高保真、OFD、本地 OCR/layout/table 全部进入核心本地模块,不再提供插件安装。 -- 文档处理、预览、编辑和导出阶段必须禁联网。 +- 热门基础格式必须免下载可用;高保真、OFD、本地 OCR/layout/table、Repair Engine 等能力作为核心内置模块演进,不再提供插件安装;模型资源按需下载。 +- 转换后检验作为核心差异化能力,三层组合:规则 diff + SSIM 视觉对比 + OCR 回读,统一写入 QualityReport。 +- 文档处理、预览、编辑和导出阶段必须禁联网;OCR 模型仅在用户首次启用时下载,下载完成后所有识别在本机执行。 ## 阶段状态 @@ -36,15 +38,26 @@ | P7-B 跨平台发布与签名 | 待启动 | macOS/Linux 构建、签名/公证、更新和平台 smoke 待补 | | P8-A 多模型路由可见性基线 | 已完成(2026-05-27 校准) | RoutePlanner 路径温度与强制降级 warnings 已接入 QualityReport | | P8-B 执行型 mapper 与路径校准 | 已完成(2026-05-27) | Workbook/Semantic 稳定链已真实执行;PPTX/OFD 高风险路径按证据分级 | -| P9 质量证据升级 | 待启动 | SSIM 视觉对比框架已建立,待推进到可运行实现 | +| S2 Repair Engine 与审核数据契约 | 已完成(2026-05-28) | RepairAction 契约、规则驱动 validator/handler、复核循环和 md 自映射 round-trip 已接入 convert() | +| UI-A 三视图重构 | 已完成(2026-05-28) | Landing 入口 + Workbench 双栏 + 独立预览页 `preview.html`;hash 路由 `#/`、`#/workbench` 与 `?taskId=` 已接入 | +| S3 按需下载与本地缓存治理 | 已完成(2026-05-28) | model-cache 模块骨架、ModelManifest 契约、SHA-256 校验、状态机、UI 文案与安全中心卡片就位;P9-A 接入即可 | +| P9-A.1 OCR 契约与占位 | 已完成(2026-05-28) | OCRResult / OCREngine 契约、placeholder engine、S3 manifest 占位、PNG reader 发 OCR_UNAVAILABLE warning | +| P9-A.2 接入轻量 OCR runtime(vendor + 骨架) | 已完成(2026-05-28) | 
tesseract.js optionalDependency + sync-tesseract-vendor + TesseractEngine 骨架 + OCRStorage 抽象 + Tauri CSP 加 wasm-unsafe-eval | +| P9-A.2.b tessdata IDB + UI 启用 + 真实 OCR 接入 | 已完成(2026-05-28) | 真实 IndexedDBStorage、安全中心导入按钮 + SHA-256、recognize 接入 tesseract.js、enhanceWithOCR helper | +| P9-A.3 PNG 异步 OCR 接入 + Repair 入口 | 已完成(2026-05-28) | convertContentAsync + runOCRStage 把 OCR 写入 SemanticDoc;detectOCRLowConfidence 进入 Repair Engine 默认 validator | +| P9-A.4 扫描 PDF OCR 检测 + Rasterizer 骨架 | 已完成(2026-05-28) | isScannedPdf 启发式 + PdfPageRasterizer 抽象 + 多页 OCR stage + convertAsync PDF 分支 | +| P9-B OCR → FixedLayoutModel + 浏览器 rasterize | 已完成(2026-05-28) | OCR 多页结果 → FixedLayoutModel(含 bbox/confidence/readingOrder)→ fixedLayoutToSemantic 派生 blocks;浏览器端 defaultPdfPageRasterizer 自动 dynamic import vendor pdfjs | +| P9-C 转换后检验三层 | 待启动 | 规则 diff、SSIM 视觉对比、OCR 回读检验统一写入 QualityReport | +| P9-D 高级 OCR | 待启动 | PaddleOCR-VL / MinerU 等大模型作为独立本地资源按需下载,明确体积、内存、降级路径 | 详细子任务和验收门槛见 [docs/archive/DEVELOPMENT_HISTORY.md](docs/archive/DEVELOPMENT_HISTORY.md)。 ## 下一步执行顺序 -1. **P9 质量证据升级**:以校准后的路径等级为基线,把 SSIM 视觉对比推进到可运行实现,补 PDF/OFD/扫描件版面恢复的公开样例和质量报告。 -2. **P7-B 跨平台发布与签名**:在转换能力表述准确后,于对应构建环境完成 macOS/Linux 安装包、签名/公证、自动更新、平台 smoke、文件关联和桌面权限体验。 -3. **发布前回归**:`npm test`、`git diff --check`、`npm run release:prepare`、release manifest ignore 验证。 +1. **P9-C 转换后检验三层**:规则 diff、SSIM 视觉对比、OCR 回读检验三层组合统一写入 QualityReport,作为项目核心差异化能力落地。 +2. **P9-D 高级 OCR**:接入 PaddleOCR-VL / MinerU 等本地解析模型;模型资源完全独立按需下载,明确体积、运行内存、降级路径和失败提示。 +3. **P7-B 跨平台发布与签名**:在转换能力表述准确后,于对应构建环境完成 macOS/Linux 安装包、签名/公证、自动更新、平台 smoke、文件关联和桌面权限体验。 +4. 
**发布前回归**:`npm test`、`git diff --check`、`npm run release:prepare`、release manifest ignore 验证。 ## P8-B 完成结果 @@ -71,6 +84,17 @@ > 仅保留最近 4 周内的记录;更早的归档到 [docs/archive/DEVELOPMENT_HISTORY.md](docs/archive/DEVELOPMENT_HISTORY.md),逐次发布的细节走 [CHANGELOG.md](CHANGELOG.md)。 +- **2026-05-28 (P9-B OCR → FixedLayoutModel + 浏览器 rasterize 真实化)**:把 OCR 结果接到第三个规范模型 FixedLayoutModel 上、把浏览器/Tauri 端 PDF rasterize 开箱即用。新增 `public/core/ocr/ocr-to-fixed-layout.js`(`ocrResultToFixedLayoutPage` 按 bbox.y → bbox.x 排序 + 携带 confidence;`mergeOCRResultsToFixedLayout` 多页合并 + `metadata.readingOrder = "heuristic-yx"` + `metadata.ocr` 总览;复用 `createFixedLayoutModel` / `createPage` / `createTextRun`)、`public/core/ocr/pdf-rasterizer-browser.js`(`createBrowserPdfPageRasterizer`:dynamic import `/vendor/pdfjs/pdf.min.mjs` + `getDocument({ data })` → `page.getViewport` → `.toDataURL("image/png")`;失败抛 `OCR_RASTERIZER_FAILED` 含 cause)。改造 `public/core/ocr/pdf-rasterizer.js`:`defaultPdfPageRasterizer` 优先级 inject → 浏览器自动 → throw `OCR_RASTERIZER_UNAVAILABLE`;首次调用自动 `import("./pdf-rasterizer-browser.js")`,Node 检测 `globalThis.document?.createElement` 缺失即放弃;`resetPdfPageRasterizer` 同时清两个缓存。改造 `public/core/ocr/scan-pdf-stage.js`:收集每页 pageResult 数组 → `mergeOCRResultsToFixedLayout` → `model.fixedLayout = fixedLayout` → `fixedLayoutToSemantic` 派生 blocks → 发 `MODEL_VISUAL_FIDELITY_LOST` + `MODEL_TEXT_ORDER_HEURISTIC` info warning → `metadata.modelReview.ocr.fixedLayout = getFixedLayoutSummary(...)`;`metadata.ocr.lines` 保留供 Repair Engine validator 使用。`public/core/models/fixed-layout.js` `createTextRun` 加 `confidence`(clamp 到 [0,1])+ `createPage` 加 `readingOrderHint`,不破坏现有 OFD / PDF reader 调用。`browser-transformer.js` 顶层 export `ocrResultToFixedLayoutPage` / `mergeOCRResultsToFixedLayout` / `READING_ORDER_HEURISTIC` / `createBrowserPdfPageRasterizer` / `MODEL_VISUAL_FIDELITY_LOST` / `MODEL_TEXT_ORDER_HEURISTIC` / `createFixedLayoutModel` / `createFixedLayoutPage` / `createFixedLayoutTextRun` / `createFixedLayoutBbox` / 
`getFixedLayoutSummary` / `fixedLayoutToSemantic`。`scripts/ocr-baseline-test.js` 扩展为 34 组断言(+`ocrResultToFixedLayoutPage` y/x 排序 + confidence 携带;`mergeOCRResultsToFixedLayout` 多页合并 + `getFixedLayoutSummary` 计数;`runScannedPdfOCRStage` stub 端到端后 `model.fixedLayout.pages.length === 2` + `MODEL_VISUAL_FIDELITY_LOST` + `MODEL_TEXT_ORDER_HEURISTIC` warning;`defaultPdfPageRasterizer` inject → auto-browser → throw 优先级)。`scripts/local-security-test.js` 把 `ocr-to-fixed-layout.js` + `pdf-rasterizer-browser.js` 加入 ALLOWED + STRICT 白名单。`scripts/local-model-direction-test.js` 守门关键词加 `ocrResultToFixedLayoutPage` / `mergeOCRResultsToFixedLayout` / `createBrowserPdfPageRasterizer` / `MODEL_TEXT_ORDER_HEURISTIC`。新 spec `docs/superpowers/specs/2026-05-28-p9b-ocr-fixedlayout-design.md`。本轮**不实现高级阅读顺序算法**(multi-column / heading detection 留给 P9-C/D;用 y → x 启发式 + warning)、**不入库真实扫描 PDF fixture**(继续用 stub)、**不动同步 convert() / PNG enhance / Repair Engine handlers / Tauri CSP / npm 依赖**。`npm test` 20 个脚本全量通过。 +- **2026-05-28 (P9-A.4 扫描 PDF OCR 检测 + Rasterizer 骨架 + 多页 stage)**:把 OCR 扩展到扫描型 PDF 路径。新增 `public/core/ocr/pdf-rasterizer.js`(`isScannedPdf(content, options)` 启发式:基于 `expandPdfContentForTextExtraction` + 检测 `PDFJS_PAYLOAD_MARKER` + 字符阈值 300;无 payload → 扫描,有 payload 但 < 阈值 → 扫描;`PdfPageRasterizer` 抽象 + `defaultPdfPageRasterizer` Node 默认抛 `OCR_RASTERIZER_UNAVAILABLE`;`setPdfPageRasterizer(impl)` / `resetPdfPageRasterizer()` 让测试注入 stub)、`public/core/ocr/scan-pdf-stage.js`(`runScannedPdfOCRStage(model, ctx)`:拿 rasterizer + engine → countPages → 循环 rasterize 每页 → engine.recognize → 把多页 paragraph blocks 顺序追加到 model + `metadata.ocr.lines` 含 `pageIndex` / `blockId` + `metadata.modelReview.ocr` 总览 pageCount/lineCount/averageConfidence/runtimeMs;错误统一注入 `OCR_ENGINE_FAILED` warning 返回原 model;可选 `options.ocr.maxScanPages`(默认 5)/`dpi`/`scanPdfThreshold`)。`public/core/format-registry.js` `convertAsync` 新增 PDF 分支:检测扫描 PDF → dynamic import `runScannedPdfOCRStage` → 注入 OCR enhancement;文本 PDF 沿用 P8-B 
既有路径。`browser-transformer.js` 顶层 export `isScannedPdf` / `runScannedPdfOCRStage` / `defaultPdfPageRasterizer` / `setPdfPageRasterizer` / `resetPdfPageRasterizer` / `OCR_RASTERIZER_UNAVAILABLE` / `OCR_RASTERIZER_FAILED`。`scripts/ocr-baseline-test.js` 扩展为 30 组断言(+isScannedPdf 对无 payload 最小 PDF 返回 scanned=true、defaultPdfPageRasterizer Node 抛 OCR_RASTERIZER_UNAVAILABLE、runScannedPdfOCRStage stub 端到端 2 页追加 + metadata.modelReview.ocr.pageCount=2、convertContentAsync PDF → txt 走 OCR 分支输出含 stub OCR 文本)。`scripts/local-security-test.js` 把 `pdf-rasterizer.js` / `scan-pdf-stage.js` 加入 ALLOWED + STRICT 白名单。`scripts/local-model-direction-test.js` 守门关键词加 `isScannedPdf` / `runScannedPdfOCRStage` / `defaultPdfPageRasterizer`。新 spec `docs/superpowers/specs/2026-05-28-p9a4-scan-pdf-ocr-design.md`。本轮**不实现真实浏览器端 pdfjs canvas 渲染**(留给 P9-B;defaultPdfPageRasterizer Node 默认抛错,浏览器/Tauri 通过 `setPdfPageRasterizer` 注入实现)、**不在仓库加扫描 PDF fixture**(用最小 PDF 头部 + stub rasterizer 覆盖代码路径)、**不动同步 convert() / PNG 异步 stage / Tauri CSP / npm 依赖**。`npm test` 20 个脚本全量通过。 +- **2026-05-28 (P9-A.3 PNG 异步 OCR 接入 + Repair Engine OCR 入口)**:把 P9-A.2.b 提供的 `enhanceWithOCR` 接到 PNG 转换链路上、把 OCR 元数据接到 Repair Engine 上。新增 `public/core/ocr/ocr-stage.js`(`runOCRStage(model, ctx)` 包一层 `enhanceWithOCR` + 注入 `OCR_ENGINE_FAILED` 兜底;提供 `getDefaultOCRLanguage`)、`public/core/ocr/ocr-validator.js`(`detectOCRLowConfidence` 从 `metadata.ocr.lines` 取 confidence < 0.55 的行生成 `replaceTextRun` 候选;每页最多 8 条;evidence 含 engineId / language / bbox / pageIndex / lineIndex)。`public/core/ocr/png-ocr.js` 的 `enhanceWithOCR` 现在把 `pages[].lines` 一一写入 `model.metadata.ocr.lines`,含 `{ pageIndex, lineIndex, text, confidence, bbox, blockId }`,让 validator 能用 blockId 反查 paragraph。`public/core/format-registry.js` 抽出 `_buildRepairCtx` 与 `_wrapWithRepairCycle` 共享 helper,新增 `async convertAsync(...)`:与 `convert()` 同样的入口校验和 prepareConversionModel;当 `options?.ocr?.enabled !== false && fromFormat === "png"` 时 await dynamic-import `runOCRStage` 注入 OCR 
enhancement;之后走同一 `_wrapWithRepairCycle`。`public/core/repair-engine.js` 的 `createDefaultRepairEngine()` 注册 `detectOCRLowConfidence`(在 DEFAULT_VALIDATORS 之后)。`public/browser-transformer.js` 顶层 export `convertContentAsync` / `runOCRStage` / `getDefaultOCRLanguage` / `detectOCRLowConfidence`。`public/app.js` 的 `convertWithWorker(payload)` 在 worker 不可用时检测 `payload.from === "png"` 改走 `convertContentAsync`,其他格式仍走 `convertContent`。`samples/png/t2f-sample.data-url.txt`(80×24 灰度 PNG,白底黑字"T2F",118 字节 base64 后约 182 字符)+ `samples/png/README.md` 说明用途与浏览器端真实 OCR 验证步骤。`scripts/ocr-baseline-test.js` 扩展为 26 组断言(+convertContentAsync 在 ocr.enabled=false 时返回 writer payload、stub engine 注册后输出文本包含 stub OCR 内容、runOCRStage 持久化 metadata.ocr.lines、detectOCRLowConfidence 对 confidence < 0.55 / >= 0.55 两种场景的行为、t2f-sample fixture 不抛错)。`scripts/local-model-direction-test.js` 守门关键词加 `convertContentAsync` / `runOCRStage` / `detectOCRLowConfidence`。新 spec `docs/superpowers/specs/2026-05-28-p9a3-async-ocr-pipeline-design.md`。本轮**不破坏现有同步 convert() / convertContent()**(所有 20 个测试脚本与 smoke-test 调用方完全不变),不做扫描 PDF,不修改 Tauri CSP,不引入新 npm 依赖,不真实跑 OCR 在 npm test(用 stub engine 覆盖代码路径)。`npm test` 20 个脚本全量通过。 +- **2026-05-28 (P9-A.2.b tessdata IDB + UI + 真实 OCR 接入)**:在 P9-A.2 vendor 骨架基础上把三块拼上。新增 `public/core/ocr/indexeddb-storage.js`(`IndexedDBStorage` 类;`trans2former-ocr-cache` 数据库,`tessdata` + `metadata` 双 object store;put 单事务原子写两个 store;错误统一抛 `OCR_STORAGE_IDB_ERROR`)、`public/core/ocr/tesseract-runtime.js`(`loadTesseractRuntime` 动态 import `/vendor/tesseract/core/tesseract.min.js`,失败抛 `OCR_VENDOR_LOAD_FAILED`;`createTesseractWorker` 用 vendor 路径 + tessdata blob URL;`runRecognize` 把 tesseract data 映射成 `OCRResult` 含 pages/lines/bbox/confidence)、`public/core/ocr/png-ocr.js`(`enhanceWithOCR(model, { engine })` 解析第一个 image asset → engine.recognize → 追加 paragraph blocks + `metadata.modelReview` 含 `summarizeOCRResult`;低置信度发 `OCR_LOW_CONFIDENCE`;engine 不可用发 `OCR_UNAVAILABLE`)。改造 `ocr-storage.js` 用 `LazyIndexedDBStorage` 
动态 import IDB 实现,Node fallback 到 InMemory。`tesseract-engine.js` recognize 接入真实链路:vendor + storage 检查 → createTesseractWorker → runRecognize → disposeWorker。安全中心 `renderModelCache` 对 tesseract 行渲染三个按钮(导入 chi_sim/eng tessdata + 清除缓存);事件委托走 `` → arrayBuffer → `sha256Hex` → `defaultOCRStorage.put` → `tesseractOCREngine.ensureProbe()` → `markTesseractVendorReady(true)` → `setStatus(STATUS_AVAILABLE)`;`[data-model-cache-status]` 区域显示 info/success/error 三级状态消息。`index.html` 加 ``;`styles.css` 加 `.model-cache-row-actions` / `.model-cache-status-message[data-level]` 视觉规则。`browser-transformer.js` 顶层 export 新 API(IndexedDBStorage / loadTesseractRuntime / createTesseractWorker / runRecognize / disposeWorker / enhanceWithOCR / OCR_VENDOR_LOAD_FAILED / TESSERACT_VENDOR_PATHS)。`scripts/ocr-baseline-test.js` 扩展到 20 组断言(+loadTesseractRuntime/recognize 实链路在 Node 抛 OCR_VENDOR_LOAD_FAILED、enhanceWithOCR no-engine + stub + low-confidence 三种场景、SHA-256 + InMemoryStorage 元数据完整)。`scripts/local-security-test.js` 把 indexeddb-storage / tesseract-runtime / png-ocr 加入 ALLOWED + STRICT 白名单。新 spec `docs/superpowers/specs/2026-05-28-p9a2b-tessdata-runtime-design.md`。本轮不挂 enhanceWithOCR 进 convert pipeline(A.3 工作)、不创建真实 PNG/tessdata fixture(npm test 用 stub engine 覆盖代码路径)、不引入新 npm 依赖。`npm test` 20 个脚本全量通过。 +- **2026-05-28 (P9-A.2 Tesseract runtime vendor + 骨架)**:在 P9-A.1 之上接入第一条真实 OCR runtime 候选 `tesseract.js`,但本轮仅完成 vendor + Engine 骨架 + CSP。新增 `scripts/sync-tesseract-vendor.js`(模仿 sync-pdfjs-vendor 风格,从 `node_modules/tesseract.js/dist/` + `node_modules/tesseract.js-core/` 同步资源到 `public/vendor/tesseract/{core,worker}/`;缺包时 exit 0 不阻塞),`public/core/ocr/ocr-storage.js`(`OCRStorage` 接口 + `InMemoryStorage` + `createIndexedDBStorage` 工厂占位 + `defaultOCRStorage` 单例;非法 key/value 抛 `OCR_STORAGE_INVALID_*`),`public/core/ocr/tesseract-engine.js`(`tesseractOCREngine` 实现 OCREngine 接口;`isAvailable` 检查 `__t2fTesseractVendorReady` 标志 + storage 中是否有 tessdata;recognize 分三阶段拒绝路径 vendor-not-ready / tessdata-missing / 
runtime-not-wired),`public/core/ocr/tesseract-bootstrap.js`(副作用 import:注册 `ocr-text.tesseract.5.0.0` manifest 到 `defaultModelCache`,status: not-downloaded;注册 tesseract engine 到 `defaultOCRRegistry`)。`package.json` 加 `tesseract.js@^5.1.1` 到 optionalDependencies、加 `vendor:tesseract` script、`release:prepare` 加入 tesseract vendor sync。`src-tauri/tauri.conf.json` CSP 加 `'wasm-unsafe-eval'` 让 wasm 实例化。`browser-transformer.js` 顶部 import tesseract-bootstrap + 全部 API export。`scripts/local-security-test.js` 把 ocr-storage / tesseract-engine / tesseract-bootstrap 加入白名单 + STRICT_LOCAL_ONLY_FILES;`isLocalVendorPdfJs` 重命名 `isLocalVendorAsset` 同时识别 `public/vendor/tesseract/**`。`scripts/ocr-baseline-test.js` 扩展为 15 组断言(+TesseractEngine 注册 + recognize 三阶段拒绝路径 + InMemoryStorage CRUD + defaultOCRStorage 实例校验)。`scripts/local-model-direction-test.js` 守门加 `TesseractEngine` / `defaultOCRStorage` / `tesseract.js`。新 spec `docs/superpowers/specs/2026-05-28-p9a2-tesseract-runtime-design.md`。本轮不实际跑 OCR / 不接入 IDB I/O / 不加 UI 启用按钮(A.2.b 工作);不引入新依赖以外的运行时 package(tesseract.js 仅 optionalDependency,缺失不阻塞)。`npm test` 20 个脚本全量通过。 +- **2026-05-28 (P9-A.1 OCR 契约与占位)**:在 S3 之上落地 OCR 转换链路的"契约 + 占位 + 接入点"。新增 5 个模块 `public/core/ocr/`:`ocr-result.js`(OCRResult 数据契约 + `createOCRResult` + `validateOCRResult` + `summarizeOCRResult`;非法字段抛 `OCR_RESULT_INVALID`)、`ocr-warnings.js`(`OCR_UNAVAILABLE` / `OCR_LOW_CONFIDENCE` / `OCR_ENGINE_FAILED` / `OCR_DEGRADED_ROUTE` 常量 + 工厂函数)、`ocr-engine.js`(`OCREngine` 接口校验 + `OCREngineRegistry` + `defaultOCRRegistry` + `pickForTask` 优先 available 路径)、`placeholder-engine.js`(永远 unavailable,`recognize` 抛 `OCR_UNAVAILABLE`)、`ocr-bootstrap.js`(副作用 import:把 placeholder engine 注册到 `defaultOCRRegistry`,对应 manifest 注册到 `defaultModelCache` 并立即设为 `STATUS_DISABLED`)。`browser-transformer.js` 顶层 import bootstrap + 全部 API export。`public/formats/png.js` 在 reader 末尾调用 `defaultOCRRegistry.pickForTask("ocr-text")?.isAvailable()`,不可用时往 `metadata.warnings` 注入 `OCR_UNAVAILABLE` info 级 warning(含 
engineId/manifestId),其余 image asset + heading 流程保留。安全中心「模型缓存」card 通过 `defaultModelCache.onChange` 自动显示「OCR 文字识别 · 占位」条目。新增 `scripts/ocr-baseline-test.js`(10 组断言:schema 常量、契约校验 happy + 错误用例、warning 工厂、Registry 合规/重复/无效注册、pickForTask 优先级、placeholder 行为、bootstrap 幂等、PNG reader 在 placeholder/stub 两种模式下的 warning 行为)并接入 `npm test`。新 spec `docs/superpowers/specs/2026-05-28-p9a-ocr-baseline-design.md` 作为 P9-A.2 接入参考。本轮不引入 Tesseract.js / 不修改 Tauri CSP / 不实际跑 OCR 推理。`npm test` 20 个脚本全量通过。 +- **2026-05-28 (S3 模型缓存基础设施)**:实现按需下载与本地缓存治理的契约、模块骨架和守门,对接 P9-A 准备好基础设施。新增 5 个模块 `public/core/model-cache/`:`manifest.js`(ModelManifest schema + `createModelManifest` + `validateModelManifest` + 4 类 task / 5 个 engine / 4 种 quantization / 3 种 fallback 常量;非法字段抛 `MODEL_MANIFEST_INVALID`)、`checksum.js`(基于 `crypto.subtle.digest` 的 `sha256Hex` + `verifyChecksum`,已验证 SHA-256("abc") 等已知向量)、`cache-paths.js`(`model-cache////` 统一目录结构 + 不安全路径拒绝)、`availability.js`(6 个状态常量 + `ModelCacheRegistry` + `defaultModelCache` 单例 + `onChange` 事件)、`ui-text.js`(4 类任务的首次启用 / 断网降级 / 清理 / 状态 / label 中文文案)。`browser-transformer.js` 顶层导出全部 API。`security-center.js` 新增 `renderModelCache`,监听 `defaultModelCache.onChange` 自动刷新;`index.html` 安全中心 dialog 内增加「模型缓存」card;`styles.css` 加 `.model-cache-card` / `.model-cache-row` / `.model-cache-status` 视觉规则。新增 `scripts/model-cache-test.js`(9 组断言:schema 常量、manifest 校验 happy + 11 条非法用例、SHA-256 known vector、verifyChecksum 大小写不敏感、cache-paths 正反解 + 不安全 fileName 拒绝、Registry register/setStatus/onChange/unsubscribe/unregister 状态机、defaultModelCache 实例校验、UI 文案完整性)并接入 `npm test`。新 spec `docs/superpowers/specs/2026-05-28-on-demand-model-cache-design.md` 作为 P9-A 接入参考。本轮不引入 npm 依赖、不实际下载、不修改 Tauri CSP、不动 Repair Engine 与转换核心。`npm test` 19 个脚本全量通过。 +- **2026-05-28 (UI-A 三视图重构)**:把单页工作台拆分为「Landing 入口 + Workbench 双栏 + 独立预览」。新增 `public/router.js`(hash 路由 `#/` 与 `#/workbench`、`openPreview` 帮助函数、Tauri/浏览器分流跳转、preview payload 通过 localStorage/sessionStorage 双层缓存 + 30 分钟 TTL)、`public/landing-view.js`(动态 hero 
/ 4 张特性卡片 / `getKnownInputFormats()` 驱动的 13 行格式矩阵 + routeClass 徽章 / 5 步工作流 / CTA,IntersectionObserver 触发 `.reveal-on-scroll` 入场动画,`prefers-reduced-motion` 关动画)、`public/styles/landing.css`(slate+teal 配色延续,新增 hero 渐变 glow + 卡片 hover 提升 + 滚动揭示)、`public/preview.html` + `public/preview.js` + `public/styles/preview.css`(独立预览页:文本类走 `renderPreviewHtml`,PDF 走 iframe,PNG 走可拖拽+滚轮缩放 ``,DOCX/XLSX/PPTX/EPUB/OFD 二进制通过 `toDocumentModel` 反解为 SemanticDoc 后 `modelToBodyHtml` 渲染;顶栏含缩放、下载、返回;ESC 返回工作台;blob URL 自动清理)。`public/index.html` 顶栏改为视图感知(`is-landing-only` / `is-workbench-only`)、landing/workbench 双 section 通过 `data-view` 切换 hidden、新增"独立预览"按钮在输出卡片头部;`public/styles.css` 加视图切换基础规则。`public/app.js` 接入 `openPreview` 并通过 `currentOutputType/Format/Mime/BlobUrl` 构造 payload,转换完成或重置时同步启用/禁用按钮。`scripts/browser-smoke-test.js` 新增对 `data-view`、`/preview.html`、`/router.js`、`/landing-view.js`、`/styles/landing.css`、`/styles/preview.css` 等的断言;`scripts/local-security-test.js` 把 `router.js` 与 `preview.js` 加入 `ALLOWED_PUBLIC_FILES`(用于本地 blob fetch 与 localStorage 短期 preview payload),并新增 `STRICT_LOCAL_ONLY_FILES` 守门——这两个文件不得含 `http(s)://` / `ws(s)://` 任何远程协议字符串。`npm test` 18 个脚本全量通过。 +- **2026-05-28 (方向调整)**:用户确认《后续开发调整结论》,逆转 S1 中「模型资源随正式安装包交付 + 安装包内置」叙事。新方向:默认安装包 30–80 MB;OCR / 版面 / 表格代码核心内置,模型资源不打包;OCR 按需本地下载到 model-cache;高级 OCR(PaddleOCR-VL / MinerU)作为独立本地资源;转换后检验提升为核心差异化能力(规则 diff + SSIM + OCR 回读)。新增 spec `docs/superpowers/specs/2026-05-28-lightweight-default-bundle-direction.md` 作为正式决策文档;旧 spec `2026-05-27-local-document-model-auto-repair-output-closure-design.md` 和 S1 plan `2026-05-28-local-model-output-closure-s1.md` 顶部加修订声明,保留历史轨迹但不再作为方向真值。`DESKTOP_APP_ARCHITECTURE`、`DESKTOP_RELEASE_PLAN`、`RESOURCE_BUDGET`、`MULTI_MODEL_ARCHITECTURE`、`PRODUCT_STRATEGY`、`CONVERSION_ROUTING`、`README`、`INSTALL` 全部同步;`local-model-direction-test` 守门关键词整体翻新(移除「安装包内置」「模型资源随安装包交付」「内置模型 manifest」等冲突 includes,新增「OCR 模型按需下载」「30–80 MB」「OCR 模型资源不进入默认安装包」「OCR 模型缓存」「默认包不含 GB 级模型」等 includes,并新增三组反义短语 patterns 防止旧方向复活)。本轮不写任何 OCR 
代码,S3 / P9-A 实施留给下一轮 plan。`npm test`、`npm run release:prepare`、`git diff --check` 通过。 +- **2026-05-28 (S2)**:落地 Repair Engine 与审核数据契约。新增 `public/core/repair-actions.js`(7 类 RepairAction 契约 + `validateRepairAction` 校验、不合法字段抛 `REPAIR_ACTION_INVALID`)、`public/core/repair-handlers.js`(`replaceTextRun` 字符串级 block 替换 + `selectFallbackRoute` 默认仅推荐、`options.repair.applyFallback=true` 才切换 writer + 5 个未实现 placeholder)、`public/core/repair-validators.js`(`detectLossyRepairHints` 从 `metadata.warnings[*].details.repairAction` 提取建议 + `detectRouteClassDegradation` 对 generated/restricted 路径建议更保守输出)、`public/core/repair-engine.js`(`RepairEngine` 类 + `defaultRepairEngine` 单例 + `runCycle` 完成 propose → apply → model-level 复核 → md/html/json/csv/txt/xml 自映射 round-trip 复核)。`ConverterRegistry.convert()` 在 write 之后挂 cycle,返回值改为 `{ ...writerOutput, quality: { qualityReport, modelReview, autoRepair, conversion } }` 的非破坏性 superset;`options.repair === false` 保留旧路径。`browser-transformer.js` 导出 `defaultRepairEngine` / `RepairEngine` / `REPAIR_ACTION_TYPES` / `createRepairAction` / `validateRepairAction` / `MIN_CONFIDENCE`。新增 `scripts/repair-engine-test.js` 覆盖契约校验、引擎单元、低置信度拒绝、handler 异常捕获、md→md round-trip ok、md→html round-trip skipped、md→pptx fallback 建议、`applyFallback` 切换、`repair: false` 短路、fallback handler 失败 11 组断言。本轮 `npm test` 全量通过;Repair Engine 当前仅规则驱动,真实模型审核留给 S3 接入。 +- **2026-05-28 (S1)**:启动本地专用模型自动修复与输出闭环 S1。新增 `product-matrix-docs-test`,用 `getAllowedOutputFormats()` 约束 `CONVERSION_PATHS.md`,修正文档矩阵缺失的 `XML` 输入行和多条 `-> XML` 路径说明;新增 `local-model-direction-test`,锁定“内置本地专用模型 + Repair Engine 自动修复 + 无云端 OCR/AI + 安装包交付、按需加载、可禁用”的方向。同步更新 `DESKTOP_APP_ARCHITECTURE`、`DESKTOP_RELEASE_PLAN`、`RESOURCE_BUDGET`、`PRODUCT_STRATEGY` 和 `MULTI_MODEL_ARCHITECTURE` 的旧表述。已通过 `npm test`、`npm run release:prepare`、`git diff --check` 和 release manifest ignore 检查。本次只完成 S1 门禁和方向同步,Repair Engine、模型运行时、PNG/OFD 输出、PDF/PPTX 高保真闭环仍待后续阶段实现。 - **2026-05-27**:完成审核整改阶段、P7-A Windows 发布构建基线与 P8-B 执行型 mapper 校准。修复 Tauri/Rust 仍停在 
`2.0.0` 且无 bundle icon 导致 Windows 安装包失败的问题,新增配置门禁并实际生成 `Trans2Former_2.2.0_x64_en-US.msi` 与 `Trans2Former_2.2.0_x64-setup.exe`。`SemanticDoc <-> WorkbookModel` 已进入真实执行链并记录 `executedMappers`;`PPTX` 生成型与 `OFD -> PDF` 受限路径记录 `routeClass` 并发出 `PATH_NOT_RECOMMENDED`。修复 TXT 等纯文本导出 Markdown 时 `` 原样激活 HTML 的回归,同时保留 `- [x]` task list。核心 smoke 扩展为 46 组。 - **2026-05-26**:跨格式转换质量回归。Markdown writer 统一走 `getInlineTokens → inlinesToHtml/Markdown`,废弃旧 `inlineMarkdownToHtml` 正则兜底;脚注 `[^id]` 升级为 `footnoteRef` 一等公民 inline 节点(HTML/MD/XML/DOCX vertAlign superscript/PDF 各自渲染)。修复 md→md 嵌套有序列表跳号(独立 `orderedCounter` 仅在 depth=0 递增);task list `[x]` 不再被错误转义(escape 字符集收窄到 ``\ ` * _ ~``,放过 `[]<>`)。markdown reader 合并连续 `>` 行消除 `>` 孤段;html reader 修复 `<li>` 嵌套 `<ul>`/`<ol>` 被展平为 inline 文本,新增 70+ HTML 命名实体表。`npm test` 全套通过;DOCX/PDF/XLSX 二进制输出经字节级验证。 - **2026-05-25**:修复用户回归报告三处问题(HTML 分级丢失 / 预览尺寸错乱 / 前端风格偏老旧)。DOCX reader 新增 `parseHeadingStyleMap` 多路兜底识别中英文 `Heading 1`/`标题 1`;`.viewer-card` 从 grid 改 flex column + `min-height: 0` 让预览/textarea 自适应;`.preview-markdown` 补 h4/h5/h6 字号梯度,h1/h2 加 border-bottom;色板换为 slate + teal,Inter `font-feature-settings: cv11/ss01/ss03`。补齐 P1 残留:PDF 输出回填 `autoLinkifySegments` 让纯文本 URL 生成 `/Annots`;`***粗斜体***` / `___...___` 优先识别为 strong×em 嵌套;XML inline 断言更新为结构化输出。 diff --git a/INSTALL.md b/INSTALL.md index 6126376..57c1da8 100644 --- a/INSTALL.md +++ b/INSTALL.md @@ -97,9 +97,9 @@ release/trans2former-2.0.0/RELEASE_MANIFEST.json ## 当前限制 1. PDF 当前支持程序化二进制输出,不再依赖浏览器打印作为主要路径。 -2. DOCX、XLSX、EPUB、PDF text extraction、PPTX input 已完成 P3;DOCX/PDF output 已完成 P4/P6 基线;OFD 和本地 OCR 增强进入核心本地路线。 -3. 不需要安装 Office、LibreOffice、Pandoc、Playwright 或桌面壳程序。 -4. 不提供远程 OCR、远程转写、远程 AI 增强或云端文档处理;本地模型和 OFD 高保真渲染必须保持本地、可删除、可禁用、可回滚。 +2. DOCX、XLSX、EPUB、PDF text extraction、PPTX input 已完成 P3;DOCX/PDF output 已完成 P4/P6 基线;OFD 和本地 OCR 增强代码核心内置,OCR 模型资源不预装,首次启用 OCR 时本地下载到 model-cache。 +3. 不需要安装 Office、LibreOffice、Pandoc、Playwright 或桌面壳程序。默认安装包目标 30–80 MB,不内置 PaddleOCR-VL / Qwen-VL / MinerU 等 GB 级模型。 +4.
不提供远程 OCR、远程转写、远程 AI 增强或云端文档处理;OCR 模型与 OFD 高保真渲染必须保持本地执行、可删除、可禁用、可回滚;高级 OCR 资源(PaddleOCR-VL / MinerU 等)作为独立本地资源按需获取。 ## 升级 diff --git a/README.md b/README.md index 027dbba..e2e0cee 100644 --- a/README.md +++ b/README.md @@ -17,7 +17,7 @@ Trans2Former 是一个专业级的桌面文档转换工具,支持 12 种输入 - 📦 **零依赖** - 不需要安装 Office、LibreOffice 或 Pandoc - 🎨 **实时预览** - 转换前后实时预览文档 - 📝 **结构化编辑** - 支持编辑转换后的文档结构 -- 🧩 **核心内置增强** - OFD、OCR、版面分析等能力进入核心本地路线 +- 🧩 **核心内置增强** - OFD、OCR、版面分析等能力代码核心内置;OCR 模型资源按需本地下载到 model-cache,不进入默认安装包 - 🌍 **多语言** - 支持中英文、RTL 文本等 - ⚡ **无大小限制** - 不设置人为文件大小上限 @@ -127,12 +127,14 @@ Trans2Former/ ## 🧩 核心本地增强 -Trans2Former 不再提供插件安装模式,增强能力直接并入核心本地模块: +Trans2Former 不再提供插件安装模式,增强能力代码直接并入核心本地模块;默认安装包目标 30–80 MB,不内置 GB 级模型,相关模型资源按需本地下载到 model-cache: - **OFD 支持** - 政务格式支持 -- **本地 OCR** - 扫描文档识别 +- **本地 OCR** - 扫描文档识别(首次启用时下载本地 OCR 模型到 model-cache,识别全程本机执行) - **版面分析** - 复杂布局识别 - **表格恢复** - PDF 表格提取 +- **转换后检验** - 规则 diff + SSIM 视觉对比 + OCR 回读三层组合写入 QualityReport +- **高级 OCR**(规划中)- PaddleOCR-VL / MinerU 等大模型作为独立本地资源按需获取 这些能力不通过插件包分发;后续实现必须继续保持本地执行、无上传、可解释降级和资源预算约束。 diff --git a/docs/CONVERSION_PATHS.md b/docs/CONVERSION_PATHS.md index 7f64b71..8742f16 100644 --- a/docs/CONVERSION_PATHS.md +++ b/docs/CONVERSION_PATHS.md @@ -18,16 +18,17 @@ Trans2Former 区分两件事: | --- | --- | --- | | Markdown | Markdown、HTML、TXT、JSON、CSV、XML、DOCX、XLSX、PDF、EPUB、PPTX | 作为结构化轻量源,可导出文档、网页、表格和演示;文档到图片输出等待真实视觉渲染器。 | | HTML | Markdown、HTML、TXT、JSON、CSV、XML、DOCX、XLSX、PDF、EPUB、PPTX | 作为结构化轻量源,可转为常见发布和办公格式。 | -| TXT | Markdown、HTML、TXT、JSON、DOCX、PDF、EPUB | 纯文本不直接导出表格或演示,避免误导。 | +| TXT | Markdown、HTML、TXT、JSON、XML、DOCX、PDF、EPUB | 纯文本不直接导出表格或演示,XML 输出仅表达可读文本结构,避免误导为复杂标记恢复。 | | JSON | Markdown、HTML、TXT、JSON、CSV、XML、DOCX、XLSX、PDF、EPUB、PPTX | 仅当 JSON 可进入 DocumentModel 时适合多目标导出。 | -| CSV | Markdown、CSV、XLSX、HTML、TXT、JSON、PDF | 表格源只提供表格、网页、文本和 PDF 路径。 | -| XLSX | Markdown、CSV、XLSX、HTML、TXT、JSON、PDF | 表格源不提供 PPTX/DOCX 等不可靠跨类型输出。 | -| DOC / DOCX | Markdown、HTML、TXT、JSON、DOCX、PDF | 文档源不直接转 PPTX,避免把正文文档错误包装成演示稿。 | -| EPUB | 
Markdown、HTML、TXT、JSON、DOCX、PDF、EPUB | 电子书源保留文档和发布路径。 | -| PDF | Markdown、HTML、TXT、JSON、DOCX、PDF | 当前主要是文本型 PDF 抽取,不提供表格/演示高保真输出。 | -| PPTX | Markdown、HTML、TXT、JSON、PDF、PPTX | 演示源可抽取为文档;PPTX 写出仅为重新生成的基础演示,不是原稿保真写回。 | +| XML | Markdown、HTML、TXT、JSON、XML、PDF | XML 源保留可读结构、原始结构化表达和基础发布路径,不自动推断办公专属模型。 | +| CSV | Markdown、CSV、XLSX、HTML、TXT、JSON、XML、PDF | 表格源只提供表格、网页、文本、XML 和 PDF 路径。 | +| XLSX | Markdown、CSV、XLSX、HTML、TXT、JSON、XML、PDF | 表格源不提供 PPTX/DOCX 等不可靠跨类型输出。 | +| DOC / DOCX | Markdown、HTML、TXT、JSON、XML、DOCX、PDF | 文档源不直接转 PPTX,避免把正文文档错误包装成演示稿。 | +| EPUB | Markdown、HTML、TXT、JSON、XML、DOCX、PDF、EPUB | 电子书源保留文档和发布路径。 | +| PDF | Markdown、HTML、TXT、JSON、XML、DOCX、PDF | 当前主要是文本型 PDF 抽取,不提供表格/演示高保真输出。 | +| PPTX | Markdown、HTML、TXT、JSON、XML、PDF、PPTX | 演示源可抽取为文档;PPTX 写出仅为重新生成的基础演示,不是原稿保真写回。 | | PNG | HTML、TXT、JSON、PDF | 图片源进入资产/预览路径,OCR 和图片重渲染进入核心本地增强路线。 | -| OFD | Markdown、HTML、TXT、JSON、PDF | 核心包提供 L0 级本地解析路径,高保真继续并入核心本地增强。 | +| OFD | Markdown、HTML、TXT、JSON、XML、PDF | 核心包提供 L0 级本地解析路径,高保真继续并入核心本地增强。 | ## 路径分级 diff --git a/docs/CONVERSION_ROUTING.md b/docs/CONVERSION_ROUTING.md index 66e0e7e..3dd0fcb 100644 --- a/docs/CONVERSION_ROUTING.md +++ b/docs/CONVERSION_ROUTING.md @@ -125,6 +125,31 @@ P8-B 已引入真实 mapper 执行证据字段 `executedMappers` 并接入 `Sema - 新增格式时必须声明 reader/writer 的真实模型能力,并在新增推荐路径时同步产品矩阵与测试。 - [scripts/conversion-capability-audit-test.js](../scripts/conversion-capability-audit-test.js) 用 Planner 输出做矩阵稳定性断言 +## Repair Cycle(S2 落地) + +S2 在 `ConverterRegistry.convert()` 内、writer 输出之后增加 Repair Cycle,让转换从「读 → 路由 → 写」扩展为「读 → 路由 → 写 → 复核 → 必要时修复 → 再复核 → 输出」。完整步骤: + +1. **propose**:`defaultRepairEngine.proposeActions(model, ctx)` 跑两个默认 validator —— `detectLossyRepairHints` 提取 `metadata.warnings[*].details.repairAction` 中的结构化建议;`detectRouteClassDegradation` 对 `routeClass ∈ {generated, restricted}` 路径建议更保守的输出。 +2. 
**apply**:对每个候选 action 校验 `confidence ≥ MIN_CONFIDENCE (0.6)` 后查表派发到 handler。`replaceTextRun` 改写 SemanticDoc block 字符串字段;`selectFallbackRoute` 默认仅写入 `autoRepair.recommendations`,需 `options.repair.applyFallback === true` 才真正切换 writer 并替换 output;其余 5 类动作目前注册为占位 reject。 +3. **reverify (model-level)**:cycle 用 `ensureDocumentAudit` 重算 `qualityReport`,要求 `warningCount` / `downgradeCount` 不超过 repair 之前。 +4. **reverify (round-trip)**:当 `from === to` 且格式属于 `md/html/json/csv/txt/xml` 自映射白名单时,cycle 把 writer 输出反读回模型并比对指纹;非白名单路径写 `roundTripDelta.skipped = "format-not-round-trip-safe"`,不阻塞输出。 +5. **finalDecision**:`verified`(无动作或复核通过)/ `degraded`(候选全被 reject)/ `failed-quality-gate`(apply 后复核失败)写入 `autoRepair.finalDecision` 和 `qualityReport.finalDecision`。 + +`convert()` 返回结构变为 `{ ...writerOutput, quality: { qualityReport, modelReview, autoRepair, conversion } }` 的非破坏性 superset。`options.repair === false` 跳过整个 cycle 并返回原 writer payload,给 legacy 测试和未来需要纯净输出的调用方留口子。S2 阶段所有 validator/handler 仍是规则驱动,真实模型审核由 S3 接入。 + +P9-A.3 之后,PNG 输入路径建议走 `convertContentAsync(payload)` —— 在 `prepareConversionModel` 与 `write` 之间插入 `runOCRStage(model, ctx)`,让 OCR 识别文本作为 paragraph blocks 写入 SemanticDoc;之后走同一 `_wrapWithRepairCycle` helper。Repair Engine 自动激活 `detectOCRLowConfidence` validator,把 `metadata.ocr.lines` 中 confidence < 0.55 的行转成 `replaceTextRun` 候选 action。同步 `convert()` 保留不变以保证现有调用方与 18+ 测试脚本零回归。 + +P9-A.4 之后,扫描 PDF 输入路径在 `convertAsync` 的 PDF 分支自动检测(`isScannedPdf(content)`)并走 `runScannedPdfOCRStage(model, ctx)`:rasterizer 渲染每页 → enhanceWithOCR → 多页合并;文本 PDF 沿用 P8-B 既有文本提取路径,不动。 + +P9-B 之后,扫描 PDF 路径的输出携带完整 FixedLayoutModel + SemanticDoc 双视图:`mergeOCRResultsToFixedLayout` 把每页 OCR 结果按 bbox.y → bbox.x 排序合并 + 携带 confidence,写入 `model.fixedLayout`;`fixedLayoutToSemantic` 派生 `model.blocks`;发 `MODEL_VISUAL_FIDELITY_LOST` + `MODEL_TEXT_ORDER_HEURISTIC` info warning。`defaultPdfPageRasterizer` 现在在浏览器/Tauri 自动 dynamic import `pdf-rasterizer-browser.js` + vendor pdfjs + canvas.toDataURL;Node 测试仍用 stub。 + ## 
后续增强 -P8-B 的首批执行链已闭环。后续由 P9 补齐 `slide` / `fixedLayout` 路径的 writer 能力、公开 fixture 与视觉质量证据;在证据成立前不将规划边提升为实际 mapper 执行。 +P8-B 的首批执行链已闭环。后续按 2026-05-28 [lightweight-default-bundle-direction](superpowers/specs/2026-05-28-lightweight-default-bundle-direction.md) 决策推进 P9-A/B/C/D: + +- **P9-A OCR 基线**:PNG 与扫描 PDF 接入轻量 OCR(Tesseract.js / 轻量 PaddleOCR),OCR 模型资源按需下载到 model-cache,不进入默认安装包。 +- **P9-B OCR → FixedLayoutModel**:OCRResult → FixedLayoutModel → SemanticDoc,保留 bbox / confidence / page index / reading order,让 Repair Engine 的 `fixedLayoutToSemantic` 路径获得真实证据。 +- **P9-C 转换后检验**:规则 diff + SSIM 视觉对比 + OCR 回读三层组合统一写入 QualityReport,作为核心差异化能力提升。 +- **P9-D 高级 OCR**:PaddleOCR-VL / MinerU 等大模型作为独立本地资源按需下载,明确体积、运行内存、降级路径。 + +在 P9-A 启动前必须先完成 **S3 按需下载与本地缓存治理**:定义 model-cache 目录结构、manifest、checksum、可清理入口、断网降级提示和首次启用下载提示流程。`slide` / `fixedLayout` 路径的 writer 能力和视觉质量证据在 P9-B/C 推进过程中补齐;在证据成立前不将规划边提升为实际 mapper 执行。 diff --git a/docs/DESKTOP_APP_ARCHITECTURE.md b/docs/DESKTOP_APP_ARCHITECTURE.md index a64127c..03e5042 100644 --- a/docs/DESKTOP_APP_ARCHITECTURE.md +++ b/docs/DESKTOP_APP_ARCHITECTURE.md @@ -14,8 +14,12 @@ Tauri desktop shell + TypeScript conversion core + Web Worker / WASM + core local enhancement modules ++ on-demand local OCR/layout model resources (downloaded to model-cache on first enable) ++ Repair Engine ``` +默认安装包目标体积 30–80 MB;默认包不含 PaddleOCR-VL / Qwen-VL / MinerU 等 GB 级模型。OCR / 版面 / 表格能力的代码核心内置,模型资源仅在用户首次启用对应能力时下载到本地 model-cache,可清理、可禁用,处理阶段不联网、不上传任何文档内容。 + 当前浏览器 Web 应用继续作为 GUI 和转换核心验证底座。桌面化不是推翻现有实现,而是把现有 Web-GUI 放入 Tauri 壳,并补齐文件系统、版本历史和桌面权限隔离。 ## 选择判断 @@ -28,7 +32,7 @@ Tauri + Web-GUI 是当前最合理的桌面路线。 - 高体验上限:Web-GUI 更适合做现代工作台、分栏编辑、实时预览、虚拟滚动、质量报告和 diff。 - 本地安全:Tauri 的能力边界更适合做最小文件系统权限和显式目录授权。 - 技术连续性:当前浏览器转换核心、Worker、DocumentModel、前端工作台和测试可以继续复用。 -- 核心模块友好:重格式、高保真渲染器、本地 OCR/OFD 可以本地按需加载,不必进入首屏启动路径。 +- 核心模块友好:重格式、高保真渲染器、本地 OCR/OFD、文档图像文字专用模型和 Repair Engine 代码核心内置,模型资源按需本地下载到 model-cache,不必进入首屏启动路径。 主要风险: @@ -49,7 +53,7 @@ Tauri + Web-GUI 是当前最合理的桌面路线。 - 
可恢复:转换失败、重能力失败、预览失败时,用户的输入、编辑和输出状态不丢。 - 信息密度合理:主界面服务操作,不堆说明文案;复杂解释放入 warnings、quality report 和详情面板。 - 安全可见:local-only、no-upload、核心本地处理、历史保存状态在关键位置可见。 -- 资源可控:默认核心轻,重格式和模型本地按需加载,可禁用、清理、回滚。 +- 资源可控:默认安装包轻量(目标 30–80 MB),重格式与 OCR 等增强模块代码核心内置;OCR 模型资源不进入默认安装包,首次启用时下载到本地 model-cache,可禁用、可清理、可回滚。 - 数据可控:版本历史默认只在会话内;持久保存必须 opt-in,并提供清除入口。 ## GUI 边界 @@ -86,6 +90,7 @@ Tauri + Web-GUI 是当前最合理的桌面路线。 | QualityReportPanel | 结构保真、视觉保真、资源保真、可读性和 chunked 等价 | | VersionHistory | session undo/redo、checkpoint、diff、持久历史 opt-in | | SecurityCenter | local-only 状态、权限、缓存、历史和清除入口 | +| Repair Engine | 执行已注册的结构化修复动作,修复后重新写出并复核,不让用户承担质量修复 | ## 安全模式 @@ -126,7 +131,9 @@ Tauri + Web-GUI 是当前最合理的桌面路线。 | 50MB+ 文件 | Worker + 渐进预览 | | 100MB+ 文件 | 分块 / 降级预览 | | 重格式 | 核心本地按需加载 | -| 本地模型 | 手动安装、手动启用 | +| 默认安装包 | 30–80 MB,不打包 GB 级模型 | +| OCR 模型 | 不进入默认安装包;首次启用时本地下载到 model-cache,可清理、可禁用 | +| 高级 OCR(PaddleOCR-VL / MinerU 等) | 独立本地资源,按需获取,明确体积与硬件要求 | ## 当前实现 diff --git a/docs/DESKTOP_RELEASE_PLAN.md b/docs/DESKTOP_RELEASE_PLAN.md index 9bc7552..55a0513 100644 --- a/docs/DESKTOP_RELEASE_PLAN.md +++ b/docs/DESKTOP_RELEASE_PLAN.md @@ -48,7 +48,9 @@ Trans2Former__checksums.sha256 - 应用可启动到主窗口。 - Markdown -> HTML、TXT -> PDF、CSV -> XLSX 基础路径可转换并下载。 - OFD、PNG/image、PDF、OOXML 等核心本地能力在无插件安装入口的情况下可见。 -- 文档处理阶段不发起网络请求。 +- 默认安装包体积控制在 30–80 MB;不内置 PaddleOCR-VL / Qwen-VL / MinerU 等 GB 级模型。 +- OCR 启用后必须可触发首次本地下载,并展示 manifest、checksum、缓存路径、分项体积报告、可清理入口和断网降级提示。 +- 文档处理阶段不发起网络请求;OCR 模型下载是显式动作,仅在用户首次启用 OCR 时联网,下载完成后所有识别在本机执行。 - 关闭再打开后不会自动恢复用户文档,除非用户显式开启历史持久化。 ## 文件关联和权限 @@ -71,7 +73,11 @@ Trans2Former__checksums.sha256 - 新增格式转换能力优先并入核心本地模块,必要时通过本地 worker、vendor 或 WASM 按需加载。 - Release 包不得包含 `plugin-patches` 或 `.t2f-plugin.json`。 - OFD、OCR、版面分析、表格恢复等能力必须声明资源预算、fallback 和兼容说明。 -- 本地模型资源必须手动安装、可删除、可禁用,不得上传数据。 +- OCR 模型资源不进入默认安装包;首次启用时本地下载到 model-cache,必须提供 manifest、checksum、缓存路径、可清理入口、体积报告、断网降级提示和失败 fallback,处理过程不上传任何文档内容。 +- `release:prepare` 必须依次执行 `sync-pdfjs-vendor` 与 `sync-tesseract-vendor`;后者在 `tesseract.js` optionalDependency 
缺失时退出 0,不阻塞 CI/发布流程。 +- Tauri CSP 必须保留 `'wasm-unsafe-eval'`(让本地 tesseract.js wasm 在 WebView 中可实例化),且 `connect-src 'self'` 不可放开 —— 模型资源仅同源 vendor 与本地 IndexedDB,禁止任何远程 URL。 +- 高级 OCR 资源(PaddleOCR-VL / MinerU 等大模型)作为独立本地资源按需获取,启用前展示体积、运行内存、降级路径和失败提示。 +- 转换后检验三层(规则 diff、SSIM 视觉对比、OCR 回读)必须可在断网状态运行,验证 Repair Engine 修复后的输出质量并写入 QualityReport。 - 文档处理模式始终禁止网络访问。 ## 当前 P7 边界 diff --git a/docs/MULTI_MODEL_ARCHITECTURE.md b/docs/MULTI_MODEL_ARCHITECTURE.md index 9d89349..6d691fa 100644 --- a/docs/MULTI_MODEL_ARCHITECTURE.md +++ b/docs/MULTI_MODEL_ARCHITECTURE.md @@ -11,7 +11,8 @@ Trans2Former 的转换核心从"单一 `DocumentModel`"升级为**五个并列 - **规范模型不是文件**:是 reader/writer 之间的内存对象,不存盘、不序列化为 docx/html 文件作为中转。 - **不写 N×N 直连**:reader 输出某个规范模型,writer 消费某个规范模型,跨模型走显式 mapper。 - **降级显式可见**:跨模型 mapper 必发 warning,质量等级写入 `qualityReport`,不静默丢信息。 -- **本地优先不变**:所有模型在 worker / 主线程内部流转,不走网络。external engine 一律插件化,核心包不引依赖。 +- **本地优先不变**:所有规范模型在 worker / 主线程内部流转,不走网络;文档图像、文字、版面和表格能力采用核心本地内置模型(代码内置,模型资源按需下载到本地 model-cache,不进入默认安装包),不恢复插件安装路线。默认安装包目标体积 30–80 MB;默认包不含 GB 级模型。 +- **自动修复有边界**:模型只输出结构化质量问题和修复动作,Repair Engine 执行已注册动作并触发修复后复核,不允许模型直接替换文件字节。 ## 五个规范模型 @@ -137,12 +138,92 @@ createParagraph([ | `public/core/asset-store.js` | 演化为 `models/asset-graph.js`,跨模型共享 | | `public/core/format-registry.js` | 升级为 Capability Registry,详见 [CONVERSION_ROUTING.md](CONVERSION_ROUTING.md) | | `public/core/warnings.js` | 不变,仍然是 info / lossy / unsupported / security / performance 五级 | -| 核心本地重能力模块 | 后续承载 OCR / layout / sidecar engine,详见 [RESOURCE_BUDGET.md](RESOURCE_BUDGET.md) | +| 核心本地重能力模块 | 后续承载 OCR / layout / table / quality-reviewer 等核心本地内置模型(代码内置 + 模型资源按需下载到 model-cache),详见 [RESOURCE_BUDGET.md](RESOURCE_BUDGET.md) | +| Repair Engine | 执行结构化修复动作,记录 before/after、confidence、modelVersion 和修复后复核结果 | + +## Repair Engine 实现接口 + +S2 已落地为 `public/core/repair-engine.js`、`public/core/repair-actions.js`、`public/core/repair-handlers.js`、`public/core/repair-validators.js` 四个模块,并由 `ConverterRegistry.convert()` 在 writer 之后挂入运行。具体接口: 
+ +- `class RepairEngine` 提供 `registerValidator(fn)` / `registerHandler(actionType, fn)` / `proposeActions(model, ctx)` / `applyActions(...)` / `reverifyModel(...)` / `reverifyRoundTrip(...)` / `runCycle(...)`。`defaultRepairEngine` 在模块顶层注册下方默认 validator/handler。 +- `REPAIR_ACTION_TYPES` 锁定 7 类动作:`replaceTextRun`、`insertTextRun`、`reorderBlocks`、`restoreTableGrid`、`adjustBoundingBox`、`regeneratePageLayout`、`selectFallbackRoute`。`createRepairAction` 返回冻结对象;`validateRepairAction` 在缺字段、未知 `actionType` 或 `confidence ∉ [0,1]` 时抛 `ConversionError(code: "REPAIR_ACTION_INVALID")`。 +- 默认 validator 两个:`detectLossyRepairHints` 把 `metadata.warnings[*].details.repairAction` 中结构化建议提取为候选动作(让未来真实模型审核可以通过这种约定接入);`detectRouteClassDegradation` 对 `metadata.conversion.routeClass ∈ {generated, restricted}` 路径建议更保守的输出格式。 +- 默认 handler 两个真实实现:`replaceTextRun` 在 SemanticDoc `block.text / code / items[N]` 上做字符串级替换并返回深拷贝模型;`selectFallbackRoute` 默认仅写入 `autoRepair.recommendations`,只有 `options.repair.applyFallback === true` 时才真正切换 writer 并替换 output。其余 5 类 handler 注册为占位 reject,等 S3/S4 落地。 +- `runCycle` 在 apply 完成后执行 model-level 复核(`ensureDocumentAudit` 重算 `qualityReport`,要求 `warningCount` / `downgradeCount` 不增加);当 `from === to` 且格式属于 `md/html/json/csv/txt/xml` 自映射时再跑 writer→reader→model 指纹 round-trip diff,结果写入 `metadata.autoRepair.roundTripDelta`,非白名单路径记 `skipped: "format-not-round-trip-safe"`。 +- `finalDecision ∈ {verified, degraded, failed-quality-gate}` 由 cycle 决定:无候选动作或复核通过 → `verified`;候选全被 reject → `degraded`;apply 后复核失败 → `failed-quality-gate`。 +- `convert()` 返回 `{ ...writerOutput, quality: { qualityReport, modelReview, autoRepair, conversion } }`;`options.repair === false` 跳过 cycle 并保留旧返回结构,给 legacy 测试和未来需要纯净输出的调用方留口子。 + +## Model Cache 模块(S3 落地) + +`public/core/model-cache/` 是 P9-A 即将接入的本地模型 manifest + 校验 + 状态机基础设施,详见 [docs/superpowers/specs/2026-05-28-on-demand-model-cache-design.md](superpowers/specs/2026-05-28-on-demand-model-cache-design.md)。要点: + +- `createModelManifest({ manifestId, 
task, engine, modelVersion, bundleSize, ... })` 返回冻结的 `ModelManifest`,`validateModelManifest` 在缺字段、未知 task/engine、非 SHA-256 等情况下抛 `MODEL_MANIFEST_INVALID`。 +- `MODEL_TASKS = ["ocr-text", "ocr-layout", "ocr-table", "quality-reviewer"]`;`MODEL_ENGINES = ["tesseract", "paddleocr", "paddleocr-vl", "mineru", "custom"]`。 +- `sha256Hex` / `verifyChecksum` 用 `crypto.subtle.digest`,模型导入后必须经过 SHA-256 校验才能进入 `available` 状态。 +- `defaultModelCache`(`ModelCacheRegistry` 单例)维护 `manifestId → { manifest, status, detail }` 内存状态机,状态包括 `not-downloaded / importing / verifying / available / degraded / disabled`;`onChange` 让 UI 实时刷新。 +- `getCacheDirectory({ task, engine, modelVersion })` 统一返回 `model-cache///`,绝不写入用户数据或动态时间戳。 +- 安全中心 dialog 的「模型缓存」card 自动渲染当前注册的 manifest 列表;S3 阶段无 register 调用,显示空状态。 + +## OCR Engine 模块(P9-A.1 落地) + +`public/core/ocr/` 是 P9-A.2 接入真实 OCR 运行时的契约层,详见 [docs/superpowers/specs/2026-05-28-p9a-ocr-baseline-design.md](superpowers/specs/2026-05-28-p9a-ocr-baseline-design.md)。要点: + +- `createOCRResult` / `validateOCRResult` / `summarizeOCRResult` 定义 OCRResult 数据契约(含 language / pages / lines / confidence / bbox / runtimeMs / engine / modelVersion)。 +- `OCREngine` 接口:`{ id, taskCapabilities, manifestId?, isAvailable(): boolean, recognize(): Promise }`。 +- `defaultOCRRegistry`(`OCREngineRegistry` 单例)维护已注册 engine;`pickForTask("ocr-text")` 优先返回 `isAvailable() === true` 的 engine,都不可用时 fallback 到最后一条。 +- `placeholderOCREngine` 总是 unavailable,`recognize` 抛 `OCR_UNAVAILABLE`;`ocr-bootstrap.js` 在 import 时把它注册到 registry 与 defaultModelCache(status: disabled)。 +- `OCR_UNAVAILABLE` (info) / `OCR_LOW_CONFIDENCE` (lossy) / `OCR_ENGINE_FAILED` (lossy) / `OCR_DEGRADED_ROUTE` (info) 是 OCR 链路的统一 warning 编号。 +- PNG reader 在 engine 不可用时注入 `OCR_UNAVAILABLE`,但保留 image asset + heading 流程,不阻塞 md/txt/html 输出。 +- P9-A.2 真实 engine 接入只需 `defaultOCRRegistry.register(engine)`;reader 流程、warning 编号、UI 显示路径都保持不变。 + +### Tesseract Runtime(P9-A.2 落地) + +`public/core/ocr/tesseract-engine.js` 与 
`tesseract-bootstrap.js` 提供第一条 OCR runtime 候选,详见 [docs/superpowers/specs/2026-05-28-p9a2-tesseract-runtime-design.md](superpowers/specs/2026-05-28-p9a2-tesseract-runtime-design.md)。要点: + +- `tesseractOCREngine.id = "tesseract-zh-en"`,`manifestId = "ocr-text.tesseract.5.0.0"`。 +- `isAvailable()` 同步检查 `globalThis.__t2fTesseractVendorReady` + storage 中是否有 tessdata;A.2 阶段始终返回 false(A.2.b 接入 IDB 与 UI 启用按钮后才会变 true)。 +- `recognize` 三段拒绝路径:vendor-not-ready / tessdata-missing / runtime-not-wired;A.2.b 替换 runtime-not-wired 分支为真实推理。 +- `OCRStorage` 抽象:`InMemoryStorage`(Node 测试 + 浏览器降级)+ `createIndexedDBStorage(dbName)`(A.2 占位,A.2.b 接入真实 IDB I/O)+ `defaultOCRStorage` 单例。 +- vendor 资源通过 `scripts/sync-tesseract-vendor.js` 同步到 `public/vendor/tesseract/{core,worker}/`;缺包时 exit 0 不阻塞。 +- Tauri CSP 增加 `'wasm-unsafe-eval'` 让 wasm 实例化;`connect-src 'self'` 保留,vendor 资源同源加载、tessdata 由用户本地选择文件,不联网。 + +### tessdata IndexedDB 持久化 + 安全中心启用流程(P9-A.2.b) + +- `IndexedDBStorage` (`public/core/ocr/indexeddb-storage.js`) 实现 `OCRStorage` 接口;单数据库 `trans2former-ocr-cache`,`tessdata` 存 ArrayBuffer + `metadata` 存 `{ size, sha256, updatedAt }`,put 单事务原子写两 store。 +- `createIndexedDBStorage` 现在返回 `LazyIndexedDBStorage`(dynamic import IDB 实现);Node / 无 IDB 环境继续回退 `InMemoryStorage`。 +- `loadTesseractRuntime` (`public/core/ocr/tesseract-runtime.js`) 动态 import `/vendor/tesseract/core/tesseract.min.js`;失败抛 `OCR_VENDOR_LOAD_FAILED`。`createTesseractWorker` 用 vendor 路径 + tessdata blob URL 实例化 worker;`runRecognize` 把 tesseract data 映射成 `OCRResult`。 +- `TesseractEngine.recognize` 真实接入:`loadTesseractRuntime → defaultOCRStorage.get tessdata → createTesseractWorker → runRecognize → disposeWorker`。错误统一抛 `OCR_ENGINE_FAILED` 含 cause。 +- 安全中心「模型缓存」card 对 tesseract 行渲染三个按钮:「导入 chi_sim.traineddata」/「导入 eng.traineddata」/「清除缓存」。点击导入按钮 → `` → `arrayBuffer` → `sha256Hex` → `defaultOCRStorage.put` → `tesseractOCREngine.ensureProbe()` → `markTesseractVendorReady(true)` → `setStatus(STATUS_AVAILABLE)`。 +- `enhanceWithOCR(model, { 
engine })` (`public/core/ocr/png-ocr.js`) 作为独立函数提供:找第一个 image asset → engine.recognize → 追加 paragraph blocks + 写 `metadata.modelReview` 含 `summarizeOCRResult` + 在 `metadata.ocr.lines` 持久化每条 line 的 confidence/bbox/blockId;低置信度发 `OCR_LOW_CONFIDENCE`、engine 不可用发 `OCR_UNAVAILABLE`。 + +### PNG 异步 OCR 管线 + Repair Engine OCR 入口(P9-A.3 落地) + +- `ConverterRegistry.convertAsync(payload)` + `convertContentAsync(payload)` 顶层 export:与 sync `convert()` 同入口校验,但 PNG 输入 + `options.ocr.enabled !== false` 时自动 `await runOCRStage(model, ctx)` 注入 OCR enhancement;其后走与 sync convert 共享的 `_wrapWithRepairCycle` helper。**现有 sync `convert()` / `convertContent()` 完全不变**。 +- `runOCRStage(model, ctx)` (`public/core/ocr/ocr-stage.js`) 包一层 `enhanceWithOCR`,错误兜底为 `OCR_ENGINE_FAILED` warning 返回原 model,不阻塞 writer。 +- `detectOCRLowConfidence` (`public/core/ocr/ocr-validator.js`) 现在是 `defaultRepairEngine` 的默认 validator:从 `metadata.ocr.lines` 取 confidence < 0.55 的行生成 `replaceTextRun` 候选 action,evidence 含 engineId / language / bbox / pageIndex / lineIndex;每页最多 8 条。P9-B 真模型审核接入后这些候选可以自动 apply 到 SemanticDoc。 +- workbench `convertWithWorker` 检测 PNG 输入时改走 `convertContentAsync`,让用户上传 PNG → 转 md/txt/html 时自动包含 OCR 文本(前提:已通过安全中心导入 tessdata)。 +- 真实 PNG fixture:`samples/png/t2f-sample.data-url.txt`(80×24 灰度 PNG)用于 stub 测试与浏览器端手动 OCR 验证。 + +### 扫描 PDF OCR 管线(P9-A.4 落地) + +- `isScannedPdf(content, options)` (`public/core/ocr/pdf-rasterizer.js`):基于 `expandPdfContentForTextExtraction` 检测 PDFJS_PAYLOAD 标记是否存在 + 提取字符数 < 阈值(默认 300)双重启发式;不带 payload → 扫描;带 payload 但低于阈值 → 扫描;其余 → 文本 PDF。 +- `PdfPageRasterizer` 抽象 + `defaultPdfPageRasterizer` 单例(Node 默认抛 `OCR_RASTERIZER_UNAVAILABLE`)+ `setPdfPageRasterizer(impl)` / `resetPdfPageRasterizer()` 让测试注入 stub;真实浏览器端 pdfjs canvas 渲染留给 P9-B。 +- `runScannedPdfOCRStage(model, ctx)` (`public/core/ocr/scan-pdf-stage.js`):拿 rasterizer + engine → countPages → 循环 rasterize 每页 → engine.recognize → 把多页 paragraph blocks 追加到 model + `metadata.ocr.lines` 含 `pageIndex` / `blockId` + 
`metadata.modelReview.ocr` 总览。错误统一注入 `OCR_ENGINE_FAILED` warning 返回原 model。 +- `convertContentAsync({ from: "pdf", to: "..." })` 自动检测扫描 PDF 并接 OCR stage;文本 PDF 沿用 P8-B 路径。Repair Engine 的 `detectOCRLowConfidence` validator 自动复用扫描 PDF 输出。 + +### FixedLayoutModel 接入 + 浏览器 rasterize 真实化(P9-B 落地) + +- `ocrResultToFixedLayoutPage(result, options)` (`public/core/ocr/ocr-to-fixed-layout.js`):单页 OCR result → FixedLayoutModel.page,textRuns 按 bbox.y → bbox.x 排序、携带 confidence、`readingOrderHint = "heuristic-yx"`。 +- `mergeOCRResultsToFixedLayout(results)`:多页 OCR result 合并为 FixedLayoutModel,写 `metadata.readingOrder` + `metadata.ocr` 总览(pageCount / textRunCount / averageConfidence / runtimeMs / engine / language)。 +- `runScannedPdfOCRStage` 收集每页 pageResult 后调上面构造 FixedLayoutModel,挂到 `model.fixedLayout`;再用现有 `fixedLayoutToSemantic` 派生 paragraph blocks 追加到 `model.blocks`。同时发两条 info warning:`MODEL_VISUAL_FIDELITY_LOST`(OCR 不还原版面) + `MODEL_TEXT_ORDER_HEURISTIC`(阅读顺序是粗糙启发式)。`metadata.modelReview.ocr.fixedLayout = getFixedLayoutSummary(...)` 附摘要。 +- `defaultPdfPageRasterizer` 重构为分层 fallback:`inject (setPdfPageRasterizer)` → 浏览器自动(首次调用时 dynamic import `pdf-rasterizer-browser.js`,工厂 `createBrowserPdfPageRasterizer` 内部 dynamic import `/vendor/pdfjs/pdf.min.mjs` + canvas + page.render + canvas.toDataURL)→ throw `OCR_RASTERIZER_UNAVAILABLE`。Node 测试用 stub;浏览器/Tauri 自动加载。 +- `createTextRun` 新增 `confidence` 字段(clamp 到 [0,1]);`createPage` 新增 `readingOrderHint` 字段。 +- 不实现高级阅读顺序(multi-column / heading detection)——留给 P9-C / P9-D。 ## 不做什么(明确边界) - **不引入 DOCX / HTML / PDF 文件级 pivot**:pivot 是内存对象,不是落盘文件。 -- **不在首屏核心路径引入 LibreOffice / Pandoc / OCR**:external engine 必须作为核心本地按需能力设计。 +- **不在首屏核心路径引入 LibreOffice / Pandoc / OCR**:核心本地内置模型代码可以核心打包,但模型资源必须按需下载到 model-cache,不能阻塞启动路径,也不进入默认安装包。 - **不破坏 local-only / no-network processing**:所有模型流转在浏览器内或桌面 worker 内。 - **不允许任何 mapper 静默丢信息**:跨模型必发 warning,并写入 qualityReport。 - **不强求 14×11 矩阵全可用**:UI 显示不推荐路径但加严重 warning,不"假装能用"。 diff --git a/docs/PRODUCT_STRATEGY.md 
b/docs/PRODUCT_STRATEGY.md index d50200a..4cf5090 100644 --- a/docs/PRODUCT_STRATEGY.md +++ b/docs/PRODUCT_STRATEGY.md @@ -16,6 +16,8 @@ Tauri 桌面壳 + TypeScript 转换核心 + Web Worker / WASM + 核心本地增强模块 ++ 文档图像、文字、版面和表格专用本地模型 ++ 软件自动修复 Repair Engine ``` 当前仓库仍保留浏览器 Web 应用作为转换核心和界面验证底座;后续桌面版本必须复用这套 Web-GUI 和本地转换核心,而不是引入 Electron、远程转换 API 或传统原生 GUI 重写。 @@ -27,8 +29,9 @@ Tauri 桌面壳 - 桌面 Web-GUI:Tauri 承载 Web-GUI,Web 技术负责编辑、预览、质量报告和版本视图。 - 本地转换核心:不依赖 Microsoft Office、LibreOffice、Pandoc、Electron、Playwright 或云端转换服务。 - 中间模型优先:新增格式必须走 `input -> DocumentModel -> output`,避免 `N * N` 转换路线。 -- 核心模块优先:核心包保持小而可用,热门基础格式免下载,重格式和可选能力进入核心本地按需加载路线。 -- 高保真攻坚:难格式不是回避项,OFD、PDF、Office 复杂文档和本地 OCR 是长期差异化攻坚方向。 +- 软件自动修复:质量问题由本地模型识别并由 Repair Engine 执行结构化修复,用户不承担逐项修复判断。 +- 核心模块优先:默认安装包保持轻量(目标 30–80 MB);热门基础格式免下载;重格式与 OCR 等增强能力代码核心内置,模型资源按需下载到本地 model-cache,默认包不含 GB 级模型。 +- 高保真攻坚:难格式不是回避项,OFD、PDF、Office 复杂文档和本地 OCR 是长期差异化攻坚方向;高级 OCR(PaddleOCR-VL / MinerU 等)作为独立本地资源按需获取,不与默认安装包绑定。 - 可验证交付:每个阶段都必须有样例、自动化测试、质量报告和可解释 warnings。 ## 产品壁垒 @@ -38,10 +41,11 @@ Tauri 桌面壳 - 实时预览:输入编辑、格式选择、转换 warnings、质量报告和输出预览尽量实时反馈。 - 上传文件大小无限制目标:不设置人为固定 MB/GB 上限;实际能力由用户设备资源决定。 - 动态分块不降质:单个超大文件可按语义子模块转换后合并,结果必须与直接转换语义等价。 -- 行业顶尖质量:格式适配器必须有样例、快照、降级说明、性能基准和人工可读质量准则。 +- 行业顶尖质量:格式适配器必须有样例、快照、降级说明、性能基准、软件自动修复记录和人工可读质量准则。 - OFD 高保真:OFD 是政务、公文、票据场景的战略攻坚格式,不以“能打开”为目标,而以本地高保真和可解释质量报告为目标。 - 热门格式免下载:基础包必须覆盖 Markdown、HTML、TXT、JSON、CSV、XML、常见图片、PDF text、DOCX/XLSX/PPTX/EPUB input 等高频路径。 -- 按需能力扩展:系统只为当前任务加载需要的重格式、高保真渲染器、本地模型或 OFD 核心模块,减少默认资源占用。 +- 按需能力扩展:系统只为当前任务加载需要的重格式、高保真渲染器、文档图像、文字、版面和表格专用本地模型或 OFD 核心模块,减少默认资源占用;默认包不含 GB 级模型,OCR 模型首次启用时本地下载到 model-cache。 +- 转换后检验差异化:规则 diff(页数、文本长度、表格数量、空白页、乱码等确定性检查)+ SSIM 视觉对比 + OCR 回读组成核心质量证据链,统一写入 QualityReport。 ## 体验目标 diff --git a/docs/RESOURCE_BUDGET.md b/docs/RESOURCE_BUDGET.md index 775c924..0373dcb 100644 --- a/docs/RESOURCE_BUDGET.md +++ b/docs/RESOURCE_BUDGET.md @@ -35,7 +35,7 @@ format-heavy-core - Release 的 `RELEASE_MANIFEST.json` manifest 必须记录核心本地能力、样例、预算和生成时间。 - 文档处理、预览、编辑和导出阶段必须禁联网。 -## 当前预算 
+## 轻量核心预算 - `public/core`: <= 0.25 MB - `public/formats`: <= 0.50 MB @@ -47,6 +47,22 @@ format-heavy-core 这些预算是当前阶段的护栏,不是最终能力上限。未来引入重格式时,应放入核心按需加载目录,并同步调整预算。 +## OCR 模型缓存目录预算 + +OCR / 版面 / 表格能力的代码核心内置,但**模型资源不进入默认安装包**。默认安装包目标体积 30–80 MB;模型资源仅在用户首次启用对应能力时下载到本地 model-cache 目录。 + +- 默认安装包构建后必须报告主程序、轻量依赖、空 model-cache 占位的分项体积总和,目标 30–80 MB。 +- 任何 GB 级模型(PaddleOCR-VL / Qwen-VL / MinerU 等)不得进入默认 dependencies 或安装包本体。 +- model-cache 目录必须支持:manifest 记录每个模型资产的版本、checksum、量化方式、任务范围、最低内存和 fallback;用户可见的缓存路径、清理入口、禁用入口;断网降级提示与失败 fallback。 +- 缓存包只保存推理资源,不保存训练检查点、优化器状态、标注数据、调试样本或任何用户文档内容。 +- OCR、layout、table、quality-reviewer 共享资源必须去重,避免重复下载 tokenizer、字典、字体、运行库或视觉 backbone。 +- 轻量 OCR(Tesseract.js / 轻量 PaddleOCR)与高级 OCR(PaddleOCR-VL / MinerU)使用独立缓存条目;高级 OCR 启用前展示体积、运行内存、降级路径和失败提示。 +- 具体 MB/GB 上限以首个可运行 OCR 模型构建后的质量、速度、内存测试结果确定,不沿用默认安装包预算。 + +### model-cache 目录结构 + +S3 已经落地 `public/core/model-cache/` 模块骨架。所有模型资源必须遵守统一目录约定 `model-cache////`,task ∈ `{ocr-text, ocr-layout, ocr-table, quality-reviewer}`,engine ∈ `{tesseract, paddleocr, paddleocr-vl, mineru, custom}`,由 `getCacheKey` / `getCacheDirectory` / `getCacheFilePath` 统一推导,禁止使用 `..` / 绝对路径 / 反斜杠。Manifest 强制 SHA-256 checksum,校验前状态停留在 `verifying`;校验失败进入 `degraded` 并保留可清理入口。详见 [docs/superpowers/specs/2026-05-28-on-demand-model-cache-design.md](superpowers/specs/2026-05-28-on-demand-model-cache-design.md)。 + ## 核心重能力预算原则 - 重能力依赖不得进入默认 `dependencies`,除非该能力转为 `format-basic` 并通过资源预算评审。 diff --git a/docs/superpowers/plans/2026-05-28-local-model-output-closure-s1.md b/docs/superpowers/plans/2026-05-28-local-model-output-closure-s1.md new file mode 100644 index 0000000..5a8a14c --- /dev/null +++ b/docs/superpowers/plans/2026-05-28-local-model-output-closure-s1.md @@ -0,0 +1,384 @@ +# Local Model Output Closure S1 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. 
Steps use checkbox (`- [ ]`) syntax for tracking. + +> **2026-05-28 修订声明**:S1 已于 2026-05-28 落地。其中「模型资源随正式安装包交付」「安装包内置、按需加载、可禁用」等方向描述已被同日 [../specs/2026-05-28-lightweight-default-bundle-direction.md](../specs/2026-05-28-lightweight-default-bundle-direction.md) 替换为「OCR 模型资源不进入默认安装包;首次启用时本地下载到 model-cache」。对应守门测试 `scripts/local-model-direction-test.js` 关键词已同步。本 plan 中文件改动清单与测试关键词作为历史快照保留,不再作为后续工作真值。 + +**Goal:** Establish the first development slice for local-model output closure by making the product matrix and direction docs testable truth surfaces before implementing Repair Engine or high-fidelity writers. + +**Architecture:** Keep runtime conversion behavior unchanged in S1. Add focused Node test scripts that compare `docs/CONVERSION_PATHS.md` to `getAllowedOutputFormats()` and assert the approved local-model direction appears in active docs. Then update only docs and package test wiring until these gates pass. + +**Tech Stack:** Browser ES modules, Node.js assertion scripts, existing `public/browser-transformer.js` registry exports, Markdown docs, `npm test`. + +--- + +## File Map + +| File | Responsibility in this change | +| --- | --- | +| `scripts/product-matrix-docs-test.js` | New test that parses `docs/CONVERSION_PATHS.md` and compares each documented row to `getAllowedOutputFormats()`. | +| `scripts/local-model-direction-test.js` | New test that guards the approved bundled local-model, auto-repair, and no-cloud direction across active docs. | +| `package.json` | Add both new test scripts to the main `npm test` chain. | +| `docs/CONVERSION_PATHS.md` | Sync documented rows with the current code matrix, including XML routes and an explicit XML input row. | +| `DEVELOPMENT_TASKS.md` | Replace the stale “不依赖 OCR/AI” wording with the approved no-cloud plus bundled local-model direction and add S1/S2 next-step language. | +| `docs/DESKTOP_APP_ARCHITECTURE.md` | Replace manual local-model install language with bundled, on-demand, disableable model resources. 
| +| `docs/DESKTOP_RELEASE_PLAN.md` | Replace manual model resource wording with bundled model manifest, checksum, size report, and offline smoke language. | +| `docs/RESOURCE_BUDGET.md` | Split lightweight core budget from model-enhanced desktop package budget. | +| `docs/PRODUCT_STRATEGY.md` | Make software-owned automatic repair and document-specialized local models explicit. | +| `docs/MULTI_MODEL_ARCHITECTURE.md` | Replace plugin/external-engine stale wording with core bundled local model and Repair Engine boundaries. | + +## Task 1: Add Product Matrix Documentation Gate + +**Files:** +- Create: `scripts/product-matrix-docs-test.js` +- Modify: `package.json` +- Modify: `docs/CONVERSION_PATHS.md` + +- [ ] **Step 1: Write the failing product matrix docs test** + +Create `scripts/product-matrix-docs-test.js`: + +```js +import assert from "node:assert/strict"; +import { readFile } from "node:fs/promises"; + +import { getAllowedOutputFormats } from "../public/browser-transformer.js"; + +const docs = await readFile("docs/CONVERSION_PATHS.md", "utf8"); + +const inputNameToFormats = new Map([ + ["Markdown", ["md"]], + ["HTML", ["html"]], + ["TXT", ["txt"]], + ["JSON", ["json"]], + ["XML", ["xml"]], + ["CSV", ["csv"]], + ["XLSX", ["xlsx"]], + ["DOC / DOCX", ["doc", "docx"]], + ["EPUB", ["epub"]], + ["PDF", ["pdf"]], + ["PPTX", ["pptx"]], + ["PNG", ["png"]], + ["OFD", ["ofd"]], +]); + +const outputNameToFormat = new Map([ + ["Markdown", "md"], + ["HTML", "html"], + ["TXT", "txt"], + ["JSON", "json"], + ["CSV", "csv"], + ["XML", "xml"], + ["DOCX", "docx"], + ["XLSX", "xlsx"], + ["PDF", "pdf"], + ["EPUB", "epub"], + ["PPTX", "pptx"], +]); + +function parseMatrixRows(markdown) { + const rows = new Map(); + for (const line of markdown.split(/\r?\n/)) { + if (!line.startsWith("| ")) continue; + if (line.includes("---")) continue; + const cells = line.split("|").slice(1, -1).map((cell) => cell.trim()); + if (cells.length < 3 || cells[0] === "输入") continue; + if 
(!inputNameToFormats.has(cells[0])) continue; + const outputs = cells[1] + .split("、") + .map((name) => name.trim()) + .filter(Boolean) + .map((name) => { + assert.equal(outputNameToFormat.has(name), true, `Unknown documented output name: ${name}`); + return outputNameToFormat.get(name); + }); + for (const format of inputNameToFormats.get(cells[0])) { + rows.set(format, outputs); + } + } + return rows; +} + +const documentedRows = parseMatrixRows(docs); +for (const format of inputNameToFormats.values()) { + for (const alias of format) { + assert.equal(documentedRows.has(alias), true, `docs/CONVERSION_PATHS.md must document ${alias}`); + } +} + +for (const format of documentedRows.keys()) { + assert.deepEqual( + documentedRows.get(format), + getAllowedOutputFormats(format), + `docs/CONVERSION_PATHS.md output row for ${format} must match getAllowedOutputFormats(${format})` + ); +} + +console.log("Product matrix docs test passed: CONVERSION_PATHS matches the registry product matrix."); +``` + +- [ ] **Step 2: Run the new test and confirm it fails** + +Run: + +```powershell +node scripts/product-matrix-docs-test.js +``` + +Expected: FAIL because `docs/CONVERSION_PATHS.md` lacks the `XML` input row and omits several current `-> XML` outputs. + +- [ ] **Step 3: Add the test to `package.json`** + +Update the `test` script so it includes: + +```json +"node scripts/product-matrix-docs-test.js" +``` + +Place it after `node scripts/conversion-capability-audit-test.js` so the code matrix and documented matrix fail close together. 
+ +- [ ] **Step 4: Sync `docs/CONVERSION_PATHS.md` with the current registry** + +Update the matrix rows to match: + +```text +Markdown: Markdown、HTML、TXT、JSON、CSV、XML、DOCX、XLSX、PDF、EPUB、PPTX +HTML: Markdown、HTML、TXT、JSON、CSV、XML、DOCX、XLSX、PDF、EPUB、PPTX +TXT: Markdown、HTML、TXT、JSON、XML、DOCX、PDF、EPUB +JSON: Markdown、HTML、TXT、JSON、CSV、XML、DOCX、XLSX、PDF、EPUB、PPTX +XML: Markdown、HTML、TXT、JSON、XML、PDF +CSV: Markdown、CSV、XLSX、HTML、TXT、JSON、XML、PDF +XLSX: Markdown、CSV、XLSX、HTML、TXT、JSON、XML、PDF +DOC / DOCX: Markdown、HTML、TXT、JSON、XML、DOCX、PDF +EPUB: Markdown、HTML、TXT、JSON、XML、DOCX、PDF、EPUB +PDF: Markdown、HTML、TXT、JSON、XML、DOCX、PDF +PPTX: Markdown、HTML、TXT、JSON、XML、PDF、PPTX +PNG: HTML、TXT、JSON、PDF +OFD: Markdown、HTML、TXT、JSON、XML、PDF +``` + +Keep caveats truthful: PNG remains input-only for now, OFD remains L0/restricted for high-fidelity output, and PPTX write-back remains generated. + +- [ ] **Step 5: Run the new test and confirm it passes** + +Run: + +```powershell +node scripts/product-matrix-docs-test.js +``` + +Expected: PASS with `Product matrix docs test passed`. 
+ +- [ ] **Step 6: Commit the matrix gate** + +Run: + +```powershell +git add package.json scripts/product-matrix-docs-test.js docs/CONVERSION_PATHS.md +git commit -m "test: gate product matrix docs against registry" +``` + +## Task 2: Add Local Model Direction Documentation Gate + +**Files:** +- Create: `scripts/local-model-direction-test.js` +- Modify: `package.json` +- Modify: `DEVELOPMENT_TASKS.md` +- Modify: `docs/DESKTOP_APP_ARCHITECTURE.md` +- Modify: `docs/DESKTOP_RELEASE_PLAN.md` +- Modify: `docs/RESOURCE_BUDGET.md` +- Modify: `docs/PRODUCT_STRATEGY.md` +- Modify: `docs/MULTI_MODEL_ARCHITECTURE.md` + +- [ ] **Step 1: Write the failing local-model direction test** + +Create `scripts/local-model-direction-test.js`: + +```js +import assert from "node:assert/strict"; +import { readFile } from "node:fs/promises"; + +const files = { + tasks: await readFile("DEVELOPMENT_TASKS.md", "utf8"), + desktop: await readFile("docs/DESKTOP_APP_ARCHITECTURE.md", "utf8"), + release: await readFile("docs/DESKTOP_RELEASE_PLAN.md", "utf8"), + budget: await readFile("docs/RESOURCE_BUDGET.md", "utf8"), + strategy: await readFile("docs/PRODUCT_STRATEGY.md", "utf8"), + multiModel: await readFile("docs/MULTI_MODEL_ARCHITECTURE.md", "utf8"), +}; + +function assertIncludes(fileKey, expected) { + assert.equal( + files[fileKey].includes(expected), + true, + `${fileKey} should mention: ${expected}` + ); +} + +function assertExcludes(fileKey, forbidden) { + assert.equal( + files[fileKey].includes(forbidden), + false, + `${fileKey} should no longer mention stale wording: ${forbidden}` + ); +} + +assertIncludes("tasks", "不依赖 Office、LibreOffice、Pandoc、云端转换或云端 OCR/AI"); +assertIncludes("tasks", "内置本地专用模型"); +assertIncludes("tasks", "Repair Engine"); +assertExcludes("tasks", "不依赖 Office、LibreOffice、Pandoc、云端转换或 OCR/AI"); + +assertIncludes("desktop", "安装包内置、按需加载、可禁用"); +assertIncludes("desktop", "Repair Engine"); +assertExcludes("desktop", "手动安装、手动启用"); + +assertIncludes("release", "内置模型 
manifest"); +assertIncludes("release", "模型资源随正式安装包交付"); +assertIncludes("release", "离线修复 smoke"); +assertExcludes("release", "本地模型资源必须手动安装"); + +assertIncludes("budget", "轻量核心预算"); +assertIncludes("budget", "模型增强桌面包预算"); +assertIncludes("budget", "模型资源随安装包交付"); + +assertIncludes("strategy", "软件自动修复"); +assertIncludes("strategy", "文档图像、文字、版面和表格专用本地模型"); + +assertIncludes("multiModel", "Repair Engine"); +assertIncludes("multiModel", "核心本地内置模型"); +assertExcludes("multiModel", "external engine 一律插件化"); + +console.log("Local model direction test passed: active docs match bundled local-model auto-repair direction."); +``` + +- [ ] **Step 2: Run the new test and confirm it fails** + +Run: + +```powershell +node scripts/local-model-direction-test.js +``` + +Expected: FAIL on stale manual-install or no-OCR/AI wording. + +- [ ] **Step 3: Add the test to `package.json`** + +Update the `test` script so it includes: + +```json +"node scripts/local-model-direction-test.js" +``` + +Place it near `node scripts/local-security-test.js` and `node scripts/resource-budget-test.js`. + +- [ ] **Step 4: Update active docs without claiming unfinished runtime capability** + +Make these wording changes: + +```text +DEVELOPMENT_TASKS.md: +- Replace stale no-OCR/AI wording with no-cloud OCR/AI plus bundled local specialist model direction. +- Add S1/S2 next steps for matrix truth and Repair Engine contract. + +docs/DESKTOP_APP_ARCHITECTURE.md: +- Change local model row to “安装包内置、按需加载、可禁用”. +- Add Repair Engine as an automatic repair boundary, not a user manual repair flow. + +docs/DESKTOP_RELEASE_PLAN.md: +- Replace manual model install with bundled model resources, manifest, checksum, size report, and offline repair smoke. + +docs/RESOURCE_BUDGET.md: +- Split current light core budgets from future model-enhanced desktop package budgets. +- State model resources ship with the installer but are not part of first-screen startup path. 
+ +docs/PRODUCT_STRATEGY.md: +- Add software automatic repair and document-specialized local model wording. + +docs/MULTI_MODEL_ARCHITECTURE.md: +- Replace stale pluginized external engine sentence with core bundled local-model and Repair Engine boundaries. +``` + +- [ ] **Step 5: Run the focused direction test and confirm it passes** + +Run: + +```powershell +node scripts/local-model-direction-test.js +``` + +Expected: PASS with `Local model direction test passed`. + +- [ ] **Step 6: Commit the direction gate** + +Run: + +```powershell +git add package.json scripts/local-model-direction-test.js DEVELOPMENT_TASKS.md docs/DESKTOP_APP_ARCHITECTURE.md docs/DESKTOP_RELEASE_PLAN.md docs/RESOURCE_BUDGET.md docs/PRODUCT_STRATEGY.md docs/MULTI_MODEL_ARCHITECTURE.md +git commit -m "docs: align local model auto-repair direction" +``` + +## Task 3: Run S1 Verification + +**Files:** +- Verify only. + +- [ ] **Step 1: Run focused S1 tests** + +Run: + +```powershell +node scripts/product-matrix-docs-test.js +node scripts/local-model-direction-test.js +node scripts/conversion-capability-audit-test.js +node scripts/local-security-test.js +node scripts/resource-budget-test.js +``` + +Expected: all commands pass. + +- [ ] **Step 2: Run full project test** + +Run: + +```powershell +npm test +``` + +Expected: PASS. The final output should include the existing smoke, capability, security, budget, and release readiness pass messages plus the two new S1 gates. + +- [ ] **Step 3: Run release preparation and whitespace checks** + +Run: + +```powershell +npm run release:prepare +git diff --check +git check-ignore -v release\trans2former-2.2.0\RELEASE_MANIFEST.json +``` + +Expected: + +- `npm run release:prepare` succeeds. +- `git diff --check` produces no output. +- `git check-ignore` reports that the release manifest is ignored. 
+ +- [ ] **Step 4: Update `DEVELOPMENT_TASKS.md` completion note if verification passes** + +Add a recent validation note that S1 matrix and local-model direction gates are in place, without marking Repair Engine, bundled runtime, PNG output, OFD output, PDF high-fidelity recovery, or PPTX high-fidelity recovery as complete. + +- [ ] **Step 5: Commit final task status if changed** + +Run: + +```powershell +git add DEVELOPMENT_TASKS.md +git commit -m "docs: record local model closure s1 validation" +``` + +Only run this commit if Step 4 changed `DEVELOPMENT_TASKS.md` after the Task 2 commit. + +## Self-Review + +- Spec coverage: this plan covers S1 from the approved design: matrix truth, stale doc conflict removal, bundled local-model direction, and explicit non-claiming of unfinished runtime capability. Repair Engine, model runtime, PNG/OFD writers, PDF recovery, and PPTX high-fidelity work are intentionally deferred to later plans. +- Placeholder scan: this plan contains no `TBD`, `TODO`, or unspecified test/write steps. +- Type consistency: test names and file names are consistent across tasks: `product-matrix-docs-test.js`, `local-model-direction-test.js`, `getAllowedOutputFormats()`, and active docs listed in the S1 spec. 
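As a rough illustration of the Task 1 matrix gate, the row parsing and comparison can be exercised in isolation. This is a minimal sketch under stated assumptions, not the project's actual test: `sampleDocs` is a made-up table (the real `docs/CONVERSION_PATHS.md` layout is assumed to be a three-column table with outputs separated by `、`), `stubRegistry` stands in for `getAllowedOutputFormats()`, and `findMismatches` is a hypothetical helper introduced only for this example.

```javascript
// Hypothetical sample table; the real docs/CONVERSION_PATHS.md layout is assumed.
const sampleDocs = [
  "| 输入 | 输出 | 说明 |",
  "| --- | --- | --- |",
  "| XML | Markdown、HTML、TXT、JSON、XML、PDF | sample caveat |",
  "| PNG | HTML、TXT、JSON、PDF | sample caveat |",
].join("\n");

// Trimmed copies of the name maps used by the plan's test script.
const inputNameToFormats = new Map([
  ["XML", ["xml"]],
  ["PNG", ["png"]],
]);
const outputNameToFormat = new Map([
  ["Markdown", "md"],
  ["HTML", "html"],
  ["TXT", "txt"],
  ["JSON", "json"],
  ["XML", "xml"],
  ["PDF", "pdf"],
]);

// Stand-in for getAllowedOutputFormats(); values copied from the Step 4 rows.
const stubRegistry = {
  xml: ["md", "html", "txt", "json", "xml", "pdf"],
  png: ["html", "txt", "json", "pdf"],
};

// Parse documented rows into a Map of input format -> ordered output formats.
function parseMatrixRows(markdown) {
  const rows = new Map();
  for (const line of markdown.split(/\r?\n/)) {
    // Skip non-table lines and the header separator row.
    if (!line.startsWith("| ") || line.includes("---")) continue;
    const cells = line.split("|").slice(1, -1).map((cell) => cell.trim());
    // Skip the header row and any row whose input name is not in the map.
    if (cells.length < 3 || !inputNameToFormats.has(cells[0])) continue;
    const outputs = cells[1]
      .split("、")
      .map((name) => name.trim())
      .filter(Boolean)
      .map((name) => {
        if (!outputNameToFormat.has(name)) {
          throw new Error(`Unknown documented output name: ${name}`);
        }
        return outputNameToFormat.get(name);
      });
    for (const format of inputNameToFormats.get(cells[0])) {
      rows.set(format, outputs);
    }
  }
  return rows;
}

// Hypothetical helper: list input formats whose documented outputs differ
// (in content or order) from what the registry reports.
function findMismatches(documented, registry) {
  const mismatches = [];
  for (const [format, outputs] of documented) {
    const expected = registry[format] ?? [];
    if (JSON.stringify(outputs) !== JSON.stringify(expected)) {
      mismatches.push(format);
    }
  }
  return mismatches;
}

const documentedRows = parseMatrixRows(sampleDocs);
const mismatches = findMismatches(documentedRows, stubRegistry);
console.log(
  mismatches.length === 0
    ? "sample rows match"
    : `mismatch: ${mismatches.join(", ")}`
);
```

The comparison is order-sensitive, which mirrors the `assert.deepEqual` check in the plan: a documented row that lists the right formats in a different order than the registry would still be reported as a mismatch.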
diff --git a/docs/superpowers/specs/2026-05-27-local-document-model-auto-repair-output-closure-design.md b/docs/superpowers/specs/2026-05-27-local-document-model-auto-repair-output-closure-design.md new file mode 100644 index 0000000..630e2bf --- /dev/null +++ b/docs/superpowers/specs/2026-05-27-local-document-model-auto-repair-output-closure-design.md @@ -0,0 +1,353 @@ +# Local Document Model Auto-Repair And Output Closure Design + +状态:部分被替换 +日期:2026-05-27 +适用阶段:P9 质量证据升级之后的本地模型与输出闭环主线 +前置基础:P7-A Windows 安装包构建基线、P8-B 可执行 Workbook/Semantic 路由与高风险路径分级 + +> **2026-05-28 修订声明**:本 spec 中「模型随桌面安装包内置」「正式安装包包含 document recognition and review model assets」「内置模型 manifest 随包交付」等结论已被 [2026-05-28-lightweight-default-bundle-direction.md](2026-05-28-lightweight-default-bundle-direction.md) 替换,调整为「OCR 模型资源不进入默认安装包;首次启用时本地下载到 model-cache」。Repair Engine、规范模型、自动修复闭环、QualityReport 数据契约和模型职责分层等设计仍然生效。阶段编排 S3 重新定义为「按需下载与缓存治理」;后续推进 P9-A/B/C/D(OCR 基线 → OCR→Model → 转换后检验 → 高级 OCR)。 + +## 目标 + +将 Trans2Former 从“可转换并展示降级提示”的本地桌面工作台推进为“可自动识别转换缺陷、自动修复并复核输出质量”的专业桌面软件,同时补齐当前已纳入产品范围的关键输出缺口。 + +本设计固定以下产品结论: + +- 正式产品采用 `Tauri + Web-GUI + TypeScript conversion core + local model runtime`。 +- 本地模型随桌面安装包内置,用户安装后无需单独下载模型或执行命令行配置。 +- 文档处理、识别、修复、复核与导出阶段均在本地执行,不联网、不上传文档内容或诊断数据。 +- 模型使用针对文档图像、文字、版面和表格处理的专用模型,不以内置通用聊天大模型替代转换内核。 +- 用户不承担质量修复操作;软件必须自动完成修复与修复后复核。 +- 首阶段输出闭环聚焦当前产品范围中的 `PNG 输出`、`OFD 输出`、`PDF 高保真路径` 与 `PPTX 高保真路径`,新增格式另行准入。 + +## 已确认现状与缺口 + +### 已有基础 + +| 能力 | 当前事实 | +| --- | --- | +| 桌面承载 | Tauri v2 壳和 Windows MSI/NSIS 构建基线已存在。 | +| 路由模型 | `SemanticDoc`、`WorkbookModel`、`SlideModel`、`FixedLayoutModel` 的路线已经定义。 | +| 可执行 mapper | `SemanticDoc <-> WorkbookModel` 已有执行证据,并写入 `executedMappers`。 | +| 质量载体 | `QualityReport`、block-level `warnings`、`sourceSpan`、diff/checkpoint 界面基础已存在。 | +| 安全边界 | 产品已明确 processing no-network、no-upload 和最小桌面文件权限。 | + +### 当前缺口 + +| 缺口 | 当前表现 | 本设计要求 | +| --- | --- | --- | +| 路径文档漂移 | 代码已开放多条 `-> XML` 路径,`docs/CONVERSION_PATHS.md` 未完全同步 | 实施第一批先建立单一真值矩阵并由测试约束 | +| PNG 输出 | 
仅有输入能力,无真实页面图像 writer | 建立 FixedLayout 渲染写出和视觉质量证据后开放 | +| OFD 输出 | 仅有 L0 输入占位,无 writer | 建立 OFD 固定版式 writer 与样例复核 | +| PDF 恢复 | 有读写基线,但扫描件、复杂版面和表格恢复不足 | 接入 OCR/layout/table 与自动修复闭环 | +| PPTX 保真 | 当前写出偏生成型,非 SlideModel 原稿级写回 | 建立 SlideModel 写出和页面级复核 | +| 自动修复 | 现有质量报告主要用于提示 | 增加结构化 repair action、执行器与二次复核 | +| 模型发布 | 文档仍包含“手动安装模型”与“不依赖 OCR/AI”等旧表述 | 实施时更新为安装包内置、本地按需加载的专用模型能力 | + +## 设计边界 + +### “支持所有格式”的含义 + +本设计中的全格式模型支持,是指**所有已进入产品矩阵的格式路径都能够接入统一的本地质量审核与自动修复机制**,而不是由同一个模型直接解析并生成所有文件容器。 + +- reader/writer 继续负责 Markdown、OOXML、PDF、OFD、EPUB、图像等真实文件结构的读取与写出。 +- 规范模型继续承载语义文档、工作簿、幻灯片和固定版面数据。 +- 本地专用模型负责 OCR、版面分析、表格识别、视觉/文字质量审核和结构化修复建议生成。 +- Repair Engine 负责将经过许可的修复动作确定性地应用到规范模型或目标渲染步骤。 +- 修复后必须重新写出并复核,未达门槛时不得将输出报告为高保真成功。 + +### 首阶段格式范围 + +本设计首阶段只补当前产品范围的关键闭环: + +| 能力 | 首阶段范围 | +| --- | --- | +| PNG 输出 | 文档或固定版面到真实 PNG 页面渲染输出,支持质量比对 | +| OFD 输出 | 基础固定版式写出、文字/图像对象写入与本地验收样例 | +| PDF 高保真 | 文本型与扫描型 PDF 的 OCR/layout/table 增强及复核 | +| PPTX 高保真 | SlideModel 驱动的多页写出、基础位置保持及渲染复核 | +| 现有语义/表格格式 | 接入统一审核与自动修复接口,不扩大不合理跨类型推荐路径 | + +以下格式不属于首阶段交付范围:`RTF`、`ODT`、`JPEG`、`WebP`、`SVG` 以及其他尚未进入现有产品矩阵的新增格式。它们后续必须独立完成 reader/writer、质量样例、安全预算与路径准入评审。 + +## 方案选择 + +### 采用方案:格式内核 + 专用模型 + 受控自动修复闭环 + +```text +Input File + -> Format Reader + -> Canonical Model / Renderable Page Representation + -> Format Writer Candidate Output + -> Deterministic Validators + Visual Comparison + -> Local Document Model Reviewer + -> Structured Repair Actions + -> Repair Engine + -> Regenerated Output + -> Post-Repair Validation + -> Final Output Or Controlled Failure +``` + +该方案保留现有可验证转换内核,同时把模型能力放在传统解析难以覆盖的 OCR、布局和异常识别位置。自动修复由软件承担,但只执行可追踪、可复核的结构化动作。 + +### 未采用方案 + +| 方案 | 不采用原因 | +| --- | --- | +| 仅给用户提示,由用户修复 | 不满足产品责任边界,也不适合面向普通用户的安装软件体验。 | +| 通用语言模型直接读写全部格式 | 对容器正确性、版面保真和资源体积不利,且结果难以复核。 | +| 一次性开放所有新增格式与所有跨格式输出 | 会把当前高保真缺口、模型闭环和格式扩张混在同一阶段,无法建立可信验收证据。 | + +## 架构 + +### 1. 
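Model responsibility layers
+
+The model runtime loads per task and stays off the application startup path. A shared runtime may host several specialized sub-models, but the capability interfaces must remain separate.
+
+| Module | Responsibility | Typical input | Typical output |
+| --- | --- | --- | --- |
+| `recognizer` | Image text recognition | Scanned PDF, PNG, OFD page images | text runs, confidence, bbox |
+| `layoutAnalyzer` | Layout region and reading-order recognition | Rendered page images, fixed-layout objects | region tree, reading order |
+| `tableRecoverer` | Table grid and cell recovery | Page regions, text boxes | table model, merge info |
+| `qualityReviewer` | Quality review of source versus candidate output | Canonical model, rendered pages, deterministic check results | issue list, repair actions, confidence |
+
+Models must use inference resources suited to document vision/text tasks. Training, annotation, calibration, and high-precision experiment resources are maintained by the development side; the user installer contains only the inference resources needed for release.
+
+### 2. Repair Engine
+
+The `Repair Engine` is the deterministic execution boundary between models and writers. A model must not return arbitrary scripts or replace file bytes directly; it may only propose registered repair actions.
+
+First batch of action categories:
+
+| Action | Applicable problem |
+| --- | --- |
+| `replaceTextRun` | High-confidence OCR typos, garbled text, and missing characters |
+| `insertTextRun` | Text blocks that were recognized but are missing from the output |
+| `reorderBlocks` | Wrong reading order |
+| `restoreTableGrid` | Misplaced table rows/columns, merged regions, or content |
+| `adjustBoundingBox` | Page objects that are clearly offset, clipped, or overlapping |
+| `regeneratePageLayout` | Partial re-layout of a fixed-layout page |
+| `selectFallbackRoute` | Switch to a more conservative output strategy when the current output cannot meet the threshold |
+
+Each action must include:
+
+```js
+{
+  actionType,
+  targetId,
+  before,
+  after,
+  confidence,
+  evidence,
+  modelVersion,
+  sourcePage,
+  sourceSpan
+}
+```
+
+### 3. Auto-repair strategy
+
+Users do not decide whether a repair is applied, but the software must still control the risk of mis-repair:
+
+1. A deterministic check finds a problem, or a model identifies one.
+2. The model generates structured repair actions.
+3. The Repair Engine executes only registered actions and keeps before/after snapshots.
+4. The writer regenerates the target output.
+5. The verification layer reruns structural checks, render diffs, model review, and quality scoring.
+6. The final file is output once the target format threshold is met.
+7. If the threshold is not met, the software automatically picks a conservative fallback or returns a "high-fidelity output threshold not reached" failure; users are not asked to troubleshoot by hand.
+
+To users, repair history may appear only as "automatically optimized" plus a quality-report summary; internally, audit data must still be retained for regression, paper experiments, and issue diagnosis.
+
+### 4. 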
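Integration across all formats
+
+| Format domain | Canonical model track | Auto-repair focus |
+| --- | --- | --- |
+| `md/html/txt/json/xml/docx/epub` | `SemanticDoc` | Missing text, hierarchy, inline structure, asset references |
+| `csv/xlsx` | `WorkbookModel` | Grid, merged cells, headers, loss of formula values versus display values |
+| `pptx` | `SlideModel` | Multiple pages, text boxes, image positions, layout retention |
+| `pdf/ofd/png` | `FixedLayoutModel` | OCR, reading order, coordinates, tables, page visual consistency |
+
+For cross-model output, RoutePlanner still decides the path and the loss level. Model review and the Repair Engine must not silently promote a high-loss path to a recommended one; the product matrix may raise a level or add an optional output only after the corresponding fixtures and quality gates pass.
+
+## Quality Report and Audit Data
+
+The existing `QualityReport / Warnings / Diff` remains the shared interface for users and tests. Implementation extends the quality data to separate deterministic evidence, model evidence, and repair results:
+
+```js
+metadata: {
+  modelReview: {
+    engine,
+    modelVersion,
+    checksum,
+    quantization,
+    tasks,
+    runtimeMs,
+    device,
+    inferenceMode: "local"
+  },
+  autoRepair: {
+    attempted,
+    appliedActions,
+    rejectedActions,
+    fallbackUsed,
+    postRepairVerified
+  },
+  qualityReport: {
+    structureFidelity,
+    textFidelity,
+    tableFidelity,
+    assetFidelity,
+    layoutFidelity,
+    visualFidelity,
+    deterministicChecks,
+    modelChecks,
+    repairStatus,
+    finalDecision
+  }
+}
+```
+
+The final decision distinguishes at least:
+
+- `verified`: the current format threshold was met and verification completed.
+- `degraded`: output is possible but carries explicit, unrepairable loss.
+- `failed-quality-gate`: the software cannot produce a result that meets the threshold; presenting it as a high-quality success is forbidden.
+
+## Local Release and Resource Governance
+
+### Bundled delivery
+
+Models no longer follow a manual user-install strategy. The official desktop installer contains:
+
+```text
+Trans2Former installer
+  app shell and web assets
+  deterministic conversion core
+  local inference runtime
+  document recognition and review model assets
+  model manifest and checksums
+```
+
+Runtime rules:
+
+- Starting the workbench must not load large models.
+- Models load only when a task needs OCR, layout, table, or automatic quality repair.
+- Basic lightweight format conversion still runs when models fail or are disabled, but must disclose that enhanced repair is unavailable.
+- Model resources can be replaced with application upgrades, and users should be able to disable enhanced capabilities in settings to control resource use.
+
+### Size and performance principles
+
+The user has confirmed that a very strict small installer is not a goal, but the total package must not grow without bound. The implementation stage should measure first and then set formal thresholds:
+
+- Shipped models prefer inference-export versions, excluding training checkpoints, optimizers, and unrelated resources.
+- Compare `FP16` and `INT8` versions on OCR, layout, table, and review accuracy, and prefer the smallest version that meets the quality threshold.
+- Tokenizers, dictionaries, fonts, runtime libraries, and vision backbones shared by OCR/layout/table/reviewer should be deduplicated.
+- The manifest records each model asset's size, checksum, quantization, task, minimum memory, and fallback.
+- After the Windows installer is built, a report must break down the size of the application itself, the runtime, the model assets, and the compressed total package.
+
+`500 MB` is not currently a hard cap; once the first runnable model build is complete, the formal release budget will be set from actual quality and hardware test results.
+
+## Output Matrix Closure Roadmap
+
+### 1. 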
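Establish a real matrix gate first
+
+The code product matrix, the documentation matrix, and the capability declarations have drifted. Before implementation, unify:
+
+- The product matrix in `public/core/format-registry.js`.
+- `docs/CONVERSION_PATHS.md` and `docs/FORMAT_ROADMAP.md`.
+- UI output options and capability display.
+- Matrix tests such as `scripts/conversion-capability-audit-test.js`.
+
+Matrix entries should be marked `recommended`, `generated`, `degraded`, or `restricted`, rather than just "supported/unsupported".
+
+### 2. First batch of capabilities to complete
+
+| Output direction | Capability to build | Opening condition |
+| --- | --- | --- |
+| `* -> PNG` | FixedLayout page renderer, paginated image export, visual regression | Real image writer and a stable baseline pass |
+| `* -> OFD` | OFD writer, page objects, font/image references, container validation | Fixed-layout samples and local validation pass |
+| `PDF -> PDF/DOCX/PNG` | OCR/layout/table, FixedLayout repair, render comparison | Scanned and complex-layout samples meet the threshold |
+| `PPTX -> PPTX/PDF/PNG` | SlideModel multi-page writer, element position retention, page comparison | No longer depends on the basic regeneration path |
+
+`*` denotes only sources in the product matrix that are semantically reasonable and explicitly opened; it does not mean every N x N combination is opened unconditionally.
+
+### 3. New formats come later
+
+Only after the loops above are stable will `JPEG/WebP/SVG` output and new office formats such as `RTF/ODT` be evaluated, so that new formats do not mask quality shortfalls in existing paths.
+
+## Error Handling and Degradation
+
+| Situation | Product handling |
+| --- | --- |
+| Model resource missing or checksum failure | Enhanced path unavailable; basic executable paths keep working and record an explicit warning |
+| Inference runtime fails to load | User content is not read or uploaded; return a local capability failure status and the fallback result |
+| Low-confidence auto-repair | Do not execute the action; use conservative output or return a quality-gate failure |
+| Second verification fails after repair | Discard the repair result and choose fallback or failure; do not return "high-fidelity success" |
+| No high-fidelity writer for the format | The matrix stays at generated/restricted; model review does not substitute for a writer implementation |
+
+## Testing and Acceptance
+
+### Model and repair gates
+
+- Each model task has a fixed public sample set and reproducible offline inference records.
+- OCR records character accuracy and key-field errors; layout records region/reading-order quality; table records structure recovery accuracy.
+- Auto-repair records action accuracy, mis-repair rate, post-repair quality gain, and fallback success rate on failure.
+- Every repair must prove that the post-repair second verification actually ran, not merely that suggestions were generated.
+
+### Format and visual gates
+
+- The current P9-A PDF visual gate is the starting point for visual evidence and will later extend to scanned PDF, PNG, OFD, and PPTX pages.
+- Every high-risk path promoted to recommended must have format fixtures, render comparison, a quality report, and route classification assertions.
+- The product matrix, UI options, writer registration, and documentation must stay in sync through automated tests.
+
+### Desktop and release gates
+
+- The desktop installer actually contains the model manifest, resource files, and checksums.
+- After installation on Windows, with the network disconnected, complete at least one basic conversion, one OCR/layout repair scenario, and one quality-verified output.
+- During the tested processing stage, no network access occurs and no user document content is saved into model assets or diagnostic resources.
+- The build report shows the total installer size and the model/runtime breakdown, and supports comparison when models change.
+
+## Sync Points with Existing Documentation
+
+Once this spec is approved, the implementation plan must schedule the following documentation updates; until a capability is actually implemented, unfinished paths must not be described as available:
+
+| Document | Content to adjust |
+| --- | --- |
+| `DEVELOPMENT_TASKS.md` | Register local models, auto-repair, and output closure as later stages; correct the "does not depend on OCR/AI" wording |
+| `docs/DESKTOP_APP_ARCHITECTURE.md` | Change model delivery from manual install to bundled in the official package and loaded on demand at runtime |
+| `docs/DESKTOP_RELEASE_PLAN.md` | Add the bundled model manifest, the installer size report, and an offline repair smoke test |
+| 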
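`docs/RESOURCE_BUDGET.md` | Separate the lightweight core budget from the model-enhanced desktop package budget |
+| `docs/PRODUCT_STRATEGY.md` | State the responsibilities of specialized local models and of software auto-repair |
+| `docs/CONVERSION_PATHS.md` / `docs/FORMAT_ROADMAP.md` | Sync with the real code matrix and the PNG/OFD/PDF/PPTX closure stages |
+| `docs/MULTI_MODEL_ARCHITECTURE.md` | Replace legacy plugin/external engine wording with core local built-in models and the Repair Engine |
+
+## Staged Implementation Order
+
+1. **S1 Design sync and matrix truth calibration**: resolve documentation conflicts and unify the product matrix, route classification, UI, and test truth.
+2. **S2 Repair Engine and review data contract**: implement structured repair actions, the quality report extension, post-repair second verification, and the fallback mechanism.
+3. **S3 Local model runtime container and delivery governance**: integrate the bundled model directory, manifest, checksum, on-demand loading, resource reporting, and offline security gates.
+4. **S4 FixedLayout closure**: first implement OCR, layout, table, render output, and auto-repair quality evidence for PDF/PNG/OFD.
+5. **S5 SlideModel closure**: implement PPTX multi-page high-fidelity writing, PDF/PNG render outputs, and visual verification.
+6. **S6 Release acceptance and later format admission**: verify the offline installer and model size/performance, then decide the scope of new image and office formats.
+
+## Technical Thread for Papers and Patents
+
+This design can form a stable technical proposition:
+
+> A local high-fidelity auto-repair method and system for multi-format document conversion: format parsers build a canonical intermediate model, specialized document vision-text models generate structured quality defects and repair actions, a controlled repair engine executes the repairs automatically, and structural checks plus rendering quality verification determine the final output or the degradation strategy, thereby improving cross-format conversion fidelity without cloud processing or user repair intervention.
+
+Quantifiable experimental dimensions include:
+
+- Conversion success rate and quality-grading accuracy across format paths.
+- OCR character accuracy, table structure accuracy, page visual similarity, and mis-repair rate before and after auto-repair.
+- Changes in quality, speed, memory, and installer size before and after model quantization.
+- Comparison of the model-free base path, the model review path, and the model auto-repair path.
+
+## Non-Goals
+
+- Do not make users responsible for repair decisions or manual correction.
+- Do not use cloud OCR, cloud AI, cloud conversion, or background telemetry.
+- Do not let a bundled general-purpose chat model replace format parsers, writers, or the Repair Engine.
+- Do not add every candidate format at once in this stage, and do not open an N x N conversion matrix without quality evidence.
+- Do not describe an output path that has no real writer as complete merely because a model can detect defects.
diff --git a/docs/superpowers/specs/2026-05-28-lightweight-default-bundle-direction.md b/docs/superpowers/specs/2026-05-28-lightweight-default-bundle-direction.md
new file mode 100644
index 0000000..4f52c91
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-28-lightweight-default-bundle-direction.md
@@ -0,0 +1,328 @@
+# Lightweight Default Bundle + On-Demand Model Direction
+
+Status: in effect (overrides the conflicting parts of the 2026-05-27 spec)
+Date: 2026-05-28
+Scope: default installer shape, delivery of OCR / layout / table capabilities, the post-conversion verification roadmap, P9 stage breakdown
+Prerequisites: S1 matrix truth and direction-sync gate; S2 Repair Engine and review data contract
+Superseded document: the passages on "models ship inside the desktop installer" and "the official installer contains model assets" in [../specs/2026-05-27-local-document-model-auto-repair-output-closure-design.md](2026-05-27-local-document-model-auto-repair-output-closure-design.md)
+
+## 1. Project Positioning
+
+Trans2Former's core positioning remains:
+
+> 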
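A local-first, multi-format document conversion tool. Core conversion relies on testable software algorithms; OCR, layout analysis, and post-conversion verification are core built-in enhancements that load on demand and always run on the local machine.
+
+The project does not take the cloud OCR / cloud AI / remote conversion route, and it does not bring back the plugin route.
+
+## 2. Core Principles
+
+1. **Main conversion still relies on software algorithms**
+
+   - Conversion of Markdown, HTML, TXT, JSON, CSV, XML, DOCX, XLSX, PPTX, PDF, and similar formats still relies mainly on the in-house reader / mapper / writer.
+   - An LLM / VLM must not serve as the core converter.
+   - AI / OCR handles only recognition, assisted structure recovery, and quality verification; it does not generate the final file directly.
+
+2. **Stay local-first**
+
+   - User documents, images, file names, conversion results, and error logs must not be uploaded.
+   - No remote OCR API, cloud AI API, or third-party conversion API is integrated.
+   - All OCR, layout analysis, and quality verification must run on the user's machine.
+
+3. **Abandon the plugin route**
+
+   - No more plugin install / external plugin marketplace design.
+   - OCR, OFD, layout analysis, table recovery, high-fidelity rendering, and similar capabilities evolve as core built-in modules.
+   - "Core built-in" does not mean "all bundled by default"; heavy capabilities are still enabled and loaded on demand.
+
+4. **The default installer must stay lightweight**
+
+   - The default package does not bundle GB-scale models such as PaddleOCR-VL, Qwen-VL, or MinerU.
+   - The default package contains only the main program, basic conversion capabilities, and the necessary lightweight dependencies.
+   - The target installer size is roughly 30–80 MB.
+   - OCR model resources should be downloaded separately, or configured locally, when the user first enables OCR.
+
+## 3. OCR Roadmap
+
+OCR has two tiers:
+
+### 1. Lightweight OCR mode
+
+Goals:
+
+- Image text recognition
+- Basic recognition of scanned PDFs
+- Basic Chinese and English OCR
+- OCR results to TXT / Markdown / PDF
+- Runs on an ordinary computer's CPU
+
+Implementation notes:
+
+- The first stage may use Tesseract.js or lightweight PaddleOCR / PP-OCR.
+- OCR resources stay out of the default main package; on first enablement, the user is prompted to download local OCR resources.
+- After download, models are stored in the local model-cache.
+- The recognition process must never upload documents over the network.
+
+### 2. 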
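Advanced OCR / document parsing mode
+
+Goals:
+
+- Complex scanned PDFs
+- Table recovery
+- Formula recognition
+- Multi-column layouts
+- Chart recognition
+- Reading-order recovery
+- High-quality layout analysis
+
+Implementation notes:
+
+- Later, integrate local document parsing models such as PaddleOCR-VL / MinerU.
+- These models can exceed 1 GB and must not be bundled by default.
+- When the user actively enables advanced mode, state the model size, hardware requirements, and runtime cost clearly.
+- Execution must still be local, with no remote API calls.
+
+## 4. OCR Integration Architecture
+
+OCR should not exist as an isolated standalone feature; it should enter the existing conversion pipeline. Recommended flow:
+
+```text
+PNG / scanned PDF
+  ↓
+Page Rasterizer
+  ↓
+OCR Engine
+  ↓
+Layout Recovery
+  ↓
+FixedLayoutModel
+  ↓
+fixedLayoutToSemantic
+  ↓
+Target Writer
+```
+
+OCR output should not become Markdown directly; it should first enter the fixed-layout model.
+
+Suggested new core directories:
+
+```text
+public/core/ocr/
+├── ocr-engine.js
+├── ocr-types.js
+├── image-preprocess.js
+├── light-ocr-engine.js
+├── paddleocr-bridge.js
+└── ocr-quality.js
+
+public/core/layout/
+├── layout-recovery.js
+├── reading-order.js
+├── table-recovery.js
+└── fixed-layout-builder.js
+
+public/core/verification/
+├── rule-verifier.js
+├── visual-diff.js
+├── ocr-roundtrip.js
+└── quality-score.js
+```
+
+These are core built-in module directories, not plugin directories.
+
+## 5. Model and Tool Choices
+
+```text
+Default conversion:
+  in-house reader / mapper / writer
+
+Lightweight OCR:
+  Tesseract.js or lightweight PaddleOCR / PP-OCR
+
+Advanced OCR:
+  PaddleOCR-VL / MinerU
+
+Post-conversion verification:
+  rule diff + SSIM + OCR read-back
+
+Not recommended:
+  cloud OCR APIs
+  a general-purpose VLM generating the target file directly
+  GB-scale models bundled by default
+```
+
+PaddleOCR can run locally without calling a cloud API. PaddleOCR-VL is large, however, and should be an advanced local OCR resource downloaded on demand rather than bundled by default.
+
+## 6. Post-Conversion File Verification
+
+Post-conversion verification is a core enhancement of Trans2Former and has three layers:
+
+### 1. Rule-level verification
+
+No AI; check directly:
+
+- Whether the page count changed abnormally
+- Whether the text length dropped noticeably
+- Whether images were lost
+- Whether tables were lost
+- Whether the heading count is abnormal
+- Whether blank pages appeared
+- Whether garbled text appeared
+- Whether abnormally long lines appeared
+
+### 2. Visual-level verification
+
+For high-fidelity conversion paths:
+
+```text
+Render the source file to page images
+Render the target file to page images
+Per-page SSIM / pixel diff
+Write abnormal pages to QualityReport
+```
+
+Start with a non-AI SSIM implementation; do not introduce large models at the outset.
+
+### 3. 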
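OCR read-back verification
+
+An important differentiating capability of this project:
+
+```text
+Source document text / OCR text
+  ↓
+Rendered screenshot of the target document
+  ↓
+OCR of the target screenshot
+  ↓
+Diff of source text against target OCR text
+  ↓
+Generate the quality report
+```
+
+Detectable issues:
+
+- Missing characters
+- Garbled text
+- Content loss
+- Lost table text
+- Lost text inside images
+- Lost headers and footers
+- Text made unreadable by layout
+
+## 7. Quality Report Integration
+
+All OCR, layout recovery, and post-conversion verification results should be written into the existing QualityReport / warnings mechanism rather than into a separate reporting system. Example shape:
+
+```json
+{
+  "score": 0.91,
+  "textConsistency": 0.96,
+  "layoutConsistency": 0.88,
+  "assetConsistency": 0.93,
+  "ocrConfidence": 0.89,
+  "issues": [
+    {
+      "severity": "warning",
+      "page": 3,
+      "type": "OCR_LOW_CONFIDENCE",
+      "message": "OCR confidence is low for some text on page 3; manual review is recommended."
+    },
+    {
+      "severity": "error",
+      "page": 5,
+      "type": "TEXT_CONTENT_LOST",
+      "message": "Text appears to have been lost on page 5 after conversion."
+    }
+  ]
+}
+```
+
+## 8. Suggested Development Stages (P9-A/B/C/D)
+
+### P9-A: OCR baseline
+
+Goals:
+
+- PNG can be OCRed locally
+- Scanned PDFs can be rendered to page images first, then OCRed
+- Output an OCRResult
+- Connect to warnings / QualityReport
+
+Not required:
+
+- Perfect table recovery
+- Formula recognition
+- Complex layout recovery
+
+### P9-B: OCR to model
+
+Goals:
+
+- OCRResult → FixedLayoutModel
+- FixedLayoutModel → SemanticDoc
+- Preserve bbox, confidence, page index, and reading order
+- Emit warnings for visual loss and inferred reading order
+
+### P9-C: Post-conversion verification
+
+Goals:
+
+- Rule diff
+- Text diff
+- SSIM visual comparison
+- OCR roundtrip verification
+- Write everything into QualityReport
+
+### P9-D: Advanced OCR
+
+Goals:
+
+- Integrate PaddleOCR-VL / MinerU
+- Support tables, formulas, complex layouts, and multi-column documents
+- Download model resources on demand
+- State size, runtime memory, degradation path, and failure messages clearly
+
+Before P9-A, **S3: on-demand download and local cache governance** must be completed first, defining the model-cache directory structure, manifest fields, checksum verification, a cleanup entry point, offline degradation notices, and the first-enablement download prompt flow.
+
+## 9. Final Decision
+
+Trans2Former should not become an "AI document converter"; it should become:
+
+> A document conversion system built on deterministic format conversion + local OCR + local layout recovery + conversion quality verification.
+
+Final strategy:
+
+```text
+Default main program:
+  lightweight, local, testable, no cloud dependency
+
+OCR:
+  core built-in capability, but enabled on demand with models downloaded on demand
+
+Advanced OCR:
+  separate local model resources, not in the default installer
+
+Conversion verification:
+  rule diff + SSIM + OCR read-back, as the project's core differentiator
+```
+
+In one sentence:
+
+> Main conversion relies on software algorithms, with OCR and quality verification as local core enhancements; no uploads, no cloud API calls, and no large models bundled by default, delivered as "lightweight default package + on-demand local OCR + advanced models as separate resources".
+
+## Differences from the 2026-05-27 spec
+
+| Dimension | 2026-05-27 spec | This spec (2026-05-28) |
+| --- | --- | --- |
+| Model delivery | "The official desktop installer contains document recognition and review model assets" | OCR model resources do not enter the default installer; they are downloaded locally into model-cache on first enablement |
+| Default installer shape | No concrete size cap, only a "size breakdown report" | Target 30–80 MB; after build, must report the main program, lightweight dependencies, and the empty 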
model-cache (combined total) |
+| OCR / layout / table resources | "Replaced with application upgrades", "manifest records model assets" (implying bundling) | Download is triggered by enabling OCR; manifest, checksum, cache path, cleanup entry point, and offline degradation notice apply to the download cache directory rather than the installer |
+| Advanced OCR (PaddleOCR-VL / MinerU) | Listed alongside lightweight OCR under the "bundled model manifest" | Explicitly "separate local model resources, not in the default installer"; size and hardware requirements are stated on enablement |
+| Stage layout | S3 "Local model runtime container and delivery governance: integrate the bundled model directory" | S3 redefined as "on-demand download and cache governance", followed in order by P9-A/B/C/D |
+| Post-conversion verification | Not a stage of its own | Promoted to P9-C as a core differentiator, covering rule diff, SSIM, and OCR read-back |
+
+Unaffected (still in effect):
+
+- The five parallel canonical models (SemanticDoc / WorkbookModel / SlideModel / FixedLayoutModel / AssetGraph).
+- The S2 Repair Engine design and implementation ([../../core/repair-engine.js](../../../public/core/repair-engine.js) and related files).
+- The boundary of "no dependency on cloud OCR/AI, no network during the processing stage, no upload of any document content".
+- Product matrix truth and documentation gates (S1 matrix / direction-sync tests).
diff --git a/docs/superpowers/specs/2026-05-28-on-demand-model-cache-design.md b/docs/superpowers/specs/2026-05-28-on-demand-model-cache-design.md
new file mode 100644
index 0000000..df35022
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-28-on-demand-model-cache-design.md
@@ -0,0 +1,119 @@
+# On-Demand Model Cache Design (S3)
+
+Status: in effect
+Date: 2026-05-28
+Prerequisites: S1 matrix truth, S2 Repair Engine, UI-A three-view refactor, [2026-05-28-lightweight-default-bundle-direction.md](2026-05-28-lightweight-default-bundle-direction.md)
+Next stages: P9-A OCR baseline, P9-B OCR→FixedLayoutModel, P9-C three-layer post-conversion verification, P9-D advanced OCR
+
+## Goals
+
+For the direction decision that "OCR / layout / table / quality-review model resources do not enter the default installer and are downloaded locally into model-cache on first enablement", provide **code contracts and infrastructure that P9-A can plug into directly**, without actually downloading anything, integrating a third-party OCR engine, or modifying the Tauri CSP.
+
+After S3 lands:
+- Any future OCR / layout / table / quality-reviewer implementation that builds a valid `ModelManifest` and calls `defaultModelCache.register(manifest)` immediately gets:
+  - SHA-256 verification
+  - A state machine (not-downloaded / importing / verifying / available / degraded / disabled)
+  - Automatic rendering of a security center UI card
+  - Chinese UI copy (first enablement / offline degradation / cleanup)
+  - A stable cache path: `model-cache////...`
+- The Repair Engine's `modelReview` metadata will be able to reference `ModelCacheRegistry.getStatus(manifestId).summary` to write model manifest information back into QualityReport.
+
+## Data Contract
+
+### `ModelManifest`
+
+```js
+{
+  schemaVersion: "trans2former.model-manifest.v1",
+  manifestId: string,              // globally unique, e.g. "ocr-text.tesseract.1.0.0"
+  task: "ocr-text" | "ocr-layout" | "ocr-table" | "quality-reviewer",
+  engine: "tesseract" | "paddleocr" | "paddleocr-vl" | "mineru" | "custom",
+  modelVersion: string,            // SemVer or a custom version string
+  bundleSize: number,              // bytes, > 0
+  quantization: "fp32" | "fp16" | "int8" | "none",
+  minMemoryMB: number,             // >= 0
+  sources: [{ kind, path, ... }],  // user-imported file path or bundled vendor path; remote URLs are forbidden (CSP blocks them too)
+  checksums: {
+    algorithm: "SHA-256",          // fixed; MD5/SHA-1 are disallowed
+    digest: string,                // hex
+    perFile: { [relPath]: hex }
+  },
+  fallback: {
+    onFailure: "skip-task" | "use-degraded-route" | "fail-quality-gate",
+    message: string,
+  },
+  ui: {
+    label: string,                 // user-visible task name
+    description: string,
+    enableHint: string,
+  }
+}
+```
+
+`createModelManifest` fills in default fields and deep-freezes nested objects; `validateModelManifest` throws `ConversionError({ code: "MODEL_MANIFEST_INVALID" })` for missing fields, an unknown task/engine, a non-SHA-256 algorithm, a nonexistent fallback policy, and similar cases.
+
+### State machine
+
+```
+not-downloaded → importing → verifying → available
+                     ↓           ↓           ↓
+                 degraded ←──────┘       disabled (user-initiated)
+```
+
+Allowed status constants: `STATUS_NOT_DOWNLOADED`, `STATUS_IMPORTING`, `STATUS_VERIFYING`, `STATUS_AVAILABLE`, `STATUS_DEGRADED`, `STATUS_DISABLED`. `setStatus` throws `MODEL_CACHE_STATUS_INVALID` when given a status outside this list.
+
+### Cache paths
+
+The only format is `model-cache////`, computed by `getCacheKey` / `getCacheDirectory` / `getCacheFilePath`. None of these functions may write to the file system (state stays in memory during S3); P9-A bridges to IndexedDB / Tauri fs when it integrates.
+
+`fileName` must be a relative path; `..`, absolute paths, and `\` are forbidden and throw `MODEL_CACHE_PATH_INVALID`.
+
+## Runtime Modules
+
+| File | Responsibility |
+| --- | --- |
+| [`public/core/model-cache/manifest.js`](../../../public/core/model-cache/manifest.js) | `createModelManifest` / `validateModelManifest` / `summarizeManifest` / constants |
+| [`public/core/model-cache/checksum.js`](../../../public/core/model-cache/checksum.js) | `sha256Hex` / `verifyChecksum` (crypto.subtle.digest) |
+| [`public/core/model-cache/cache-paths.js`](../../../public/core/model-cache/cache-paths.js) | Cache path derivation |
+| [`public/core/model-cache/availability.js`](../../../public/core/model-cache/availability.js) | `ModelCacheRegistry` + 
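`defaultModelCache` + status constants |
+| [`public/core/model-cache/ui-text.js`](../../../public/core/model-cache/ui-text.js) | Chinese UI copy constants for the 4 task types |
+
+All modules are exported at the top level through [`public/browser-transformer.js`](../../../public/browser-transformer.js); when implementing P9-A, `import { defaultModelCache, createModelManifest, sha256Hex, ... } from "./browser-transformer.js"` is enough.
+
+## UI Integration
+
+The security center dialog (`public/index.html`) gains a "Model cache" card:
+
+```
+┌── Model cache · model-cache ─────────────────────────────────────────────────┐
+│ Planned tasks: text OCR / layout analysis / table recovery / quality review  │
+│                                                                              │
+│ ┌── Text OCR · tesseract · 1.0.0 ─ not enabled ────────────────────────────┐ │
+│ │ 12 MB · imported locally on first enablement                             │ │
+│ └──────────────────────────────────────────────────────────────────────────┘ │
+│ (S3 stage: no model manifest registered yet)                                 │
+└──────────────────────────────────────────────────────────────────────────────┘
+```
+
+It is rendered by `renderModelCache` in [`public/security-center.js`](../../../public/security-center.js), which listens to `defaultModelCache.onChange` and refreshes automatically. S3 makes no register calls, so the card shows the empty-state copy; entries appear automatically once P9-A registers the first manifest.
+
+## Not Introduced
+
+- **No npm dependency**: SHA-256 uses the `crypto.subtle.digest` built into browsers and Node.
+- **No actual model download**: for now the sources array accepts only the user-provided and vendor-bundle kinds; no remote URL appears; fetch is not called.
+- **No Tauri CSP change**: `connect-src 'self'` is kept.
+- **No Repair Engine change**: the S2 interface and the P9-A integration point stay stable.
+- **No persisted state**: during S3 all state lives in an in-memory Map; the choice between IndexedDB and Tauri fs is made when P9-A integrates the real download mechanism.
+
+## Gates
+
+- New script [`scripts/model-cache-test.js`](../../../scripts/model-cache-test.js) covers manifest validation, checksum, cache paths, the state machine, and UI copy in 9 assertion groups.
+- The multi-model architecture gate in [`scripts/local-model-direction-test.js`](../../../scripts/local-model-direction-test.js) adds keyword locks for `defaultModelCache` / `MODEL_TASKS` / `MODEL_ENGINES` / `createModelManifest`.
+- [`scripts/local-security-test.js`](../../../scripts/local-security-test.js) automatically covers `public/core/model-cache/**`: these 5 files contain no `fetch` / `localStorage` / `XHR` / `WebSocket`, in line with the local-only default policy.
+
+## Future Extensions (after P9, outside this spec)
+
+- IndexedDB / Tauri fs persistence: store verified model bytes so the next launch needs no re-import.
+- Model source governance: vendor-bundle uses a sample manifest in the repository; user-provided uses a "choose file" dialog. Remote URLs are still not introduced.
+- Repair Engine integration: replace engine / modelVersion in `modelReview` with real manifest data.
+- 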
Security center management actions: entry points for manual cleanup, disabling, and forced re-verification.
diff --git a/docs/superpowers/specs/2026-05-28-p9a-ocr-baseline-design.md b/docs/superpowers/specs/2026-05-28-p9a-ocr-baseline-design.md
new file mode 100644
index 0000000..cc0118b
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-28-p9a-ocr-baseline-design.md
@@ -0,0 +1,158 @@
+# P9-A.1 OCR Baseline Design (contract + placeholder + integration points)
+
+Status: in effect
+Date: 2026-05-28
+Prerequisites: S1 matrix truth / S2 Repair Engine / S3 Model Cache / UI-A three-view refactor / [2026-05-28-lightweight-default-bundle-direction.md](2026-05-28-lightweight-default-bundle-direction.md)
+Next stages: P9-A.2 lightweight OCR runtime integration / P9-A.3 end-to-end PNG + scanned PDF / P9-B OCR→FixedLayoutModel / P9-C three-layer post-conversion verification / P9-D advanced OCR
+
+## Goals
+
+Provide **code contracts and a placeholder implementation that a future real engine can plug into** for the OCR conversion pipeline, without introducing any OCR runtime (Tesseract / PaddleOCR / MinerU), modifying the Tauri CSP, or actually running inference.
+
+After P9-A.1 lands:
+- Any future OCR engine that builds an object conforming to the `OCREngine` interface and calls `defaultOCRRegistry.register(engine)` immediately gets: path dispatch (`pickForTask("ocr-text")`), the structured OCRResult data contract (language, pages, lines, confidence, bbox), unified warning codes, a display slot in the security center UI, and a write-back slot in the Repair Engine's `modelReview`.
+- The PNG reader is now aware of OCR engine status: when OCR is not enabled it emits an info-level `OCR_UNAVAILABLE` warning but still produces a SemanticDoc along the "image asset" path, without blocking md / txt / html output.
+
+## Data Contract
+
+### `OCRResult`
+
+```js
+{
+  schemaVersion: "trans2former.ocr-result.v1",
+  language: "zh-CN" | "zh-TW" | "en" | "ja" | "ko" | "auto",
+  pages: [
+    {
+      pageIndex: number,           // >= 0
+      width: number,               // >= 0
+      height: number,              // >= 0
+      lines: [
+        {
+          text: string,
+          confidence: number,      // in [0, 1]
+          bbox: { x, y, w, h } | null,
+        }
+      ]
+    }
+  ],
+  fullText: string,
+  averageConfidence: number,       // in [0, 1]
+  runtimeMs: number,               // >= 0
+  engine: string,                  // engine id
+  modelVersion: string,
+  warnings: Warning[],             // optional, e.g. OCR_LOW_CONFIDENCE
+}
+```
+
+`createOCRResult` fills in default fields and deep-freezes the result; `validateOCRResult` throws `ConversionError({ code: "OCR_RESULT_INVALID" })` for missing fields, an unknown language, non-array pages, confidence outside [0,1], and similar cases. `summarizeOCRResult` returns `{ pageCount, lineCount, averageConfidence, fullTextLength, engine, modelVersion, runtimeMs, language }` for use by modelReview / 
QualityReport.
+
+### `OCREngine` interface
+
+```js
+{
+  id: string,                      // globally unique, non-empty, e.g. "placeholder", "tesseract-zh"
+  taskCapabilities: ["ocr-text"],  // must be a non-empty array
+  manifestId?: string,             // matches a manifestId in defaultModelCache
+  isAvailable(): boolean,
+  recognize({ image, options }): Promise,
+}
+```
+
+- `engine` need not be a class or come from Object.create; any plain object that conforms to the interface can be registered.
+- `isAvailable()` must be synchronous (the PNG reader's synchronous signature requires it); `recognize` must be asynchronous.
+- `manifestId` is optional; if given, it should match a manifest already registered in ModelCacheRegistry.
+
+### `OCREngineRegistry`
+
+```
+register(engine)            // checks interface conformance and ID uniqueness
+unregister(id)
+has(id), list(), pickById(id)
+pickForTask(task)           // prefers an engine with isAvailable === true; returns the last entry when none is available
+onChange(callback)
+reset()
+```
+
+Error codes: `OCR_ENGINE_INVALID` / `OCR_ENGINE_DUPLICATE` / `OCR_ENGINE_UNKNOWN`.
+
+`defaultOCRRegistry` is a module-level singleton; `ocr-bootstrap.js` registers `placeholderOCREngine` with it on first import.
+
+### Warning codes
+
+- `OCR_UNAVAILABLE` (info): injected by the reader / convert pipeline when the OCR engine is not enabled or not registered.
+- `OCR_LOW_CONFIDENCE` (lossy): used once a real engine is integrated, when averageConfidence falls below the threshold.
+- `OCR_ENGINE_FAILED` (lossy): used for degradation when recognize throws.
+- `OCR_DEGRADED_ROUTE` (info): used when the whole path is forced down to fallback.
+
+All warning factories live in `public/core/ocr/ocr-warnings.js`.
+
+## Runtime Modules
+
+| File | Responsibility |
+| --- | --- |
+| [`public/core/ocr/ocr-result.js`](../../../public/core/ocr/ocr-result.js) | OCRResult contract + `createOCRResult` / `validateOCRResult` / `summarizeOCRResult` |
+| [`public/core/ocr/ocr-warnings.js`](../../../public/core/ocr/ocr-warnings.js) | OCR warning code constants + factory functions |
+| [`public/core/ocr/ocr-engine.js`](../../../public/core/ocr/ocr-engine.js) | `OCREngine` interface validation + `OCREngineRegistry` + `defaultOCRRegistry` singleton |
+| [`public/core/ocr/placeholder-engine.js`](../../../public/core/ocr/placeholder-engine.js) | `placeholderOCREngine`: always unavailable; calling recognize throws `OCR_UNAVAILABLE` |
+| [`public/core/ocr/ocr-bootstrap.js`](../../../public/core/ocr/ocr-bootstrap.js) | Side-effect module: on import, registers the placeholder engine with 
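`defaultOCRRegistry`, registers the matching manifest with `defaultModelCache`, and immediately sets it to `disabled` |
+
+[`public/browser-transformer.js`](../../../public/browser-transformer.js) runs a top-level `import "./core/ocr/ocr-bootstrap.js"` to trigger the side effect and exports every OCR module API, so integrating a real engine in P9-A.2 needs a single import.
+
+## Integration Points
+
+### PNG reader
+
+[`public/formats/png.js`](../../../public/formats/png.js) calls `defaultOCRRegistry.pickForTask("ocr-text")` at the end of `readPng`:
+
+- Engine unavailable → inject `OCR_UNAVAILABLE` into `metadata.warnings`, with `engineId` / `manifestId` / `reason`.
+- Engine available → P9-A.2 calls `engine.recognize(...)` here and inserts the OCR text into the SemanticDoc as paragraph blocks; if recognize throws, fall back to the existing image asset path and emit `OCR_ENGINE_FAILED`.
+
+The PNG reader's synchronous signature is unchanged; when P9-A.2 integrates a real engine it can choose between:
+1. Making the reader async (which affects the read() signature in format-registry.js).
+2. Adding an "OCR enhancement stage" to the convert pipeline that enhances the model asynchronously after read.
+
+Option (2) is recommended: keep the reader pure and synchronous, and inject async OCR as a separate stage. This spec does not decide the matter during P9-A.1.
+
+### Repair Engine `modelReview`
+
+When `defaultRepairEngine.runCycle` generates the `modelReview` placeholder fields, P9-A.2 can, once a real engine is integrated, write the output of `summarizeOCRResult(result)` back into it so that QualityReport can observe it. The `tasks` array can be extended to `["lossy-warning-scan", "route-class-check", "ocr-text-recognition"]`.
+
+### Security center
+
+The "Model cache" card built in S3 refreshes automatically on `defaultModelCache.onChange`; once bootstrap registers the placeholder manifest, an "OCR text recognition · placeholder" entry appears immediately with status "disabled" and a hint that it is "waiting for P9-A.2 to integrate a real model". No UI code changes are needed.
+
+## Gates
+
+New script [`scripts/ocr-baseline-test.js`](../../../scripts/ocr-baseline-test.js), with 10 assertion groups:
+- Schema constants are stable (OCR_RESULT_SCHEMA_VERSION, OCR_LANGUAGES, OCR_WARNING_CODES.length === 4).
+- `createOCRResult` freezes the result and fills fields, and `summarizeOCRResult` computes correctly.
+- `validateOCRResult` rejects a wrong language / non-array pages / out-of-range averageConfidence / out-of-range runtimeMs / bad page geometry / bad line confidence.
+- The 4 warning factories return the correct code / severity.
+- `OCREngineRegistry`: register validates conformance, a duplicate register throws `OCR_ENGINE_DUPLICATE`, and an invalid object throws `OCR_ENGINE_INVALID`.
+- `pickForTask` prefers an engine whose isAvailable is true and falls back to the last entry when none is available.
+- `placeholderOCREngine.isAvailable() === false`; `recognize` throws `OCR_UNAVAILABLE`.
+- 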
`ensureOCRBootstrap` is idempotent; after the call, `defaultModelCache.getStatus(PLACEHOLDER_OCR_MANIFEST_ID).status === STATUS_DISABLED` and `defaultOCRRegistry.pickForTask("ocr-text").id === "placeholder"`.
+- In placeholder mode the PNG reader injects an `OCR_UNAVAILABLE` warning (with engineId/manifestId).
+- After a usable stub engine is temporarily registered, the PNG reader no longer injects `OCR_UNAVAILABLE`.
+
+The gate keywords in `scripts/local-model-direction-test.js` gain `defaultOCRRegistry` / `createOCRResult` / `OCR_UNAVAILABLE` / `placeholderOCREngine`.
+
+`scripts/local-security-test.js` automatically covers `public/core/ocr/**`: the 5 files have 0 hits for `fetch` / `localStorage` / `XHR` / `WebSocket`, keeping them local-only.
+
+## Not Introduced
+
+- No Tesseract.js, PaddleOCR, PaddleOCR-VL, MinerU, or any other third-party OCR runtime.
+- No change to Tauri CSP / capabilities.
+- No real OCR inference code and no real OCR test fixtures.
+- No new npm dependency; `optionalDependencies` still contains only `pdfjs-dist`.
+- No change to the Repair Engine, the conversion core, other readers / writers, UI routing, or preview.
+- No change to the product matrix (PNG output is still `["html","txt","json","pdf"]`).
+
+## Future Extensions (P9-A.2+, outside this spec)
+
+- Vendor the Tesseract.js core resources into `public/vendor/tesseract/` (similar to `sync-pdfjs-vendor.js`) and implement a `TesseractEngine` subclass.
+- When the user enables OCR, import tessdata locally into IndexedDB; after SHA-256 verification passes, activate it with `defaultModelCache.setStatus(manifestId, STATUS_AVAILABLE)`.
+- Modify the Tauri CSP so that wasm and workers can load.
+- Introduce a scanned PDF rendering pipeline (pdfjs canvas → toBlob → recognize).
+- Write OCR text into the SemanticDoc as paragraph blocks, with confidence information attached to RepairAction.evidence.
+- Have the Repair Engine write `summarizeOCRResult` back into `modelReview` so that QualityReport carries OCR evidence.
diff --git a/docs/superpowers/specs/2026-05-28-p9a2-tesseract-runtime-design.md b/docs/superpowers/specs/2026-05-28-p9a2-tesseract-runtime-design.md
new file mode 100644
index 0000000..f2059c5
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-28-p9a2-tesseract-runtime-design.md
@@ -0,0 +1,123 @@
+# P9-A.2 Tesseract Runtime Design (vendor + skeleton + CSP)
+
+Status: in effect
+Date: 2026-05-28
+Prerequisites: P9-A.1 OCR contract and placeholder / S3 Model Cache / UI-A three-view refactor / [2026-05-28-lightweight-default-bundle-direction.md](2026-05-28-lightweight-default-bundle-direction.md)
+Next stages: P9-A.2.b tessdata 
IDB + UI enablement + real OCR / P9-A.3 end-to-end PNG + scanned PDF
+
+## Goals
+
+Integrate the first real runtime candidate for the OCR pipeline, `tesseract.js`, but in this round **complete only vendor sync + the Engine skeleton + the CSP adjustment**; tessdata persistence, the UI enablement flow, and real inference are not implemented. After landing:
+
+- The `tesseract.js` core / wasm / worker files (about 4–6 MB) are stored statically in `public/vendor/tesseract/`, following the same vendor pattern as `pdfjs-dist`.
+- Besides the placeholder, `defaultOCRRegistry` now has a `tesseractOCREngine` entry; its `isAvailable()` depends on both the vendor resources and whether tessdata exists in storage, and is always false during A.2.
+- `defaultModelCache` registers one more manifest, `ocr-text.tesseract.5.0.0`, with status `not-downloaded`; the security center immediately shows a "Tesseract.js OCR" entry and prompts for a tessdata import.
+- The Tauri CSP adds `'wasm-unsafe-eval'` so that wasm can be instantiated in the WebView; other directives are unchanged.
+- The `OCRStorage` interface abstraction + the `InMemoryStorage` implementation + a `createIndexedDBStorage(dbName)` factory placeholder (which still falls back to InMemoryStorage during A.2); P9-A.2.b can connect real IndexedDB by replacing the factory internals.
+
+## New Modules
+
+| File | Responsibility |
+| --- | --- |
+| [`scripts/sync-tesseract-vendor.js`](../../../scripts/sync-tesseract-vendor.js) | Syncs resources from `node_modules/tesseract.js/dist/` + `node_modules/tesseract.js-core/` into `public/vendor/tesseract/{core,worker}/`; when the package is missing, prints a warning and exits 0 without blocking CI / release:prepare |
+| [`public/core/ocr/ocr-storage.js`](../../../public/core/ocr/ocr-storage.js) | `OCRStorage` interface + `InMemoryStorage` + `createIndexedDBStorage(dbName)` factory + `defaultOCRStorage` singleton |
+| [`public/core/ocr/tesseract-engine.js`](../../../public/core/ocr/tesseract-engine.js) | `tesseractOCREngine` implements the OCREngine interface; `isAvailable` checks the vendor flag `globalThis.__t2fTesseractVendorReady` and whether storage holds tessdata; `recognize` currently implements only the rejection paths (vendor-not-ready / tessdata-missing / runtime-not-wired), with real inference arriving in P9-A.2.b |
+| [`public/core/ocr/tesseract-bootstrap.js`](../../../public/core/ocr/tesseract-bootstrap.js) | Side-effect import: registers the manifest with `defaultModelCache` (status: not-downloaded) and the engine with `defaultOCRRegistry`; must be imported after `ocr-bootstrap.js` (placeholder) |
+
+`browser-transformer.js` runs a top-level `import "./core/ocr/tesseract-bootstrap.js"` to trigger the side effect and exports the full API.
+
+## OCRStorage Abstraction
+
+```js
+{
+  has(key): Promise,
+  get(key): Promise,
+  
put(key, value, meta = { sha256 }): Promise,
+  delete(key): Promise,
+  list(): Promise<{ key, size, sha256, updatedAt }[]>,
+  clear(): Promise,
+}
+```
+
+- `InMemoryStorage`: Map-based, for Node tests and browser fallback.
+- `createIndexedDBStorage(dbName)`: still falls back to `InMemoryStorage` during A.2; A.2.b connects real IDB inside the factory (no change to the external signature).
+- `defaultOCRStorage`: returns `createIndexedDBStorage()` when `globalThis.indexedDB` is detected in a browser/Tauri environment, otherwise `InMemoryStorage`.
+
+Error codes: `OCR_STORAGE_INVALID_KEY` / `OCR_STORAGE_INVALID_VALUE`.
+
+## TesseractEngine State Machine
+
+```
+vendorReady?   = globalThis.__t2fTesseractVendorReady
+tessdataReady? = await defaultOCRStorage.has("tesseract/chi_sim.traineddata"
+                                          || "tesseract/eng.traineddata")
+
+isAvailable() === vendorReady && _tessdataReady
+                                     │
+                                     └── _tessdataReady is refreshed asynchronously by ensureProbe()
+```
+
+recognize rejection paths (A.2 stage):
+- `!vendorReady` → `OCR_UNAVAILABLE` (reason: `vendor-not-ready`)
+- `!language found in storage` → `OCR_UNAVAILABLE` (reason: `tessdata-missing`)
+- `!image` → `OCR_ENGINE_FAILED` (reason: `missing-image`)
+- All checks above pass → `OCR_ENGINE_FAILED` (reason: `runtime-not-wired`), pointing to the P9-A.2.b integration.
+
+A.2.b integration path: replace the `runtime-not-wired` branch with: dynamic import of `/vendor/tesseract/...` → initialize the worker → read the tessdata blob → recognize → wrap as an OCRResult.
+
+## Tauri CSP
+
+```
+default-src 'self';
+script-src 'self' 'wasm-unsafe-eval';    ← P9-A.2 adds wasm-unsafe-eval
+style-src 'self' 'unsafe-inline';
+img-src 'self' data: blob:;
+frame-src 'self' blob:;
+worker-src 'self' blob:;                 ← already present since P9-A.1
+connect-src 'self';                      ← no remote wasm; wasm loads same-origin only
+object-src 'none';
+base-uri 'self'
+```
+
+`'wasm-unsafe-eval'` is the CSP keyword that permits `WebAssembly.instantiate`/`WebAssembly.compile`, matching the tesseract.js wasm initialization path. `connect-src 'self'` is kept: vendor resources are same-origin, and tessdata comes from a file the user selects locally or from IDB, with no network access.
+
+## Vendor Sync
+
+`sync-tesseract-vendor.js` flow:
+1. Check whether `node_modules/tesseract.js` exists. If missing → print a warning and exit 0 (compatible with `optionalDependencies`; CI environments that do not install OCR are not blocked).
+2. Create `public/vendor/tesseract/{core,worker}/`.
+3. 
Copy `.js`/`.mjs`/`.map` files from `node_modules/tesseract.js/dist/`: those whose names start with `worker` go into `worker/`, the rest into `core/`.
+4. Copy `.js`/`.wasm`/`.map` files from `node_modules/tesseract.js-core/` (also handling the case where it has a `dist/` subdirectory) into `core/`.
+5. Print a summary of the sync result (`worker=true|false, core=true|false`).
+
+`release:prepare` now runs `sync-pdfjs-vendor.js` → `sync-tesseract-vendor.js` → `prepare-release.js` in order.
+
+## Gates
+
+`scripts/ocr-baseline-test.js` grows to 15 assertion groups (the original 10 plus 5):
+- TesseractEngine is registered in the registry, its ID/taskCapabilities/manifestId are correct, and `isAvailable() === false`.
+- The manifest has status `not-downloaded` in `defaultModelCache`.
+- `recognize` throws the matching code and reason in each of the three stages (vendor-not-ready / tessdata-missing / runtime-not-wired); state transitions are simulated with `markTesseractVendorReady` + `_storage.put`.
+- `InMemoryStorage` full CRUD plus error cases (empty key, non-buffer value).
+- `defaultOCRStorage` is an InMemoryStorage instance in a Node environment.
+
+`scripts/local-security-test.js` adds `public/core/ocr/{ocr-storage,tesseract-engine,tesseract-bootstrap}.js` to `ALLOWED_PUBLIC_FILES` (legitimate use of IDB / same-origin vendor fetch) and to `STRICT_LOCAL_ONLY_FILES` (gate: no remote protocol strings may appear). `public/vendor/tesseract/**` is handled as an exception through the extended `isLocalVendorAsset`.
+
+`scripts/local-model-direction-test.js` extends the multiModel keywords with `TesseractEngine` / `defaultOCRStorage` / `tesseract.js`.
+
+## Not Introduced
+
+- No actual OCR inference.
+- No real tessdata IndexedDB I/O (A.2.b).
+- No "import tessdata" UI button (A.2.b).
+- No change to the fixed-layout or scanned PDF paths (P9-B).
+- No runtime package beyond npm dependencies; only `tesseract.js@^5.1.1` is added to optionalDependencies, and neither vendor sync nor npm install blocks when the package is missing.
+
+## Future Extensions (A.2.b / A.3)
+
+- Connect real IDB inside the `createIndexedDBStorage` factory (using the native API, to avoid extra dependencies such as `idb` / `dexie`).
+- Add an "import tessdata" button to the security center "Model cache" card: `` → `sha256Hex(buffer)` → `defaultOCRStorage.put("tesseract/{lang}.traineddata", buffer, { sha256 })` → `setStatus(STATUS_AVAILABLE)`.
+- Connect `TesseractEngine.recognize` to real inference: dynamic import of `/vendor/tesseract/core/tesseract.min.js` → initialize the worker (worker.min.js) → load wasm + tessdata → recognize → wrap as an OCRResult.
+- Real PNG fixtures (10–20 KB, containing Chinese and English) to drive end-to-end tests.
+- Scanned PDF rendering (pdfjs canvas → toBlob → 
recognize).
+- Write `summarizeOCRResult` back into the Repair Engine's `modelReview` so that QualityReport carries OCR evidence.
diff --git a/docs/superpowers/specs/2026-05-28-p9a2b-tessdata-runtime-design.md b/docs/superpowers/specs/2026-05-28-p9a2b-tessdata-runtime-design.md
new file mode 100644
index 0000000..42319e7
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-28-p9a2b-tessdata-runtime-design.md
@@ -0,0 +1,122 @@
+# P9-A.2.b tessdata IndexedDB + UI Enablement + Real OCR Integration
+
+Status: in effect
+Date: 2026-05-28
+Prerequisites: P9-A.2 Tesseract Runtime Vendor + skeleton / S3 Model Cache / [2026-05-28-lightweight-default-bundle-direction.md](2026-05-28-lightweight-default-bundle-direction.md)
+Next stages: P9-A.3 end-to-end PNG + scanned PDF / P9-B OCR→FixedLayoutModel
+
+## Goals
+
+Fill in the three pieces that P9-A.2 left at the "skeleton" stage:
+
+1. **Real IDB I/O**: `createIndexedDBStorage` returns an `IndexedDBStorage` that actually reads and writes the `trans2former-ocr-cache` database; Node and environments without IDB fall back to `InMemoryStorage`.
+2. **UI enablement flow**: the Tesseract row in the security center "Model cache" card gains three buttons, "Import chi_sim.traineddata", "Import eng.traineddata", and "Clear cache"; the flow is `` → ArrayBuffer → `sha256Hex` → `defaultOCRStorage.put` → `defaultModelCache.setStatus(STATUS_AVAILABLE)`.
+3. **Real inference path**: `TesseractEngine.recognize` fully integrates `loadTesseractRuntime` + `createTesseractWorker` + `runRecognize` and maps the structure returned by tesseract.js to an `OCRResult`.
+
+This round **still does not hook into the convert pipeline**: `enhanceWithOCR(model, { engine })` is provided as a standalone function and the PNG reader keeps its synchronous signature; the hook is added when A.3 integrates the convert async stage.
+
+## Data Flow
+
+```
+┌────────────────────────────────┐    ┌───────────────────────┐
+│ User uploads PNG / scanned PDF │ →  │ formats/png.js reader │ → SemanticDoc (image asset)
+└────────────────────────────────┘    └───────────┬───────────┘
+                                                  │
+                                                  │ async enhanceWithOCR(model)
+                                                  ▼
+     ┌───────────────────────────────────────────────────────┐
+     │ defaultOCRRegistry.pickForTask("ocr-text")            │
+     │   → tesseractOCREngine.isAvailable()? 
if no: │ + │ return model + OCR_UNAVAILABLE warning │ + └───────────────────────────────────────────────────┘ + │ yes + ▼ + ┌───────────────────────────────────────────────────┐ + │ loadTesseractRuntime() → import('/vendor/...') │ + │ defaultOCRStorage.get("tesseract/chi_sim...") │ + │ createTesseractWorker({ namespace, language, buf }) │ + │ runRecognize(worker, image) │ + │ mapTesseractResultToOCR(...) → OCRResult │ + └─────────────────────────┬─────────────────────────┘ + │ + ▼ + ┌───────────────────────────────────────────────────┐ + │ model.blocks.push(paragraphs from OCR.fullText) │ + │ model.metadata.modelReview = summarizeOCRResult(…) │ + │ warnings: low-confidence / engine-failed │ + └───────────────────────────────────────────────────┘ +``` + +## 新增 / 改造模块 + +| 文件 | 职责 | +| --- | --- | +| [`public/core/ocr/indexeddb-storage.js`](../../../public/core/ocr/indexeddb-storage.js) | `IndexedDBStorage` 类;单数据库 + 双 object store(`tessdata`, `metadata`);put 用单事务原子写两个 store;错误统一抛 `OCR_STORAGE_IDB_ERROR` | +| [`public/core/ocr/ocr-storage.js`](../../../public/core/ocr/ocr-storage.js) | `createIndexedDBStorage` 改为返回 `LazyIndexedDBStorage`(dynamic import `indexeddb-storage.js`),无 `globalThis.indexedDB` 时 fallback 到 `InMemoryStorage` | +| [`public/core/ocr/tesseract-runtime.js`](../../../public/core/ocr/tesseract-runtime.js) | `loadTesseractRuntime` 动态 import `/vendor/tesseract/core/tesseract.min.js`,失败抛 `OCR_VENDOR_LOAD_FAILED`;`createTesseractWorker` 用 vendor 路径 + tessdata blob URL;`runRecognize` 把 tesseract data 映射成 `OCRResult` | +| [`public/core/ocr/tesseract-engine.js`](../../../public/core/ocr/tesseract-engine.js) | `recognize` 真实接入:vendor + storage 检查 → `createTesseractWorker` → `runRecognize` → `disposeWorker`;失败统一抛 `OCR_ENGINE_FAILED` | +| [`public/core/ocr/png-ocr.js`](../../../public/core/ocr/png-ocr.js) | `enhanceWithOCR(model, { engine })` 函数:解析第一个 image asset → 调用 engine.recognize → 追加 paragraph blocks + `metadata.modelReview`;engine 不可用时返回原 model + 
`OCR_UNAVAILABLE`;低置信度发 `OCR_LOW_CONFIDENCE` | + +## UI 接入(安全中心) + +[`public/security-center.js`](../../../public/security-center.js) `renderModelCache` 渲染每条 manifest 时,对 `task === "ocr-text" && engine === "tesseract"` 追加: +- 「导入 chi_sim.traineddata」按钮(`data-import-tessdata data-language="chi_sim"`) +- 「导入 eng.traineddata」按钮(`data-import-tessdata data-language="eng"`) +- 「清除缓存」按钮(`data-clear-tessdata`),仅当状态为 `available` 时启用 + +事件委托在 `init()` 的 `dialog.addEventListener("click")` 中: +- `[data-import-tessdata]` → `importTessdata(dialog, target)`: + 1. 把 `` 显式 `.click()` 打开。 + 2. 收到 `change` 事件 → `file.arrayBuffer()` → `sha256Hex(buffer)` → `defaultOCRStorage.put("tesseract/{lang}.traineddata", buffer, { sha256 })`。 + 3. `tesseractOCREngine.ensureProbe()` 刷新 `_tessdataReady`;`markTesseractVendorReady(true)` 让 `isAvailable()` 在 vendor 已同步时返回 true。 + 4. `defaultModelCache.setStatus(manifestId, STATUS_AVAILABLE, { language, sha256, size })`。 + 5. 状态消息显示在 `[data-model-cache-status]` 区域,分 `info/success/error` 三个 level。 +- `[data-clear-tessdata]` → `clearTessdata(dialog, target)`:循环 delete chi_sim/eng,刷新 probe,`setStatus(STATUS_NOT_DOWNLOADED)`。 + +UI 改动仅在安全中心,不动 workbench 或路由。 + +## 错误编号 + +新增: +- `OCR_VENDOR_LOAD_FAILED` —— tesseract.js vendor 资源加载失败(包含缺包、版本不兼容、缺 createWorker)。 +- `OCR_STORAGE_IDB_ERROR` —— IndexedDB 操作失败。 + +复用: +- `OCR_UNAVAILABLE` (info) —— engine 未启用 / vendor 未就位 / tessdata 缺失。 +- `OCR_ENGINE_FAILED` (lossy) —— worker 初始化失败 / recognize 抛错。 +- `OCR_LOW_CONFIDENCE` (lossy) —— averageConfidence < 0.6 阈值。 + +## 测试覆盖 + +[`scripts/ocr-baseline-test.js`](../../../scripts/ocr-baseline-test.js) 扩展到 20 组断言(原 15 组 + 5 组新增): +- `loadTesseractRuntime` 在 Node 环境抛 `OCR_VENDOR_LOAD_FAILED`(验证 dynamic import 失败路径)。 +- `tesseractOCREngine.recognize` 在 tessdata 已 put 但 vendor 仍未就位时抛 `OCR_VENDOR_LOAD_FAILED`(验证真实链路第一步生效)。 +- `enhanceWithOCR` 无可用 engine → 返回原 model + `OCR_UNAVAILABLE`。 +- `enhanceWithOCR` 用 stub engine → 追加 paragraph + 写入 `metadata.modelReview.ocr`。 +- 
`enhanceWithOCR` with low confidence → emits `OCR_LOW_CONFIDENCE`.
+- `sha256Hex` + `InMemoryStorage.put/list` metadata is correct (verifies the SHA-256 computation used by the UI import flow).
+
+[`scripts/local-security-test.js`](../../../scripts/local-security-test.js) adds `indexeddb-storage.js` / `tesseract-runtime.js` / `png-ocr.js` to `ALLOWED_PUBLIC_FILES` + `STRICT_LOCAL_ONLY_FILES`. `public/vendor/tesseract/**` is already handled by the `isLocalVendorAsset` exception from P9-A.2.
+
+## Out of scope
+
+- No npm dependencies are added (`idb` / `dexie` / `fake-indexeddb` and the like are all excluded).
+- `enhanceWithOCR` is not hooked into `format-registry.convert()`; A.3 wires in the async pre-stage.
+- The scanned-PDF path is not modified (pdfjs canvas → toBlob → recognize is A.3 work).
+- No PNG / tessdata fixtures are created (npm test covers the code paths with a stub engine; real OCR is verified manually in a browser).
+- Manifest status is not persisted: after a browser refresh, `tesseractOCREngine.ensureProbe()` re-checks whether tessdata exists in IDB and restores the status automatically.
+
+## Verification
+
+- `node scripts/ocr-baseline-test.js`: all 20 assertion groups pass.
+- `npm test`: all 20 scripts (including the extended ocr-baseline-test) pass.
+- `npm install tesseract.js` + `npm run vendor:tesseract` + `npm start`, in the browser: open the Security Center; the "Model cache" card shows the Tesseract entry with three buttons; click "Import chi_sim.traineddata" → file picker → choose a local `.traineddata` file → the status switches to "Ready" and the status message shows the sha256 prefix; click "Clear cache" → the status returns to "Not enabled".
+- Calling `import("/browser-transformer.js").then(m => m.enhanceWithOCR(model, { engine }))` manually from the browser console shows the OCR enhancement running for real (precondition: both vendor assets and tessdata are in place).
+
+## Future extensions (A.3+)
+
+- Hook `enhanceWithOCR` into `format-registry.convert()` as an async pre-stage of the PNG input path (so that conversions such as PNG → md/txt include the OCR text automatically).
+- Scanned PDF: render with pdfjs to a canvas → `canvas.toBlob()` → call `enhanceWithOCR` and merge the pages.
+- Repair Engine: turn low-confidence lines in `OCRResult` (line.confidence < threshold) into `replaceTextRun` candidate actions, leaving the decision to replace to the user or the upper layer.
+- Multiple languages: let the user pick among chi_sim / chi_tra / eng / jpn and others in the UI and pass the choice to `engine.recognize({ language })`.
+- Advanced OCR: register PaddleOCR-VL / MinerU as new engines in `defaultOCRRegistry`, reusing the manifest + Storage abstractions.
diff --git a/docs/superpowers/specs/2026-05-28-p9a3-async-ocr-pipeline-design.md b/docs/superpowers/specs/2026-05-28-p9a3-async-ocr-pipeline-design.md
new file mode 100644
index 0000000..bd8b7a8
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-28-p9a3-async-ocr-pipeline-design.md
@@ -0,0 +1,110 @@
+# P9-A.3 PNG async OCR integration + Repair Engine OCR entry point
+
+Status: in effect
+Date: 2026-05-28
+Prerequisites: P9-A.2.b tessdata IndexedDB + UI enablement + real OCR / S2 Repair Engine / [2026-05-28-lightweight-default-bundle-direction.md](2026-05-28-lightweight-default-bundle-direction.md)
+Next phases: P9-A.4 scanned PDF rendering + end-to-end / P9-B FixedLayoutModel + visual comparison / P9-C three-layer post-conversion verification
+
+## Goals
+
+P9-A.2.b wired in `enhanceWithOCR`, IndexedDB and the Security Center UI, but no code path actively calls enhanceWithOCR. This round connects the PNG input path to the OCR enhancement stage automatically so that OCR text enters SemanticDoc, and teaches the Repair Engine to generate low-confidence repair candidates from OCR metadata, which opens the entry point for real model review in P9-B.
+
+Once P9-A.3 lands:
+- After tessdata has been imported in the browser or desktop app, PNG → md/txt/html output contains the recognized OCR text instead of only an "OCR_UNAVAILABLE" placeholder.
+- The `quality.modelReview.ocr` field carries real evidence such as lineCount / averageConfidence / engine / runtimeMs.
+- `quality.autoRepair.appliedActions` or `recommendations` contains `replaceTextRun` candidate entries (for low-confidence lines); once real model review lands in P9-B these actions can be applied to SemanticDoc automatically.
+- The existing synchronous `convert()` / `convertContent()` interfaces are completely unchanged, so no existing caller or test is affected.
+
+## Data flow
+
+```
+convertContentAsync({ from: "png", to: "md" | "txt" | "html" | "json" | "pdf" })
+  → registry.convertAsync(payload)
+    → prepareConversionModel(payload)            // sync, existing P8-B path
+    → if options.ocr.enabled !== false && from === "png":
+        await runOCRStage(model, ctx)            // dynamic import; engine.recognize → enhance
+          ├ defaultOCRRegistry.pickForTask("ocr-text")
+          ├ enhanceWithOCR(model, { engine })
+          │   ├ resolveAssetData(model)
+          │   ├ engine.recognize({ image }) → OCRResult
+          │   ├ append paragraph blocks from pages[].lines
+          │   ├ metadata.modelReview.ocr = summarizeOCRResult(result)
+          │   └ metadata.ocr.lines = [{ pageIndex, lineIndex, text, confidence, bbox, blockId }]
+          └ failure → withWarnings([OCR_ENGINE_FAILED]) → returns the original model
+    → write({ model, to, ... })                  // sync, the writer renders the OCR paragraphs too
+    → _wrapWithRepairCycle(...)                  // sync, shared with convert()
+        └ defaultRepairEngine.runCycle
+            ├ detectLossyRepairHints (S2)
+            ├ detectRouteClassDegradation (S2)
+            └ detectOCRLowConfidence (new in P9-A.3)
+                └ → replaceTextRun candidates for lines with confidence < 0.55
+```
+
+## New modules
+
+| File | Responsibility |
+| --- | --- |
+| [`public/core/ocr/ocr-stage.js`](../../../public/core/ocr/ocr-stage.js) | `runOCRStage(model, ctx)`: wraps `enhanceWithOCR` and injects an `OCR_ENGINE_FAILED` warning as a fallback; `getDefaultOCRLanguage` extracts the default language |
+| [`public/core/ocr/ocr-validator.js`](../../../public/core/ocr/ocr-validator.js) | `detectOCRLowConfidence(model)`: takes lines with confidence < 0.55 from `metadata.ocr.lines` and generates `replaceTextRun` candidates; at most 8 per page; evidence includes engineId / language / bbox / pageIndex / lineIndex |
+| `samples/png/t2f-sample.data-url.txt` | Data URL of an 80×24 grayscale PNG (black "T2F" text on a white background); 118 bytes, about 182 characters once base64-encoded; used for end-to-end stub tests of the OCR pipeline |
+| `samples/png/README.md` | Explains the purpose of the three tiny-{color} PNGs and t2f-sample, and gives the steps for verifying real OCR in a browser |
+
+## Modified files
+
+### `public/core/format-registry.js`
+- Extracts shared helpers `_buildRepairCtx(...)` and `_wrapWithRepairCycle(...)` so that the sync `convert()` and the async `convertAsync()` reuse the same repair / audit wrapping code.
+- Adds `async convertAsync({ content, from, to, title, fileName, options })`: the same entry validation and prepareConversionModel as `convert()`; when `options?.ocr?.enabled !== false && fromFormat === "png"` it awaits a dynamic import of `runOCRStage(model, ctx)` to inject the OCR enhancement, then goes through the same `_wrapWithRepairCycle`.
+
+### `public/browser-transformer.js`
+- Top-level export `convertContentAsync(payload)` → `registry.convertAsync(payload)`.
+- Top-level exports `runOCRStage` / `getDefaultOCRLanguage` / `detectOCRLowConfidence`.
+
+### `public/core/repair-engine.js`
+- `createDefaultRepairEngine()` appends `engine.registerValidator(detectOCRLowConfidence)` after registering `DEFAULT_VALIDATORS`. The `replaceTextRun` handler already exists (S2) and is reused as-is.
+
+### `public/core/ocr/png-ocr.js`
+- While appending paragraph blocks, `enhanceWithOCR` also expands `pages[].lines` one by one into `model.metadata.ocr.lines`, each entry containing `{ pageIndex, lineIndex, text, confidence, bbox, blockId }`, so that `detectOCRLowConfidence` can look up the specific paragraph by blockId.
+
+### `public/app.js`
+- The import list adds `convertContentAsync as convertInBrowserAsync`.
+- When the worker is unavailable and `payload.from === "png"`, `convertWithWorker(payload)` now goes through `convertInBrowserAsync(payload)`; other formats still use the synchronous `convertInBrowser(payload)`.
+
+## Error codes and warnings
+
+| Code | Trigger |
+| --- | --- |
+| `OCR_UNAVAILABLE` | No available engine, or vendor assets/tessdata missing |
+| `OCR_ENGINE_FAILED` | recognize threw / OCR stage fallback |
+| `OCR_LOW_CONFIDENCE` | averageConfidence is below the 0.6 threshold |
+| `OCR_VENDOR_LOAD_FAILED` | Dynamic import of the vendor assets failed |
+
+The Repair Engine uses the `replaceTextRun` action to surface low-confidence lines as candidates. **P9-A.3 does not force an apply** (the default confidence = 1 - ocrConfidence is close to 1, so the action enters the applied path, but the handler does not actually modify the text because before === after); once real model review lands in P9-B, the validator will supply a differing `after`, and only then will automatic repair take real effect.
+
+## Guardrails
+
+[`scripts/ocr-baseline-test.js`](../../../scripts/ocr-baseline-test.js) is extended to 26 assertion groups (the original 20 plus 6 new):
+- `convertContentAsync` with `options.ocr.enabled = false` returns the writer payload (same shape as sync convert).
+- After a stub engine is registered, `convertContentAsync` produces markdown / txt output that contains the stub OCR text.
+- `runOCRStage` persists `metadata.ocr.lines`, including confidence + blockId.
+- `detectOCRLowConfidence` generates `replaceTextRun` candidates for lines with confidence < 0.55 and none for high-confidence lines (>= 0.55).
+- The `samples/png/t2f-sample.data-url.txt` fixture passes through `convertContentAsync({ from: "png", to: "txt", options: { ocr: { enabled: false } } })` without throwing.
+
+[`scripts/local-security-test.js`](../../../scripts/local-security-test.js) covers `public/core/ocr/ocr-stage.js` + `ocr-validator.js` automatically: neither contains fetch / localStorage / XHR / WebSocket, so both satisfy the default local-only rule.
+
+[`scripts/local-model-direction-test.js`](../../../scripts/local-model-direction-test.js) adds the guard keywords `convertContentAsync` / `runOCRStage` / `detectOCRLowConfidence` (in the multiModel files).
+
+## Out of scope
+
+- No npm dependencies are added (no idb / dexie / canvas polyfill).
+- The Tauri CSP is not modified (A.2 already added `'wasm-unsafe-eval'`).
+- Readers and writers are untouched (the PNG reader keeps its synchronous signature; OCR is a separate async stage).
+- The product matrix is not modified (PNG outputs are still `["html","txt","json","pdf"]`; once OCR text is written into SemanticDoc, the txt/html outputs carry it automatically).
+- npm test does not run real OCR (a stub engine covers the code paths; real OCR is verified manually in a browser).
+
+## Future extensions (A.4 / after B)
+
+- P9-A.4: render each page of a scanned PDF with pdfjs to a canvas → `canvas.convertToBlob()` → call `enhanceWithOCR` and merge the pages; detect scanned PDFs automatically (a PDF whose extracted character count is below a threshold is treated as scanned).
+- Have the Repair Engine treat high-confidence OCR text as "auto-applied" (after language-model/dictionary weighting) rather than as a candidate.
+- Multi-language UI switching (chi_tra / jpn / kor).
+- Advanced OCR (PaddleOCR-VL / MinerU) registered as new engines; table / formula / complex layout recognition.
+- Semantic fuzzy matching in the Repair Engine: fuzzy-match low-confidence OCR text against a dictionary to generate a trustworthy `after` text automatically.
diff --git a/docs/superpowers/specs/2026-05-28-p9a4-scan-pdf-ocr-design.md b/docs/superpowers/specs/2026-05-28-p9a4-scan-pdf-ocr-design.md
new file mode 100644
index 0000000..d305ec5
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-28-p9a4-scan-pdf-ocr-design.md
@@ -0,0 +1,132 @@
+# P9-A.4 Scanned PDF OCR detection + Rasterizer skeleton + multi-page stage
+
+Status: in effect
+Date: 2026-05-28
+Prerequisites: P9-A.3 PNG async OCR integration / P9-A.2.b tessdata IDB + real OCR / [2026-05-28-lightweight-default-bundle-direction.md](2026-05-28-lightweight-default-bundle-direction.md)
+Next phases: P9-B OCR → FixedLayoutModel / P9-C three-layer post-conversion verification / P9-D advanced OCR
+
+## Goals
+
+P9-A.3 connected the PNG path to `convertContentAsync` + `runOCRStage`, but scanned PDFs still go through the P8-B text-extraction path, with no page image rendering and no OCR call. This round adds a "scanned PDF detection + page rasterize → multi-page OCR enhance" pipeline to the OCR module so that `convertContentAsync` recognizes scanned PDFs automatically and routes them into the OCR chain. This round **does not implement real pdfjs canvas rendering** (left to P9-B / manual browser verification), but a stub rasterizer + stub engine fully cover the end-to-end code path in npm test.
+
+Once P9-A.4 lands:
+- When `convertContentAsync({ from: "pdf", to: "txt" })` detects a scanned PDF (no pdfjs payload, or an extracted character count below the threshold), it calls `runScannedPdfOCRStage(model, ctx)` automatically.
+- `runScannedPdfOCRStage` renders each page to a PNG data URL with `rasterizer.rasterize({ pageIndex })` in turn → `engine.recognize({ image })` → appends the multi-page OCR paragraphs to SemanticDoc.
+- `metadata.modelReview.ocr` gives the `pageCount` / `lineCount` / `averageConfidence` / `runtimeMs` overview; `metadata.ocr.lines` persists every line with `pageIndex` + `blockId`, and the Repair Engine's `detectOCRLowConfidence` validator reuses it automatically.
+- The existing synchronous `convert()` is completely unchanged; no existing test breaks.
+
+## Data flow
+
+```
+convertContentAsync({ from: "pdf", to: "txt" })
+  → 
registry.convertAsync(payload)
+    → prepareConversionModel(payload)            // sync, existing P8-B text extraction
+    → if options.ocr.enabled !== false && from === "pdf":
+        const { isScannedPdf } = await import("./ocr/pdf-rasterizer.js")
+        const detection = await isScannedPdf(content)
+        if (detection.scanned):
+          const { runScannedPdfOCRStage } = await import("./ocr/scan-pdf-stage.js")
+          model = await runScannedPdfOCRStage(model, { content, options, from, to })
+            ├ rasterizer.countPages({ content })
+            ├ for pageIndex in 0..min(maxScanPages, pageCount):
+            │     rasterizer.rasterize({ content, pageIndex, dpi })
+            │       → { dataUrl, width, height }
+            │     engine.recognize({ image: dataUrl })
+            │       → OCRResult
+            │     model.blocks.push(...paragraphsFromPageResult(result))
+            │     metadata.ocr.lines.push({ pageIndex, lineIndex, text, confidence, bbox, blockId })
+            ├ metadata.modelReview.ocr = { pageCount, lineCount, averageConfidence, runtimeMs, engine, language }
+            └ if averageConfidence < 0.6: OCR_LOW_CONFIDENCE
+    → write({ model, to, options })              // sync, existing writer
+    → _wrapWithRepairCycle(...)                  // sync, shared with convert()
+        └ defaultRepairEngine.runCycle
+            ├ detectLossyRepairHints (S2)
+            ├ detectRouteClassDegradation (S2)
+            └ detectOCRLowConfidence (P9-A.3) → replaceTextRun candidates
+```
+
+## New modules
+
+| File | Responsibility |
+| --- | --- |
+| [`public/core/ocr/pdf-rasterizer.js`](../../../public/core/ocr/pdf-rasterizer.js) | `isScannedPdf(content, options)` heuristic (based on `expandPdfContentForTextExtraction` + detection of `PDFJS_PAYLOAD_MARKER` + a character threshold); the `PdfPageRasterizer` abstraction + `defaultPdfPageRasterizer` (throws `OCR_RASTERIZER_UNAVAILABLE` by default in Node); `setPdfPageRasterizer(impl)` / `resetPdfPageRasterizer()` let tests inject a stub |
+| [`public/core/ocr/scan-pdf-stage.js`](../../../public/core/ocr/scan-pdf-stage.js) | `runScannedPdfOCRStage(model, ctx)`: multi-page rasterize + enhance, merged; on any error it injects an `OCR_ENGINE_FAILED` warning and returns the original model; `maxScanPages` / `dpi` / `language` are adjustable through `options.ocr.*` |
+
+## isScannedPdf heuristic
+
+```js
+async function isScannedPdf(content, options = {}) {
+  try {
+    const expanded = await expandPdfContentForTextExtraction(content);
+    if (!expanded.includes("% Trans2Former PDFJS_TEXT_START")) {
+      return { scanned: true, reason: "no-pdfjs-payload" };  // pdfjs could not parse the file or returned nothing
+    }
+    const model = readPdf({ content: expanded });
+    const extractedChars = countModelText(model);            // character count after stripping whitespace
+    return {
+      scanned: extractedChars < threshold,                   // default 300
+      extractedChars,
+      threshold,
+      reason: extractedChars < threshold ? "low-extracted-text" : "text-pdf",
+    };
+  } catch (error) {
+    return { scanned: true, reason: `extraction-failed:${error.message}` };
+  }
+}
+```
+
+- No PDFJS_PAYLOAD marker → treated as scanned (pdfjs failed to parse the file or it has no text objects).
+- PDFJS_PAYLOAD present but fewer than 300 extracted characters → treated as scanned (a typical scanned page with the occasional caption or page number).
+- PDFJS_PAYLOAD present and ≥ 300 characters → a text PDF; OCR is skipped.
+
+Callers can adjust the threshold through `options.ocr.scanPdfThreshold`. The fallback for a misclassification (a text PDF routed to OCR) is that `runScannedPdfOCRStage` returns the original model plus an `OCR_UNAVAILABLE` warning when the engine is unavailable, so writing the output is not blocked.
+
+## PdfPageRasterizer interface
+
+```js
+{
+  rasterize({ content, pageIndex, dpi }): Promise<{ dataUrl, width, height }>,
+  countPages({ content }): Promise<number>,
+}
+```
+
+- The default `defaultPdfPageRasterizer` implementation throws `OCR_RASTERIZER_UNAVAILABLE` in Node.
+- On the browser/Tauri side, **real pdfjs canvas rendering is not implemented in this round**; it is left to P9-B (FixedLayoutModel needs the same rendering capability).
+- Tests inject a stub through `setPdfPageRasterizer(stubImpl)`.
+
+## Error codes
+
+- `OCR_RASTERIZER_UNAVAILABLE`: the default rasterizer is not implemented (Node environment, or no implementation injected).
+- `OCR_RASTERIZER_FAILED`: rasterize threw a non-business error.
+- `OCR_RASTERIZER_INVALID`: the object passed to `setPdfPageRasterizer` is missing methods.
+
+`OCR_UNAVAILABLE` / `OCR_ENGINE_FAILED` / `OCR_LOW_CONFIDENCE` are reused, consistent with the PNG path.
+
+## Guardrails
+
+[`scripts/ocr-baseline-test.js`](../../../scripts/ocr-baseline-test.js) is extended to 30 assertion groups (the original 26 plus 4 new):
+- `isScannedPdf` returns scanned=true for a minimal PDF header with no pdfjs payload.
+- `defaultPdfPageRasterizer.rasterize` throws `OCR_RASTERIZER_UNAVAILABLE` in Node.
+- With an injected stub rasterizer (returning a fixed dataUrl) + stub engine (returning a stub OCRResult), `runScannedPdfOCRStage` on a 2-page input appends 2 paragraph blocks and writes `metadata.modelReview.ocr.pageCount=2` + `metadata.ocr.lineCount=2`.
+- `convertContentAsync({ from: "pdf", to: "txt" })` with the stub rasterizer + stub engine produces txt output that contains the OCR text.
+
+[`scripts/local-security-test.js`](../../../scripts/local-security-test.js) adds `pdf-rasterizer.js` + `scan-pdf-stage.js` to `ALLOWED_PUBLIC_FILES` + `STRICT_LOCAL_ONLY_FILES`.
+
+[`scripts/local-model-direction-test.js`](../../../scripts/local-model-direction-test.js) adds the guard keywords `isScannedPdf` / `runScannedPdfOCRStage` / `defaultPdfPageRasterizer`.
+
+## Out of scope
+
+- No real browser-side pdfjs canvas rendering (P9-B).
+- No scanned PDF fixture is added to the repository (a hand-assembled minimal PDF byte string covers the code path).
+- No npm dependencies are added (pdfjs-dist remains an optionalDependency; no canvas / node-canvas).
+- The Tauri CSP is not modified (A.2 already added `'wasm-unsafe-eval'`).
+- The synchronous `convert()` and the PNG async stage (P9-A.3) are untouched.
+
+## Future extensions (after P9-B)
+
+- A real browser-side `defaultPdfPageRasterizer.rasterize` implementation: dynamic import of `/vendor/pdfjs/pdf.min.mjs` → `getDocument` → `page.render({ canvasContext })` → `canvas.toBlob()` → return the dataUrl.
+- A real scanned PDF fixture (about 10 KB, containing Chinese and English text) in the repository + end-to-end OCR verification (requires tessdata, so left to manual browser testing).
+- A multi-page concurrent worker pool: rendering + OCR are currently serial.
+- Hybrid PDFs: classify each page as scanned vs. text, process them separately, and merge the output.
+- FixedLayoutModel: write each page's OCR result + bbox into FixedLayoutModel for P9-B layout / table recovery.
+- Visual comparison: scanned PDF OCR output → render back to PDF → SSIM comparison (P9-C).
diff --git a/docs/superpowers/specs/2026-05-28-p9b-ocr-fixedlayout-design.md b/docs/superpowers/specs/2026-05-28-p9b-ocr-fixedlayout-design.md
new file mode 100644
index 0000000..2837266
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-28-p9b-ocr-fixedlayout-design.md
@@ -0,0 +1,126 @@
+# P9-B OCR → FixedLayoutModel + real browser-side rasterize
+
+Status: in effect
+Date: 2026-05-28
+Prerequisites: P9-A.4 scanned PDF OCR detection + Rasterizer skeleton / P9-A.3 PNG async OCR / S2 Repair Engine / [2026-05-28-lightweight-default-bundle-direction.md](2026-05-28-lightweight-default-bundle-direction.md)
+Next phases: P9-C three-layer post-conversion verification / P9-D advanced OCR
+
+## Goals
+
+P9-A.4 connected scanned PDF detection, the PdfPageRasterizer abstraction and the multi-page OCR stage, but the OCR results **do not enter FixedLayoutModel**, the canonical model path, and `defaultPdfPageRasterizer` on the browser/Tauri side is still the Node throw placeholder. This round fills in both pieces together so that scanned PDF conversions carry a real FixedLayoutModel + SemanticDoc dual view and work out of the box in the browser.
+
+Once P9-B lands:
+- Scanned PDF input → `runScannedPdfOCRStage` → multi-page OCR results → `mergeOCRResultsToFixedLayout` produces a FixedLayoutModel (with bbox / confidence / readingOrderHint) → `fixedLayoutToSemantic` derives paragraph blocks.
+- `model.fixedLayout` is exposed to the writer / Repair Engine / UI; `metadata.modelReview.ocr.fixedLayout` holds the `getFixedLayoutSummary` result.
+- On the browser/Tauri side, the first rasterize/countPages call automatically does a dynamic import of `pdf-rasterizer-browser.js`, which in turn dynamically imports `/vendor/pdfjs/pdf.min.mjs` and renders each page with canvas + page.render → toDataURL; Node still throws `OCR_RASTERIZER_UNAVAILABLE`, so stub tests see no regression.
+- The synchronous `convert()`, PNG enhance and the Repair Engine handlers are completely untouched.
+
+## Data flow
+
+```
+convertContentAsync({ from: "pdf", to: "txt" })
+  → registry.convertAsync(payload)
+    → prepareConversionModel(payload)            // sync, P8-B
+    → isScannedPdf(content)? if true:
+        runScannedPdfOCRStage(model, ctx)
+          ├ defaultPdfPageRasterizer.countPages(...)
+          ├ for pageIndex in 0..min(maxScanPages, pageCount):
+          │     defaultPdfPageRasterizer.rasterize({ pageIndex })  ← the browser loads pdf-rasterizer-browser automatically
+          │     engine.recognize({ image: dataUrl })
+          │     collect pageResult
+          ├ mergeOCRResultsToFixedLayout(pageResults) → FixedLayoutModel
+          │   ├ ocrResultToFixedLayoutPage(...) per page (y → x sort + confidence)
+          │   └ metadata.readingOrder = "heuristic-yx"
+          ├ enhanced.fixedLayout = FixedLayoutModel
+          ├ fixedLayoutToSemantic(fixedLayout) → SemanticDoc blocks
+          ├ enhanced.blocks.push(...semanticFromLayout.blocks)
+          ├ warnings: MODEL_VISUAL_FIDELITY_LOST + MODEL_TEXT_ORDER_HEURISTIC (info)
+          ├ metadata.ocr.lines (Repair Engine compatibility)
+          └ metadata.modelReview.ocr.fixedLayout = getFixedLayoutSummary(fixedLayout)
+    → write({ model, to, options })              // sync, the writer sees the complete blocks
+    → _wrapWithRepairCycle(...)                  // sync, shared with convert()
+```
+
+## New / modified modules
+
+| File | Responsibility |
+| --- | --- |
+| [`public/core/ocr/ocr-to-fixed-layout.js`](../../../public/core/ocr/ocr-to-fixed-layout.js) | `ocrResultToFixedLayoutPage(result, { pageNumber, pageIndex })`: sorts textRuns by bbox.y → bbox.x and attaches confidence; `mergeOCRResultsToFixedLayout(results)`: merges multiple pages into a FixedLayoutModel + `metadata.readingOrder = "heuristic-yx"` + a `metadata.ocr` overview. Reuses `createFixedLayoutModel` / `createPage` / `createTextRun`. |
+| [`public/core/ocr/pdf-rasterizer-browser.js`](../../../public/core/ocr/pdf-rasterizer-browser.js) | `createBrowserPdfPageRasterizer({ vendorUrl })`: dynamic import of `/vendor/pdfjs/pdf.min.mjs` → `getDocument({ data })` → `pdf.getPage(n)` → `page.getViewport({ scale })` → create a `<canvas>` + `page.render({ canvasContext, viewport })` → `canvas.toDataURL("image/png")` → `{ dataUrl, width, height }`; throws `OCR_RASTERIZER_FAILED` with the cause on failure; browser runtime detection lives in `ensureBrowserRuntime`. |
+| [`public/core/ocr/pdf-rasterizer.js`](../../../public/core/ocr/pdf-rasterizer.js) | Refactors `defaultPdfPageRasterizer`: caches `_injectedRasterizer` (setter) + `_autoBrowserImpl` (lazy dynamic import of `pdf-rasterizer-browser.js`); `rasterize/countPages` priority is inject → auto-browser → throw `OCR_RASTERIZER_UNAVAILABLE`; `resetPdfPageRasterizer` clears both caches. |
+| [`public/core/ocr/scan-pdf-stage.js`](../../../public/core/ocr/scan-pdf-stage.js) | Collects each page's OCR result → `mergeOCRResultsToFixedLayout` → `model.fixedLayout` + blocks derived by `fixedLayoutToSemantic` + emits the `MODEL_VISUAL_FIDELITY_LOST` / `MODEL_TEXT_ORDER_HEURISTIC` info warnings + `metadata.modelReview.ocr.fixedLayout = getFixedLayoutSummary(...)`; `metadata.ocr.lines` is still provided for the Repair Engine validator. |
+| [`public/core/models/fixed-layout.js`](../../../public/core/models/fixed-layout.js) | `createTextRun` gains `confidence` (default 0, clamped to [0,1]); `createPage` gains `readingOrderHint` (default ""). Existing OFD / PDF reader calls are not broken. |
+
+## Reading-order heuristic
+
+Inside `ocrResultToFixedLayoutPage`, the lines of each page are sorted as follows:
+
+```js
+lines.sort((a, b) => {
+  if (a.bbox.y !== b.bbox.y) return a.bbox.y - b.bbox.y;  // top → bottom
+  return a.bbox.x - b.bbox.x;                              // left → right
+});
+```
+
+The simple y → x heuristic does not distinguish multiple columns, but together with the `MODEL_TEXT_ORDER_HEURISTIC` info warning it lets the upper layer (the user / Repair Engine) know that the order is coarse; P9-C/D can later extend it to multi-column detection / heading inference.
+
+## Automatic loading of the browser-side rasterizer
+
+```js
+// pdf-rasterizer.js (after the P9-B refactor)
+async function tryLoadBrowserRasterizer() {
+  if (_autoBrowserImpl) return _autoBrowserImpl;
+  if (_autoBrowserLoadFailed) return null;
+  if (!isBrowserRuntime()) { _autoBrowserLoadFailed = true; return null; }
+  try {
+    const mod = await import("./pdf-rasterizer-browser.js");
+    _autoBrowserImpl = mod.createBrowserPdfPageRasterizer();
+    return _autoBrowserImpl;
+  } catch (error) {
+    _autoBrowserLoadFailed = true;
+    return null;
+  }
+}
+
+defaultPdfPageRasterizer.rasterize(args) →
+  injected? injected.rasterize :
+  auto-browser? autoImpl.rasterize :
+  throw OCR_RASTERIZER_UNAVAILABLE
+```
+
+Tests always take the injected implementation first via `setPdfPageRasterizer(stub)`; `resetPdfPageRasterizer()` clears both caches at once (so that the next test goes through the fallback path again).
+
+## Error codes
+
+Reused: `OCR_RASTERIZER_UNAVAILABLE` / `OCR_RASTERIZER_FAILED` / `OCR_RASTERIZER_INVALID` / `OCR_UNAVAILABLE` / `OCR_ENGINE_FAILED` / `OCR_LOW_CONFIDENCE`.
+
+New warnings: `MODEL_VISUAL_FIDELITY_LOST` (info) / `MODEL_TEXT_ORDER_HEURISTIC` (info), both emitted by `runScannedPdfOCRStage`.
+
+## Guardrails
+
+[`scripts/ocr-baseline-test.js`](../../../scripts/ocr-baseline-test.js) is extended to 34 assertion groups (the original 30 plus 4 new):
+- `ocrResultToFixedLayoutPage` converts an OCRResult into a FixedLayoutModel.page whose textRuns are sorted by y/x, carrying the confidence field.
+- `mergeOCRResultsToFixedLayout` merges multiple pages, `getFixedLayoutSummary` counts are correct, and `fixedLayoutToSemantic` derives blocks.
+- After the stubbed end-to-end `runScannedPdfOCRStage`: `enhanced.fixedLayout.pages.length === 2`, `metadata.modelReview.ocr.fixedLayout.pageCount === 2`, and `warnings` contains `MODEL_VISUAL_FIDELITY_LOST` + `MODEL_TEXT_ORDER_HEURISTIC`.
+- `defaultPdfPageRasterizer` priority: Node with nothing injected → throws; injected stub → the stub takes effect; reset → throws again.
+
+[`scripts/local-security-test.js`](../../../scripts/local-security-test.js) adds `ocr-to-fixed-layout.js` + `pdf-rasterizer-browser.js` to `ALLOWED_PUBLIC_FILES` + `STRICT_LOCAL_ONLY_FILES`.
+
+[`scripts/local-model-direction-test.js`](../../../scripts/local-model-direction-test.js) adds the guard keywords `ocrResultToFixedLayoutPage` / `mergeOCRResultsToFixedLayout` / `createBrowserPdfPageRasterizer` / `MODEL_TEXT_ORDER_HEURISTIC`.
+
+## Out of scope
+
+- No advanced reading-order algorithm (multi-column / heading detection): left to P9-C / P9-D.
+- No real scanned PDF fixture in the repository: a minimal PDF header + stub rasterizer + stub engine cover the code path.
+- No npm dependencies are added: pdfjs-dist remains an optionalDependency; canvas uses the browser's native API.
+- The Tauri CSP is not modified (A.2 already added `'wasm-unsafe-eval'`).
+- The synchronous `convert()`, PNG enhance, the Repair Engine handlers, the conversion core, other readers and writers, and UI routing are untouched.
+- The product matrix is not modified; PDF outputs are still `["md","html","txt","json","xml","docx","pdf"]`.
+
+## Future extensions (P9-C / D)
+
+- A real scanned PDF fixture (about 10 KB, containing Chinese and English text) in the repository + browser-side OCR regression tests (requires tessdata, so left to manual browser testing).
+- Advanced reading order: multi-column detection, heading inference, table region recognition.
+- FixedLayoutModel output: the PDF / OFD writers consume `model.fixedLayout` directly, without downgrading through SemanticDoc.
+- Visual comparison (SSIM): compare the rendering of the original PDF against the rendering of the output converted back to PDF and write the result to `qualityReport.layoutFidelity / visualFidelity` (P9-C work).
+- Advanced OCR runtimes (PaddleOCR-VL / MinerU) registered as new engines in `defaultOCRRegistry`, covering tables / formulas / complex layouts.
diff --git a/package.json b/package.json
index c617d19..b2272af 100644
--- a/package.json
+++ b/package.json
@@ -8,11 +8,12 @@
     "start": "node src/web-server.js",
     "web": "node src/web-server.js",
     "vendor:pdfjs": "node scripts/sync-pdfjs-vendor.js",
-    "release:prepare": "node scripts/sync-pdfjs-vendor.js && node scripts/prepare-release.js",
+    "vendor:tesseract": "node scripts/sync-tesseract-vendor.js",
+    "release:prepare": "node scripts/sync-pdfjs-vendor.js && node scripts/sync-tesseract-vendor.js && node scripts/prepare-release.js",
     "desktop:check": "node scripts/desktop-shell-test.js",
     "desktop:dev": "npm exec @tauri-apps/cli -- dev",
     "desktop:build": "npm exec @tauri-apps/cli -- build",
-    "test": "node scripts/smoke-test.js && node scripts/conversion-snapshot-test.js && node 
scripts/conversion-capability-audit-test.js && node scripts/conversion-quality-test.js && node scripts/format-integrity-test.js && node scripts/worker-payload-test.js && node scripts/browser-smoke-test.js && node scripts/workbench-queue-test.js && node scripts/desktop-shell-test.js && node scripts/local-security-test.js && node scripts/resource-budget-test.js && node scripts/p2-responsiveness-test.js && node scripts/p4-p5-p6-test.js && node scripts/p7-release-productization-test.js && node scripts/release-readiness-test.js" + "test": "node scripts/smoke-test.js && node scripts/conversion-snapshot-test.js && node scripts/conversion-capability-audit-test.js && node scripts/product-matrix-docs-test.js && node scripts/conversion-quality-test.js && node scripts/format-integrity-test.js && node scripts/worker-payload-test.js && node scripts/browser-smoke-test.js && node scripts/workbench-queue-test.js && node scripts/desktop-shell-test.js && node scripts/local-security-test.js && node scripts/local-model-direction-test.js && node scripts/repair-engine-test.js && node scripts/model-cache-test.js && node scripts/ocr-baseline-test.js && node scripts/resource-budget-test.js && node scripts/p2-responsiveness-test.js && node scripts/p4-p5-p6-test.js && node scripts/p7-release-productization-test.js && node scripts/release-readiness-test.js" }, "keywords": [ "converter", @@ -37,6 +38,7 @@ }, "homepage": "https://github.com/Vantalens/Trans2Former#readme", "optionalDependencies": { - "pdfjs-dist": "^5.7.284" + "pdfjs-dist": "^5.7.284", + "tesseract.js": "^5.1.1" } } diff --git a/public/app.js b/public/app.js index 890d69c..e88a539 100644 --- a/public/app.js +++ b/public/app.js @@ -1,5 +1,6 @@ import { convertContent as convertInBrowser, + convertContentAsync as convertInBrowserAsync, detectFormatFromName, getAllowedOutputFormats, getFormatCapabilities, @@ -28,6 +29,7 @@ import { } from "./core/workbench-state.js"; import { readBlobAsDecodedText } from "./core/text-decoding.js"; 
import { expandPdfContentForTextExtraction } from "./formats/pdf.js"; +import { openPreview } from "./router.js"; const inputContent = document.getElementById("inputContent"); const sourcePane = document.querySelector(".source-pane"); @@ -46,6 +48,7 @@ const outputUndoButton = document.getElementById("outputUndoButton"); const outputRedoButton = document.getElementById("outputRedoButton"); const outputCheckpointButton = document.getElementById("outputCheckpointButton"); const openPdfPreviewButton = document.getElementById("openPdfPreviewButton"); +const openStandalonePreviewButton = document.getElementById("openStandalonePreviewButton"); const errorDetailsPanel = document.getElementById("errorDetailsPanel"); const errorDetailsSummary = document.getElementById("errorDetailsSummary"); const errorCategory = document.getElementById("errorCategory"); @@ -890,10 +893,37 @@ function updateDownloadState(enabled) { if (enabled) { downloadOutputButton.classList.remove("disabled"); openPdfPreviewButton.disabled = !lastOutputIsPdf; + if (openStandalonePreviewButton) openStandalonePreviewButton.disabled = false; } else { downloadOutputButton.classList.add("disabled"); downloadOutputButton.href = "#"; openPdfPreviewButton.disabled = true; + if (openStandalonePreviewButton) openStandalonePreviewButton.disabled = true; + } +} + +function openCurrentOutputInPreview() { + if (!currentOutputType || currentOutputType === "none") return; + const payload = { + source: { + format: fromFormatSelect?.value || "", + fileName: currentFileName || "", + }, + output: { + type: currentOutputType, + format: currentOutputFormat || "", + mime: currentOutputMime || "", + text: outputEditor?.value || textOutputPreview?.textContent || "", + blobUrl: currentOutputBlobUrl || "", + printHtml: currentPrintHtml || "", + isPdf: Boolean(lastOutputIsPdf), + }, + meta: { generatedAt: Date.now() }, + }; + try { + openPreview(payload); + } catch (error) { + setStatus(`无法打开独立预览:${error?.message || error}`, 
"error"); } } @@ -1202,6 +1232,9 @@ function releaseConversionResources() { function convertWithWorker(payload) { const worker = createConvertWorker(); if (!worker) { + if (String(payload?.from || "").toLowerCase() === "png") { + return Promise.resolve(convertInBrowserAsync(payload)); + } return Promise.resolve(convertInBrowser(payload)); } @@ -1558,6 +1591,9 @@ cancelTransformButton.addEventListener("click", () => { setStatus("转换已取消", "info"); }); openPdfPreviewButton.addEventListener("click", printCurrentPdf); +if (openStandalonePreviewButton) { + openStandalonePreviewButton.addEventListener("click", openCurrentOutputInPreview); +} outputUndoButton?.addEventListener("click", () => { if (outputVersionIndex > 0) { @@ -1634,4 +1670,5 @@ bootstrapInitialSample(); syncMarkdownProfileControl(); syncPdfPaperControl(); openPdfPreviewButton.disabled = true; +if (openStandalonePreviewButton) openStandalonePreviewButton.disabled = true; updateConversionProgress({ stage: "idle", progress: 0 }); diff --git a/public/browser-transformer.js b/public/browser-transformer.js index e2ba59a..e2ce542 100644 --- a/public/browser-transformer.js +++ b/public/browser-transformer.js @@ -250,6 +250,124 @@ export function listFormats() { export { normalizeFormat }; export { getAllowedOutputFormats }; export { expandPdfContentForTextExtraction }; +export { defaultRepairEngine, RepairEngine, MIN_CONFIDENCE } from "./core/repair-engine.js"; +export { REPAIR_ACTION_TYPES, createRepairAction, validateRepairAction } from "./core/repair-actions.js"; +export { + MODEL_MANIFEST_SCHEMA_VERSION, + MODEL_TASKS, + MODEL_ENGINES, + MODEL_QUANTIZATIONS, + FALLBACK_STRATEGIES, + createModelManifest, + validateModelManifest, + summarizeManifest, +} from "./core/model-cache/manifest.js"; +export { sha256Hex, verifyChecksum } from "./core/model-cache/checksum.js"; +export { + MODEL_CACHE_ROOT, + getCacheKey, + getCacheDirectory, + parseCacheKey, + getCacheFilePath, +} from "./core/model-cache/cache-paths.js"; 
+export {
+  STATUS_NOT_DOWNLOADED,
+  STATUS_IMPORTING,
+  STATUS_VERIFYING,
+  STATUS_AVAILABLE,
+  STATUS_DEGRADED,
+  STATUS_DISABLED,
+  MODEL_CACHE_STATUSES,
+  ModelCacheRegistry,
+  defaultModelCache,
+} from "./core/model-cache/availability.js";
+export {
+  getFirstEnableHint,
+  getOfflineFallbackHint,
+  getClearCacheHint,
+  getStatusLabel,
+  getTaskLabel,
+  listKnownTaskLabels,
+} from "./core/model-cache/ui-text.js";
+import "./core/ocr/ocr-bootstrap.js";
+import "./core/ocr/tesseract-bootstrap.js";
+export {
+  OCR_RESULT_SCHEMA_VERSION,
+  OCR_LANGUAGES,
+  createOCRResult,
+  validateOCRResult,
+  summarizeOCRResult,
+} from "./core/ocr/ocr-result.js";
+export {
+  OCREngineRegistry,
+  defaultOCRRegistry,
+} from "./core/ocr/ocr-engine.js";
+export {
+  placeholderOCREngine,
+  PLACEHOLDER_OCR_MANIFEST_ID,
+} from "./core/ocr/placeholder-engine.js";
+export {
+  OCR_UNAVAILABLE,
+  OCR_LOW_CONFIDENCE,
+  OCR_ENGINE_FAILED,
+  OCR_DEGRADED_ROUTE,
+  OCR_WARNING_CODES,
+  createOCRUnavailableWarning,
+  createOCREngineFailedWarning,
+  createOCRLowConfidenceWarning,
+  createOCRDegradedRouteWarning,
+} from "./core/ocr/ocr-warnings.js";
+export { ensureOCRBootstrap } from "./core/ocr/ocr-bootstrap.js";
+export {
+  tesseractOCREngine,
+  TESSERACT_MANIFEST_ID,
+  markTesseractVendorReady,
+} from "./core/ocr/tesseract-engine.js";
+export { ensureTesseractBootstrap } from "./core/ocr/tesseract-bootstrap.js";
+export {
+  InMemoryStorage,
+  createIndexedDBStorage,
+  defaultOCRStorage,
+} from "./core/ocr/ocr-storage.js";
+export { IndexedDBStorage } from "./core/ocr/indexeddb-storage.js";
+export {
+  OCR_VENDOR_LOAD_FAILED,
+  TESSERACT_VENDOR_PATHS,
+  loadTesseractRuntime,
+  createTesseractWorker,
+  runRecognize,
+  disposeWorker,
+} from "./core/ocr/tesseract-runtime.js";
+export { enhanceWithOCR } from "./core/ocr/png-ocr.js";
+export { runOCRStage, getDefaultOCRLanguage } from "./core/ocr/ocr-stage.js";
+export { detectOCRLowConfidence } from "./core/ocr/ocr-validator.js";
+export {
+  isScannedPdf,
+  defaultPdfPageRasterizer,
+  setPdfPageRasterizer,
+  resetPdfPageRasterizer,
+  OCR_RASTERIZER_UNAVAILABLE,
+  OCR_RASTERIZER_FAILED,
+} from "./core/ocr/pdf-rasterizer.js";
+export {
+  runScannedPdfOCRStage,
+  MODEL_VISUAL_FIDELITY_LOST,
+  MODEL_TEXT_ORDER_HEURISTIC,
+} from "./core/ocr/scan-pdf-stage.js";
+export {
+  ocrResultToFixedLayoutPage,
+  mergeOCRResultsToFixedLayout,
+  READING_ORDER_HEURISTIC,
+} from "./core/ocr/ocr-to-fixed-layout.js";
+export { createBrowserPdfPageRasterizer } from "./core/ocr/pdf-rasterizer-browser.js";
+export {
+  createFixedLayoutModel,
+  createPage as createFixedLayoutPage,
+  createTextRun as createFixedLayoutTextRun,
+  createBbox as createFixedLayoutBbox,
+  getFixedLayoutSummary,
+} from "./core/models/fixed-layout.js";
+export { fixedLayoutToSemantic } from "./core/models/mappers.js";
 
 export function getRouteTemperature(from, to) {
   return registry.getRouteTemperature(from, to);
@@ -292,3 +410,7 @@ export function renderPreviewHtml(content, fromFormat, title = "document") {
 export function convertContent({ content, from, to, title = "document", fileName = "", options = {} }) {
   return registry.convert({ content, from, to, title, fileName, options });
 }
+
+export async function convertContentAsync({ content, from, to, title = "document", fileName = "", options = {} }) {
+  return registry.convertAsync({ content, from, to, title, fileName, options });
+}
diff --git a/public/core/format-registry.js b/public/core/format-registry.js
index 0c0d2d0..36707ac 100644
--- a/public/core/format-registry.js
+++ b/public/core/format-registry.js
@@ -1,5 +1,6 @@
 import { ConversionError } from "./conversion-error.js";
 import { ensureDocumentAudit } from "./document-audit.js";
+import { defaultRepairEngine } from "./repair-engine.js";
 import { createWarning, withWarnings } from "./warnings.js";
 
 const FORMAT_ALIASES = {
@@ -133,6 +134,10 @@ export function getAllowedOutputFormats(from) {
   return [...(PRODUCT_MATRIX_BY_INPUT[normalizeFormat(from)] || [])];
 }
 
+export function getKnownInputFormats() {
+  return Object.keys(PRODUCT_MATRIX_BY_INPUT);
+}
+
 function getPayload(model, type) {
   if (type === "SemanticDoc") return model;
   if (type === "WorkbookModel") return model.workbook;
@@ -415,8 +420,131 @@ export class ConverterRegistry {
     });
   }
 
+  _buildRepairCtx({ content, fromFormat, toFormat, title, fileName, options }) {
+    return {
+      content,
+      from: fromFormat,
+      to: toFormat,
+      title,
+      fileName,
+      options,
+      read: ({ content: readContent, from: readFrom, title: readTitle = "round-trip" }) =>
+        this.read({ content: readContent, from: readFrom, title: readTitle }),
+      write: ({ model: writeModel, to: writeTo, title: writeTitle, options: writeOptions }) =>
+        this.write({ model: writeModel, to: writeTo, title: writeTitle, options: writeOptions }),
+      prepareConversionModel: (args) => this.prepareConversionModel(args),
+      getAllowedOutputFormats,
+      getRouteDetails: (fromArg, toArg) => this.getRouteDetails(fromArg, toArg),
+    };
+  }
+
+  _wrapWithRepairCycle({ model, output, ctx, content, fromFormat, toFormat, fileName, options }) {
+    let cycle;
+    try {
+      cycle = defaultRepairEngine.runCycle({ model, output, ctx });
+    } catch (error) {
+      const repairWarning = createWarning(
+        "info",
+        "REPAIR_CYCLE_FAILED",
+        `Repair cycle skipped due to internal error: ${error?.code || error?.message || "unknown"}.`,
+        { cause: error?.code || "unknown" },
+      );
+      const audited = ensureDocumentAudit({
+        ...model,
+        metadata: withWarnings(model.metadata || {}, [repairWarning]),
+      }, {
+        content,
+        reader: fromFormat,
+        writer: toFormat,
+        targetFormat: toFormat,
+        fileName,
+        options,
+      });
+      return {
+        ...output,
+        quality: {
+          qualityReport: audited.metadata?.qualityReport || null,
+          modelReview: null,
+          autoRepair: { attempted: false, error: error?.code || "unknown", finalDecision: "failed-quality-gate" },
+          conversion: audited.metadata?.conversion || null,
+        },
+      };
+    }
+    const finalModel = ensureDocumentAudit({
+      ...cycle.model,
+      metadata: {
+        ...(cycle.model.metadata || {}),
+        autoRepair: cycle.autoRepair,
+        modelReview: cycle.modelReview,
+      },
+    }, {
+      content,
+      reader: fromFormat,
+      writer: cycle.autoRepair?.fallbackUsed ? (cycle.autoRepair.fallbackTo || toFormat) : toFormat,
+      targetFormat: cycle.autoRepair?.fallbackUsed ? (cycle.autoRepair.fallbackTo || toFormat) : toFormat,
+      fileName,
+      options,
+    });
+    const baseQualityReport = finalModel.metadata?.qualityReport || {};
+    const qualityReport = {
+      ...baseQualityReport,
+      repairStatus: cycle.autoRepair?.attempted ? "verified" : "not-attempted",
+      finalDecision: cycle.autoRepair?.finalDecision || "pending",
+    };
+    return {
+      ...cycle.output,
+      quality: {
+        qualityReport,
+        modelReview: finalModel.metadata?.modelReview || null,
+        autoRepair: finalModel.metadata?.autoRepair || null,
+        conversion: finalModel.metadata?.conversion || null,
+      },
+    };
+  }
+
   convert({ content, from, to, title = "document", fileName = "", options = {} }) {
+    const fromFormat = normalizeFormat(from);
+    const toFormat = normalizeFormat(to);
     const model = this.prepareConversionModel({ content, from, to, title, fileName, options });
-    return this.write({ model, to, title, options });
+    const output = this.write({ model, to, title, options });
+    if (options?.repair === false) {
+      return output;
+    }
+    const ctx = this._buildRepairCtx({ content, fromFormat, toFormat, title, fileName, options });
+    return this._wrapWithRepairCycle({ model, output, ctx, content, fromFormat, toFormat, fileName, options });
+  }
+
+  async convertAsync({ content, from, to, title = "document", fileName = "", options = {} }) {
+    const fromFormat = normalizeFormat(from);
+    const toFormat = normalizeFormat(to);
+    let model = this.prepareConversionModel({ content, from, to, title, fileName, options });
+
+    if (options?.ocr?.enabled !== false && fromFormat === "png") {
+      const stage = await import("./ocr/ocr-stage.js");
+      model = await
stage.runOCRStage(model, {
+        options,
+        from: fromFormat,
+        to: toFormat,
+      });
+    } else if (options?.ocr?.enabled !== false && fromFormat === "pdf") {
+      const { isScannedPdf } = await import("./ocr/pdf-rasterizer.js");
+      const detection = await isScannedPdf(content, options?.ocr || {});
+      if (detection.scanned) {
+        const stage = await import("./ocr/scan-pdf-stage.js");
+        model = await stage.runScannedPdfOCRStage(model, {
+          content,
+          options,
+          from: fromFormat,
+          to: toFormat,
+        });
+      }
+    }
+
+    const output = this.write({ model, to, title, options });
+    if (options?.repair === false) {
+      return output;
+    }
+    const ctx = this._buildRepairCtx({ content, fromFormat, toFormat, title, fileName, options });
+    return this._wrapWithRepairCycle({ model, output, ctx, content, fromFormat, toFormat, fileName, options });
   }
 }
diff --git a/public/core/model-cache/availability.js b/public/core/model-cache/availability.js
new file mode 100644
index 0000000..4987cb9
--- /dev/null
+++ b/public/core/model-cache/availability.js
@@ -0,0 +1,130 @@
+import { ConversionError } from "../conversion-error.js";
+import { validateModelManifest, summarizeManifest } from "./manifest.js";
+
+export const STATUS_NOT_DOWNLOADED = "not-downloaded";
+export const STATUS_IMPORTING = "importing";
+export const STATUS_VERIFYING = "verifying";
+export const STATUS_AVAILABLE = "available";
+export const STATUS_DEGRADED = "degraded";
+export const STATUS_DISABLED = "disabled";
+
+export const MODEL_CACHE_STATUSES = Object.freeze([
+  STATUS_NOT_DOWNLOADED,
+  STATUS_IMPORTING,
+  STATUS_VERIFYING,
+  STATUS_AVAILABLE,
+  STATUS_DEGRADED,
+  STATUS_DISABLED,
+]);
+
+function ensureStatus(status) {
+  if (!MODEL_CACHE_STATUSES.includes(status)) {
+    throw new ConversionError(`Unknown model cache status: ${status}`, {
+      category: "validate",
+      code: "MODEL_CACHE_STATUS_INVALID",
+      details: { reason: "unknown-status", status },
+    });
+  }
+}
+
+export class ModelCacheRegistry {
+  constructor() {
+    this._entries = new Map();
+    this._listeners = new Set();
+  }
+
+  register(manifest) {
+    validateModelManifest(manifest);
+    if (this._entries.has(manifest.manifestId)) {
+      throw new ConversionError(`Model manifest already registered: ${manifest.manifestId}`, {
+        category: "validate",
+        code: "MODEL_CACHE_DUPLICATE",
+        details: { manifestId: manifest.manifestId },
+      });
+    }
+    const entry = {
+      manifest,
+      status: STATUS_NOT_DOWNLOADED,
+      detail: { message: "" },
+      updatedAt: Date.now(),
+    };
+    this._entries.set(manifest.manifestId, entry);
+    this._notify({ type: "register", manifestId: manifest.manifestId, status: entry.status });
+    return entry;
+  }
+
+  unregister(manifestId) {
+    if (!this._entries.has(manifestId)) return false;
+    this._entries.delete(manifestId);
+    this._notify({ type: "unregister", manifestId });
+    return true;
+  }
+
+  has(manifestId) {
+    return this._entries.has(manifestId);
+  }
+
+  getStatus(manifestId) {
+    const entry = this._entries.get(manifestId);
+    if (!entry) return null;
+    return {
+      manifestId,
+      status: entry.status,
+      detail: { ...entry.detail },
+      updatedAt: entry.updatedAt,
+      summary: summarizeManifest(entry.manifest),
+    };
+  }
+
+  setStatus(manifestId, status, detail = {}) {
+    const entry = this._entries.get(manifestId);
+    if (!entry) {
+      throw new ConversionError(`Unknown manifestId: ${manifestId}`, {
+        category: "validate",
+        code: "MODEL_CACHE_UNKNOWN",
+        details: { manifestId },
+      });
+    }
+    ensureStatus(status);
+    entry.status = status;
+    entry.detail = { message: "", ...detail };
+    entry.updatedAt = Date.now();
+    this._notify({ type: "status", manifestId, status, detail: entry.detail });
+    return entry;
+  }
+
+  listManifests() {
+    return [...this._entries.values()]
+      .map((entry) => ({
+        manifestId: entry.manifest.manifestId,
+        manifest: entry.manifest,
+        status: entry.status,
+        detail: { ...entry.detail },
+        updatedAt: entry.updatedAt,
+      }))
+      .sort((a, b) => a.manifestId.localeCompare(b.manifestId));
+  }
+
+  reset() {
+    this._entries.clear();
+    this._notify({ type: "reset" });
+  }
+
+  onChange(callback) {
+    if (typeof callback !== "function") return () => {};
+    this._listeners.add(callback);
+    return () => this._listeners.delete(callback);
+  }
+
+  _notify(event) {
+    for (const listener of this._listeners) {
+      try {
+        listener(event);
+      } catch (error) {
+        // listener errors must not break the registry
+      }
+    }
+  }
+}
+
+export const defaultModelCache = new ModelCacheRegistry();
diff --git a/public/core/model-cache/cache-paths.js b/public/core/model-cache/cache-paths.js
new file mode 100644
index 0000000..fcec4fb
--- /dev/null
+++ b/public/core/model-cache/cache-paths.js
@@ -0,0 +1,85 @@
+import { ConversionError } from "../conversion-error.js";
+import { MODEL_ENGINES, MODEL_TASKS } from "./manifest.js";
+
+export const MODEL_CACHE_ROOT = "model-cache";
+
+function assertSlugSafe(value, field) {
+  if (typeof value !== "string" || value.trim().length === 0) {
+    throw new ConversionError(`Cache path field missing or empty: ${field}`, {
+      category: "validate",
+      code: "MODEL_CACHE_PATH_INVALID",
+      details: { reason: "missing-field", field },
+    });
+  }
+  if (!/^[a-z0-9][a-z0-9_.\-]*$/i.test(value)) {
+    throw new ConversionError(`Cache path field contains unsupported characters: ${field}=${value}`, {
+      category: "validate",
+      code: "MODEL_CACHE_PATH_INVALID",
+      details: { reason: "unsafe-slug", field, value },
+    });
+  }
+}
+
+export function getCacheKey({ task, engine, modelVersion } = {}) {
+  assertSlugSafe(task, "task");
+  assertSlugSafe(engine, "engine");
+  assertSlugSafe(modelVersion, "modelVersion");
+  if (!MODEL_TASKS.includes(task)) {
+    throw new ConversionError(`Unknown cache task: ${task}`, {
+      category: "validate",
+      code: "MODEL_CACHE_PATH_INVALID",
+      details: { reason: "unknown-task", task },
+    });
+  }
+  if (!MODEL_ENGINES.includes(engine)) {
+    throw new ConversionError(`Unknown cache engine: ${engine}`, {
+      category: "validate",
+      code: "MODEL_CACHE_PATH_INVALID",
+      details: { reason: "unknown-engine", engine },
+    });
+  }
+  return `${task}/${engine}/${modelVersion}`;
+}
+
+export function getCacheDirectory(parts) {
+  return `${MODEL_CACHE_ROOT}/${getCacheKey(parts)}`;
+}
+
+export function parseCacheKey(key) {
+  if (typeof key !== "string" || key.trim().length === 0) {
+    throw new ConversionError("parseCacheKey requires a non-empty key string.", {
+      category: "validate",
+      code: "MODEL_CACHE_PATH_INVALID",
+      details: { reason: "empty-key" },
+    });
+  }
+  const segments = key.split("/").filter(Boolean);
+  if (segments.length !== 3) {
+    throw new ConversionError(`Cache key must have 3 segments (task/engine/modelVersion): ${key}`, {
+      category: "validate",
+      code: "MODEL_CACHE_PATH_INVALID",
+      details: { reason: "segment-count", segments: segments.length },
+    });
+  }
+  const [task, engine, modelVersion] = segments;
+  getCacheKey({ task, engine, modelVersion });
+  return { task, engine, modelVersion };
+}
+
+export function getCacheFilePath(parts, fileName) {
+  if (typeof fileName !== "string" || fileName.trim().length === 0) {
+    throw new ConversionError("getCacheFilePath requires a non-empty fileName.", {
+      category: "validate",
+      code: "MODEL_CACHE_PATH_INVALID",
+      details: { reason: "empty-file-name" },
+    });
+  }
+  if (fileName.includes("..") || fileName.startsWith("/") || fileName.includes("\\")) {
+    throw new ConversionError(`Cache file name must be relative and safe: ${fileName}`, {
+      category: "validate",
+      code: "MODEL_CACHE_PATH_INVALID",
+      details: { reason: "unsafe-file-name", fileName },
+    });
+  }
+  return `${getCacheDirectory(parts)}/${fileName.replace(/^\/+/, "")}`;
+}
diff --git a/public/core/model-cache/checksum.js b/public/core/model-cache/checksum.js
new file mode 100644
index 0000000..dd2c567
--- /dev/null
+++ b/public/core/model-cache/checksum.js
@@ -0,0 +1,61 @@
+import { ConversionError } from "../conversion-error.js";
+
+function ensureSubtleCrypto() {
+  const subtle = globalThis.crypto?.subtle;
+  if (!subtle || typeof
subtle.digest !== "function") {
+    throw new ConversionError("crypto.subtle.digest is not available in the current runtime.", {
+      category: "validate",
+      code: "MODEL_CHECKSUM_UNSUPPORTED",
+      details: { reason: "subtle-crypto-missing" },
+    });
+  }
+  return subtle;
+}
+
+function coerceToArrayBuffer(input) {
+  if (input instanceof ArrayBuffer) return input;
+  if (ArrayBuffer.isView(input)) {
+    return input.buffer.slice(input.byteOffset, input.byteOffset + input.byteLength);
+  }
+  if (typeof input === "string") {
+    const encoder = new TextEncoder();
+    return encoder.encode(input).buffer;
+  }
+  throw new ConversionError("sha256Hex requires ArrayBuffer, TypedArray, or string input.", {
+    category: "validate",
+    code: "MODEL_CHECKSUM_INVALID_INPUT",
+    details: { reason: "unsupported-input-type" },
+  });
+}
+
+function bytesToHex(bytes) {
+  const out = new Array(bytes.length);
+  for (let i = 0; i < bytes.length; i += 1) {
+    out[i] = bytes[i].toString(16).padStart(2, "0");
+  }
+  return out.join("");
+}
+
+export async function sha256Hex(input) {
+  const subtle = ensureSubtleCrypto();
+  const buffer = coerceToArrayBuffer(input);
+  const digest = await subtle.digest("SHA-256", buffer);
+  return bytesToHex(new Uint8Array(digest));
+}
+
+export async function verifyChecksum(input, expectedDigest) {
+  if (typeof expectedDigest !== "string" || expectedDigest.length === 0) {
+    throw new ConversionError("verifyChecksum requires a non-empty expected digest.", {
+      category: "validate",
+      code: "MODEL_CHECKSUM_INVALID_INPUT",
+      details: { reason: "missing-expected-digest" },
+    });
+  }
+  const actual = await sha256Hex(input);
+  const normalizedExpected = expectedDigest.trim().toLowerCase();
+  return {
+    ok: actual === normalizedExpected,
+    actual,
+    expected: normalizedExpected,
+  };
+}
diff --git a/public/core/model-cache/manifest.js b/public/core/model-cache/manifest.js
new file mode 100644
index 0000000..81c2a14
--- /dev/null
+++ b/public/core/model-cache/manifest.js
@@ -0,0 +1,192 @@
+import { ConversionError } from "../conversion-error.js";
+
+export const MODEL_MANIFEST_SCHEMA_VERSION = "trans2former.model-manifest.v1";
+
+export const MODEL_TASKS = Object.freeze([
+  "ocr-text",
+  "ocr-layout",
+  "ocr-table",
+  "quality-reviewer",
+]);
+
+export const MODEL_ENGINES = Object.freeze([
+  "tesseract",
+  "paddleocr",
+  "paddleocr-vl",
+  "mineru",
+  "custom",
+]);
+
+export const MODEL_QUANTIZATIONS = Object.freeze(["fp32", "fp16", "int8", "none"]);
+
+export const FALLBACK_STRATEGIES = Object.freeze([
+  "skip-task",
+  "use-degraded-route",
+  "fail-quality-gate",
+]);
+
+function isPlainObject(value) {
+  return Boolean(value) && typeof value === "object" && !Array.isArray(value);
+}
+
+function isNonEmptyString(value) {
+  return typeof value === "string" && value.trim().length > 0;
+}
+
+function freezeDeep(value) {
+  if (Array.isArray(value)) {
+    for (const item of value) freezeDeep(item);
+    return Object.freeze(value);
+  }
+  if (isPlainObject(value)) {
+    for (const key of Object.keys(value)) freezeDeep(value[key]);
+    return Object.freeze(value);
+  }
+  return value;
+}
+
+export function createModelManifest({
+  manifestId,
+  task,
+  engine,
+  modelVersion,
+  bundleSize,
+  quantization = "none",
+  minMemoryMB = 0,
+  sources = [],
+  checksums = { algorithm: "SHA-256", digest: "", perFile: {} },
+  fallback = { onFailure: "skip-task", message: "" },
+  ui = { label: "", description: "", enableHint: "" },
+} = {}) {
+  const manifest = {
+    schemaVersion: MODEL_MANIFEST_SCHEMA_VERSION,
+    manifestId,
+    task,
+    engine,
+    modelVersion,
+    bundleSize,
+    quantization,
+    minMemoryMB,
+    sources: Array.isArray(sources) ?
sources.map((entry) => ({ ...entry })) : sources,
+    checksums: {
+      algorithm: checksums?.algorithm || "SHA-256",
+      digest: checksums?.digest || "",
+      perFile: { ...(checksums?.perFile || {}) },
+    },
+    fallback: {
+      onFailure: fallback?.onFailure || "skip-task",
+      message: fallback?.message || "",
+    },
+    ui: {
+      label: ui?.label || "",
+      description: ui?.description || "",
+      enableHint: ui?.enableHint || "",
+    },
+  };
+  validateModelManifest(manifest);
+  return freezeDeep(manifest);
+}
+
+export function validateModelManifest(manifest) {
+  if (!isPlainObject(manifest)) {
+    throw new ConversionError("Model manifest must be an object.", {
+      category: "validate",
+      code: "MODEL_MANIFEST_INVALID",
+      details: { reason: "not-an-object" },
+    });
+  }
+  if (manifest.schemaVersion !== MODEL_MANIFEST_SCHEMA_VERSION) {
+    throw new ConversionError(`Unsupported manifest schemaVersion: ${manifest.schemaVersion}`, {
+      category: "validate",
+      code: "MODEL_MANIFEST_INVALID",
+      details: { reason: "schema-version", expected: MODEL_MANIFEST_SCHEMA_VERSION },
+    });
+  }
+  for (const field of ["manifestId", "task", "engine", "modelVersion"]) {
+    if (!isNonEmptyString(manifest[field])) {
+      throw new ConversionError(`Manifest field missing or empty: ${field}`, {
+        category: "validate",
+        code: "MODEL_MANIFEST_INVALID",
+        details: { reason: "missing-field", field },
+      });
+    }
+  }
+  if (!MODEL_TASKS.includes(manifest.task)) {
+    throw new ConversionError(`Unknown manifest task: ${manifest.task}`, {
+      category: "validate",
+      code: "MODEL_MANIFEST_INVALID",
+      details: { reason: "unknown-task", task: manifest.task },
+    });
+  }
+  if (!MODEL_ENGINES.includes(manifest.engine)) {
+    throw new ConversionError(`Unknown manifest engine: ${manifest.engine}`, {
+      category: "validate",
+      code: "MODEL_MANIFEST_INVALID",
+      details: { reason: "unknown-engine", engine: manifest.engine },
+    });
+  }
+  if (typeof manifest.bundleSize !== "number" || !Number.isFinite(manifest.bundleSize) || manifest.bundleSize <= 0) {
+    throw new ConversionError("Manifest bundleSize must be a positive number.", {
+      category: "validate",
+      code: "MODEL_MANIFEST_INVALID",
+      details: { reason: "invalid-bundle-size", bundleSize: manifest.bundleSize },
+    });
+  }
+  if (manifest.quantization && !MODEL_QUANTIZATIONS.includes(manifest.quantization)) {
+    throw new ConversionError(`Unknown quantization: ${manifest.quantization}`, {
+      category: "validate",
+      code: "MODEL_MANIFEST_INVALID",
+      details: { reason: "unknown-quantization", quantization: manifest.quantization },
+    });
+  }
+  if (typeof manifest.minMemoryMB !== "number" || !Number.isFinite(manifest.minMemoryMB) || manifest.minMemoryMB < 0) {
+    throw new ConversionError("Manifest minMemoryMB must be a non-negative number.", {
+      category: "validate",
+      code: "MODEL_MANIFEST_INVALID",
+      details: { reason: "invalid-min-memory" },
+    });
+  }
+  if (!isPlainObject(manifest.checksums) || manifest.checksums.algorithm !== "SHA-256") {
+    throw new ConversionError("Manifest checksums.algorithm must equal 'SHA-256'.", {
+      category: "validate",
+      code: "MODEL_MANIFEST_INVALID",
+      details: { reason: "invalid-checksum-algorithm" },
+    });
+  }
+  if (!isNonEmptyString(manifest.checksums.digest)) {
+    throw new ConversionError("Manifest checksums.digest must be a non-empty SHA-256 hex string.", {
+      category: "validate",
+      code: "MODEL_MANIFEST_INVALID",
+      details: { reason: "missing-digest" },
+    });
+  }
+  if (!isPlainObject(manifest.fallback) || !FALLBACK_STRATEGIES.includes(manifest.fallback.onFailure)) {
+    throw new ConversionError(`Manifest fallback.onFailure must be one of ${FALLBACK_STRATEGIES.join(", ")}.`, {
+      category: "validate",
+      code: "MODEL_MANIFEST_INVALID",
+      details: { reason: "invalid-fallback", onFailure: manifest.fallback?.onFailure },
+    });
+  }
+  if (!Array.isArray(manifest.sources)) {
+    throw new ConversionError("Manifest sources must be an array.", {
+      category: "validate",
+      code: "MODEL_MANIFEST_INVALID",
+      details: { reason: "invalid-sources" },
+    });
+  }
+  return manifest;
+}
+
+export function summarizeManifest(manifest) {
+  if (!isPlainObject(manifest)) return null;
+  return {
+    manifestId: manifest.manifestId,
+    task: manifest.task,
+    engine: manifest.engine,
+    modelVersion: manifest.modelVersion,
+    bundleSize: manifest.bundleSize,
+    quantization: manifest.quantization || "none",
+    minMemoryMB: manifest.minMemoryMB || 0,
+    fallback: manifest.fallback?.onFailure || "skip-task",
+  };
+}
diff --git a/public/core/model-cache/ui-text.js b/public/core/model-cache/ui-text.js
new file mode 100644
index 0000000..8755635
--- /dev/null
+++ b/public/core/model-cache/ui-text.js
@@ -0,0 +1,62 @@
+import { MODEL_TASKS } from "./manifest.js";
+
+export const FIRST_ENABLE_HINTS = Object.freeze({
+  "ocr-text": "首次启用文字 OCR 时会从本地导入识别模型资源到 model-cache,下载完成后所有识别全部在本机执行。",
+  "ocr-layout": "首次启用版面分析时需要本地导入对应模型资源;resource 路径与体积会在确认前展示。",
+  "ocr-table": "首次启用表格恢复时需要本地导入表格识别模型;完成 checksum 校验后才会激活该路径。",
+  "quality-reviewer": "首次启用质量审核模型时本地导入模型资源,所有审核动作仅生成结构化 RepairAction,不直接替换文件字节。",
+});
+
+export const OFFLINE_FALLBACK_HINTS = Object.freeze({
+  "ocr-text": "离线或模型缺失时,文字 OCR 路径不可用;基础文本提取保留,结果会附带 OCR_UNAVAILABLE warning。",
+  "ocr-layout": "离线或模型缺失时,版面恢复退化为坐标启发式,输出会标注 LAYOUT_DEGRADED。",
+  "ocr-table": "离线或模型缺失时,表格恢复跳过模型增强,仅保留确定性 reader 结果。",
+  "quality-reviewer": "离线或模型缺失时,Repair Engine 仅运行规则驱动 validator,模型审核结果留空。",
+});
+
+export const CLEAR_CACHE_HINTS = Object.freeze({
+  "ocr-text": "清理后下次启用 OCR 时需要重新导入模型资源。已生成的转换结果不受影响。",
+  "ocr-layout": "清理版面模型不会影响已完成的转换;后续版面恢复将退回坐标启发式。",
+  "ocr-table": "清理表格模型后再次启用时需要重新导入资源;不影响历史输出。",
+  "quality-reviewer": "清理质量审核模型后,Repair Engine 仅保留规则驱动 validator;下次启用前不再有模型审核证据。",
+});
+
+const STATUS_LABELS = Object.freeze({
+  "not-downloaded": "未启用",
+  "importing": "导入中",
+  "verifying": "校验中",
+  "available": "已就绪",
+  "degraded": "降级",
+  "disabled": "已禁用",
+});
+
+const TASK_LABELS = Object.freeze({
+  "ocr-text": "文字 OCR",
+  "ocr-layout": "版面分析",
+  "ocr-table": "表格恢复",
+  "quality-reviewer": "质量审核",
+});
+
+export function getFirstEnableHint(task) {
+  return FIRST_ENABLE_HINTS[task] || "首次启用时本地导入模型资源到 model-cache,下载完成后在本机执行。";
+}
+
+export function getOfflineFallbackHint(task) {
+  return OFFLINE_FALLBACK_HINTS[task] || "离线或模型缺失时,对应能力按 fallback 策略降级或跳过。";
+}
+
+export function getClearCacheHint(task) {
+  return CLEAR_CACHE_HINTS[task] || "清理后下次启用时需要重新导入模型资源。";
+}
+
+export function getStatusLabel(status) {
+  return STATUS_LABELS[status] || status || "未知";
+}
+
+export function getTaskLabel(task) {
+  return TASK_LABELS[task] || task || "未命名任务";
+}
+
+export function listKnownTaskLabels() {
+  return MODEL_TASKS.map((task) => ({ task, label: getTaskLabel(task) }));
+}
diff --git a/public/core/models/fixed-layout.js b/public/core/models/fixed-layout.js
index 853d41d..02ccfb7 100644
--- a/public/core/models/fixed-layout.js
+++ b/public/core/models/fixed-layout.js
@@ -22,6 +22,7 @@ export function createPage({
   annotations = [],
   signatures = [],
   assets = [],
+  readingOrderHint = "",
 } = {}) {
   return {
     pageNumber: Number(pageNumber) || 0,
@@ -41,6 +42,7 @@ export function createPage({
       assetId: String(asset?.assetId || ""),
       bbox: asset?.bbox ? createBbox(asset.bbox) : null,
     })),
+    readingOrderHint: String(readingOrderHint || ""),
   };
 }
 
@@ -50,6 +52,7 @@ export function createTextRun({
   fontName = "",
   fontSize = 0,
   fontWeight = "",
+  confidence = 0,
 } = {}) {
   return {
     text: String(text ?? ""),
@@ -57,6 +60,7 @@ export function createTextRun({
     fontName: String(fontName || ""),
     fontSize: Number(fontSize) || 0,
     fontWeight: String(fontWeight || ""),
+    confidence: Number.isFinite(Number(confidence)) ? Math.max(0, Math.min(1, Number(confidence))) : 0,
   };
 }
 
diff --git a/public/core/ocr/indexeddb-storage.js b/public/core/ocr/indexeddb-storage.js
new file mode 100644
index 0000000..730b178
--- /dev/null
+++ b/public/core/ocr/indexeddb-storage.js
@@ -0,0 +1,181 @@
+import { ConversionError } from "../conversion-error.js";
+
+const DEFAULT_DB_NAME = "trans2former-ocr-cache";
+const DB_VERSION = 1;
+const STORE_TESSDATA = "tessdata";
+const STORE_METADATA = "metadata";
+
+function ensureString(value, field) {
+  if (typeof value !== "string" || value.length === 0) {
+    throw new ConversionError(`OCR storage ${field} must be a non-empty string.`, {
+      category: "validate",
+      code: "OCR_STORAGE_INVALID_KEY",
+      details: { reason: "missing-field", field },
+    });
+  }
+}
+
+function ensureBuffer(value, field) {
+  if (value instanceof ArrayBuffer) return value;
+  if (ArrayBuffer.isView(value)) {
+    return value.buffer.slice(value.byteOffset, value.byteOffset + value.byteLength);
+  }
+  throw new ConversionError(`OCR storage ${field} must be ArrayBuffer or TypedArray.`, {
+    category: "validate",
+    code: "OCR_STORAGE_INVALID_VALUE",
+    details: { reason: "unsupported-input-type", field },
+  });
+}
+
+function awaitRequest(request) {
+  return new Promise((resolve, reject) => {
+    request.onsuccess = () => resolve(request.result);
+    request.onerror = () => reject(request.error || new Error("IndexedDB request failed"));
+  });
+}
+
+function awaitTransaction(tx) {
+  return new Promise((resolve, reject) => {
+    tx.oncomplete = () => resolve();
+    tx.onabort = () => reject(tx.error || new Error("IndexedDB transaction aborted"));
+    tx.onerror = () => reject(tx.error || new Error("IndexedDB transaction failed"));
+  });
+}
+
+function openDatabase(dbName) {
+  return new Promise((resolve, reject) => {
+    const request = globalThis.indexedDB.open(dbName, DB_VERSION);
+    request.onupgradeneeded = (event) => {
+      const db = request.result;
+      if (!db.objectStoreNames.contains(STORE_TESSDATA)) {
+        db.createObjectStore(STORE_TESSDATA);
+      }
+      if (!db.objectStoreNames.contains(STORE_METADATA)) {
+        db.createObjectStore(STORE_METADATA);
+      }
+    };
+    request.onsuccess = () => resolve(request.result);
+    request.onerror = () => reject(request.error || new Error("IndexedDB open failed"));
+    request.onblocked = () => reject(new Error("IndexedDB upgrade blocked by another connection"));
+  });
+}
+
+function wrapIdbError(operation, error) {
+  return new ConversionError(`IndexedDB ${operation} failed: ${error?.message || error}`, {
+    category: "convert",
+    code: "OCR_STORAGE_IDB_ERROR",
+    details: { operation, cause: String(error?.name || error?.message || "unknown") },
+  });
+}
+
+export class IndexedDBStorage {
+  constructor(dbName = DEFAULT_DB_NAME) {
+    this._dbName = dbName;
+    this._dbPromise = null;
+  }
+
+  async _db() {
+    if (!this._dbPromise) {
+      this._dbPromise = openDatabase(this._dbName).catch((error) => {
+        this._dbPromise = null;
+        throw wrapIdbError("open", error);
+      });
+    }
+    return this._dbPromise;
+  }
+
+  async has(key) {
+    ensureString(key, "key");
+    const db = await this._db();
+    try {
+      const tx = db.transaction(STORE_METADATA, "readonly");
+      const store = tx.objectStore(STORE_METADATA);
+      const value = await awaitRequest(store.getKey(key));
+      return value !== undefined;
+    } catch (error) {
+      throw wrapIdbError("has", error);
+    }
+  }
+
+  async get(key) {
+    ensureString(key, "key");
+    const db = await this._db();
+    try {
+      const tx = db.transaction(STORE_TESSDATA, "readonly");
+      const store = tx.objectStore(STORE_TESSDATA);
+      const value = await awaitRequest(store.get(key));
+      if (value === undefined) return null;
+      if (value instanceof ArrayBuffer) return value.slice(0);
+      if (ArrayBuffer.isView(value)) return value.buffer.slice(value.byteOffset, value.byteOffset + value.byteLength);
+      return null;
+    } catch (error) {
+      throw wrapIdbError("get", error);
+    }
+  }
+
+  async put(key, value, meta = {}) {
+    ensureString(key, "key");
+    const buffer =
ensureBuffer(value, "value"); + const db = await this._db(); + try { + const tx = db.transaction([STORE_TESSDATA, STORE_METADATA], "readwrite"); + tx.objectStore(STORE_TESSDATA).put(buffer, key); + tx.objectStore(STORE_METADATA).put( + { + size: buffer.byteLength, + sha256: typeof meta?.sha256 === "string" ? meta.sha256 : "", + updatedAt: Date.now(), + }, + key, + ); + await awaitTransaction(tx); + } catch (error) { + throw wrapIdbError("put", error); + } + } + + async delete(key) { + ensureString(key, "key"); + const db = await this._db(); + try { + const tx = db.transaction([STORE_TESSDATA, STORE_METADATA], "readwrite"); + tx.objectStore(STORE_TESSDATA).delete(key); + tx.objectStore(STORE_METADATA).delete(key); + await awaitTransaction(tx); + return true; + } catch (error) { + throw wrapIdbError("delete", error); + } + } + + async list() { + const db = await this._db(); + try { + const tx = db.transaction(STORE_METADATA, "readonly"); + const store = tx.objectStore(STORE_METADATA); + const keys = await awaitRequest(store.getAllKeys()); + const values = await awaitRequest(store.getAll()); + const entries = keys.map((key, index) => ({ + key: String(key), + size: values[index]?.size ?? 0, + sha256: values[index]?.sha256 ?? "", + updatedAt: values[index]?.updatedAt ?? 
0, + })); + return entries.sort((a, b) => a.key.localeCompare(b.key)); + } catch (error) { + throw wrapIdbError("list", error); + } + } + + async clear() { + const db = await this._db(); + try { + const tx = db.transaction([STORE_TESSDATA, STORE_METADATA], "readwrite"); + tx.objectStore(STORE_TESSDATA).clear(); + tx.objectStore(STORE_METADATA).clear(); + await awaitTransaction(tx); + } catch (error) { + throw wrapIdbError("clear", error); + } + } +} diff --git a/public/core/ocr/ocr-bootstrap.js b/public/core/ocr/ocr-bootstrap.js new file mode 100644 index 0000000..3c09486 --- /dev/null +++ b/public/core/ocr/ocr-bootstrap.js @@ -0,0 +1,51 @@ +import { + defaultModelCache, + STATUS_DISABLED, +} from "../model-cache/availability.js"; +import { createModelManifest } from "../model-cache/manifest.js"; +import { defaultOCRRegistry } from "./ocr-engine.js"; +import { placeholderOCREngine, PLACEHOLDER_OCR_MANIFEST_ID } from "./placeholder-engine.js"; + +let bootstrapped = false; + +export function ensureOCRBootstrap() { + if (bootstrapped) return; + bootstrapped = true; + + if (!defaultOCRRegistry.has(placeholderOCREngine.id)) { + defaultOCRRegistry.register(placeholderOCREngine); + } + + if (!defaultModelCache.has(PLACEHOLDER_OCR_MANIFEST_ID)) { + const manifest = createModelManifest({ + manifestId: PLACEHOLDER_OCR_MANIFEST_ID, + task: "ocr-text", + engine: "custom", + modelVersion: "0.1.0", + bundleSize: 1, + quantization: "none", + minMemoryMB: 0, + sources: [{ kind: "vendor-bundle", path: "placeholder" }], + checksums: { + algorithm: "SHA-256", + digest: "0".repeat(64), + perFile: {}, + }, + fallback: { + onFailure: "use-degraded-route", + message: "P9-A placeholder manifest; waiting for a real model to be wired in", + }, + ui: { + label: "OCR text recognition · placeholder", + description: "P9-A.1 only registers the contract; wiring in the real Tesseract.js is left to P9-A.2", + enableHint: "Placeholder manifest; it never triggers a download", + }, + }); + defaultModelCache.register(manifest); + defaultModelCache.setStatus(PLACEHOLDER_OCR_MANIFEST_ID, STATUS_DISABLED, { + message: "P9-A.1 placeholder entry; enabling real OCR 
is left to P9-A.2", + }); + } +} + +ensureOCRBootstrap(); diff --git a/public/core/ocr/ocr-engine.js b/public/core/ocr/ocr-engine.js new file mode 100644 index 0000000..b203f0a --- /dev/null +++ b/public/core/ocr/ocr-engine.js @@ -0,0 +1,135 @@ +import { ConversionError } from "../conversion-error.js"; + +const REQUIRED_FIELDS = ["id", "taskCapabilities", "isAvailable", "recognize"]; + +function isPlainObject(value) { + return Boolean(value) && typeof value === "object" && !Array.isArray(value); +} + +function validateEngineShape(engine) { + if (!isPlainObject(engine)) { + throw new ConversionError("OCR engine must be an object.", { + category: "validate", + code: "OCR_ENGINE_INVALID", + details: { reason: "not-an-object" }, + }); + } + for (const field of REQUIRED_FIELDS) { + if (engine[field] === undefined || engine[field] === null) { + throw new ConversionError(`OCR engine missing required field: ${field}`, { + category: "validate", + code: "OCR_ENGINE_INVALID", + details: { reason: "missing-field", field }, + }); + } + } + if (typeof engine.id !== "string" || engine.id.trim().length === 0) { + throw new ConversionError("OCR engine id must be a non-empty string.", { + category: "validate", + code: "OCR_ENGINE_INVALID", + details: { reason: "invalid-id" }, + }); + } + if (!Array.isArray(engine.taskCapabilities) || engine.taskCapabilities.length === 0) { + throw new ConversionError("OCR engine taskCapabilities must be a non-empty array.", { + category: "validate", + code: "OCR_ENGINE_INVALID", + details: { reason: "invalid-task-capabilities" }, + }); + } + if (typeof engine.isAvailable !== "function") { + throw new ConversionError("OCR engine isAvailable must be a function.", { + category: "validate", + code: "OCR_ENGINE_INVALID", + details: { reason: "invalid-isAvailable" }, + }); + } + if (typeof engine.recognize !== "function") { + throw new ConversionError("OCR engine recognize must be a function.", { + category: "validate", + code: "OCR_ENGINE_INVALID", + details: { 
reason: "invalid-recognize" }, + }); + } +} + +export class OCREngineRegistry { + constructor() { + this._engines = new Map(); + this._order = []; + this._listeners = new Set(); + } + + register(engine) { + validateEngineShape(engine); + if (this._engines.has(engine.id)) { + throw new ConversionError(`OCR engine already registered: ${engine.id}`, { + category: "validate", + code: "OCR_ENGINE_DUPLICATE", + details: { engineId: engine.id }, + }); + } + this._engines.set(engine.id, engine); + this._order.push(engine.id); + this._notify({ type: "register", engineId: engine.id }); + return engine; + } + + unregister(id) { + if (!this._engines.has(id)) return false; + this._engines.delete(id); + this._order = this._order.filter((entry) => entry !== id); + this._notify({ type: "unregister", engineId: id }); + return true; + } + + has(id) { + return this._engines.has(id); + } + + list() { + return this._order.map((id) => this._engines.get(id)).filter(Boolean); + } + + pickById(id) { + return this._engines.get(id) || null; + } + + pickForTask(task) { + const candidates = this.list().filter((engine) => engine.taskCapabilities.includes(task)); + if (candidates.length === 0) return null; + const available = candidates.find((engine) => { + try { + return engine.isAvailable() === true; + } catch (error) { + return false; + } + }); + if (available) return available; + return candidates[candidates.length - 1]; + } + + onChange(callback) { + if (typeof callback !== "function") return () => {}; + this._listeners.add(callback); + return () => this._listeners.delete(callback); + } + + reset() { + this._engines.clear(); + this._order = []; + this._notify({ type: "reset" }); + } + + _notify(event) { + for (const listener of this._listeners) { + try { + listener(event); + } catch (error) { + // listener errors must not break the registry + } + } + } +} + +export const defaultOCRRegistry = new OCREngineRegistry(); diff --git a/public/core/ocr/ocr-result.js b/public/core/ocr/ocr-result.js 
new file mode 100644 index 0000000..a8affce --- /dev/null +++ b/public/core/ocr/ocr-result.js @@ -0,0 +1,180 @@ +import { ConversionError } from "../conversion-error.js"; + +export const OCR_RESULT_SCHEMA_VERSION = "trans2former.ocr-result.v1"; + +export const OCR_LANGUAGES = Object.freeze([ + "zh-CN", + "zh-TW", + "en", + "ja", + "ko", + "auto", +]); + +function isPlainObject(value) { + return Boolean(value) && typeof value === "object" && !Array.isArray(value); +} + +function freezeDeep(value) { + if (Array.isArray(value)) { + for (const item of value) freezeDeep(item); + return Object.freeze(value); + } + if (isPlainObject(value)) { + for (const key of Object.keys(value)) freezeDeep(value[key]); + return Object.freeze(value); + } + return value; +} + +function isNonNegativeNumber(value) { + return typeof value === "number" && Number.isFinite(value) && value >= 0; +} + +function isUnitInterval(value) { + return typeof value === "number" && Number.isFinite(value) && value >= 0 && value <= 1; +} + +export function createOCRResult({ + language = "auto", + pages = [], + fullText = "", + averageConfidence = 0, + runtimeMs = 0, + engine = "", + modelVersion = "", + warnings = [], +} = {}) { + const result = { + schemaVersion: OCR_RESULT_SCHEMA_VERSION, + language, + pages: Array.isArray(pages) ? pages.map((page) => ({ + pageIndex: page?.pageIndex ?? 0, + width: page?.width ?? 0, + height: page?.height ?? 0, + lines: Array.isArray(page?.lines) ? page.lines.map((line) => ({ + text: String(line?.text ?? ""), + confidence: line?.confidence ?? 0, + bbox: line?.bbox + ? { + x: line.bbox.x ?? 0, + y: line.bbox.y ?? 0, + w: line.bbox.w ?? 0, + h: line.bbox.h ?? 0, + } + : null, + })) : [], + })) : pages, + fullText: String(fullText || ""), + averageConfidence, + runtimeMs, + engine: String(engine || ""), + modelVersion: String(modelVersion || ""), + warnings: Array.isArray(warnings) ? 
warnings.map((w) => ({ ...w })) : warnings, + }; + validateOCRResult(result); + return freezeDeep(result); +} + +export function validateOCRResult(result) { + if (!isPlainObject(result)) { + throw new ConversionError("OCR result must be an object.", { + category: "validate", + code: "OCR_RESULT_INVALID", + details: { reason: "not-an-object" }, + }); + } + if (result.schemaVersion !== OCR_RESULT_SCHEMA_VERSION) { + throw new ConversionError(`Unsupported OCR result schemaVersion: ${result.schemaVersion}`, { + category: "validate", + code: "OCR_RESULT_INVALID", + details: { reason: "schema-version", expected: OCR_RESULT_SCHEMA_VERSION }, + }); + } + if (!OCR_LANGUAGES.includes(result.language)) { + throw new ConversionError(`Unknown OCR language: ${result.language}`, { + category: "validate", + code: "OCR_RESULT_INVALID", + details: { reason: "unknown-language", language: result.language }, + }); + } + if (!Array.isArray(result.pages)) { + throw new ConversionError("OCR result pages must be an array.", { + category: "validate", + code: "OCR_RESULT_INVALID", + details: { reason: "invalid-pages" }, + }); + } + for (let i = 0; i < result.pages.length; i += 1) { + const page = result.pages[i]; + if (!isPlainObject(page)) { + throw new ConversionError(`OCR page ${i} must be an object.`, { + category: "validate", + code: "OCR_RESULT_INVALID", + details: { reason: "invalid-page", index: i }, + }); + } + if (!isNonNegativeNumber(page.pageIndex) || !isNonNegativeNumber(page.width) || !isNonNegativeNumber(page.height)) { + throw new ConversionError(`OCR page ${i} pageIndex/width/height must be non-negative numbers.`, { + category: "validate", + code: "OCR_RESULT_INVALID", + details: { reason: "invalid-page-geometry", index: i }, + }); + } + if (!Array.isArray(page.lines)) { + throw new ConversionError(`OCR page ${i} lines must be an array.`, { + category: "validate", + code: "OCR_RESULT_INVALID", + details: { reason: "invalid-lines", index: i }, + }); + } + for (let j = 0; j < 
page.lines.length; j += 1) { + const line = page.lines[j]; + if (typeof line?.text !== "string") { + throw new ConversionError(`OCR line ${i}.${j} text must be a string.`, { + category: "validate", + code: "OCR_RESULT_INVALID", + details: { reason: "invalid-line-text", page: i, line: j }, + }); + } + if (!isUnitInterval(line.confidence)) { + throw new ConversionError(`OCR line ${i}.${j} confidence must be in [0, 1].`, { + category: "validate", + code: "OCR_RESULT_INVALID", + details: { reason: "invalid-line-confidence", page: i, line: j }, + }); + } + } + } + if (!isUnitInterval(result.averageConfidence)) { + throw new ConversionError("OCR averageConfidence must be in [0, 1].", { + category: "validate", + code: "OCR_RESULT_INVALID", + details: { reason: "invalid-average-confidence" }, + }); + } + if (!isNonNegativeNumber(result.runtimeMs)) { + throw new ConversionError("OCR runtimeMs must be non-negative.", { + category: "validate", + code: "OCR_RESULT_INVALID", + details: { reason: "invalid-runtime" }, + }); + } + return result; +} + +export function summarizeOCRResult(result) { + if (!isPlainObject(result)) return null; + const pages = Array.isArray(result.pages) ? result.pages : []; + const lineCount = pages.reduce((sum, page) => sum + (Array.isArray(page.lines) ? page.lines.length : 0), 0); + return { + pageCount: pages.length, + lineCount, + averageConfidence: result.averageConfidence ?? 0, + fullTextLength: typeof result.fullText === "string" ? result.fullText.length : 0, + engine: result.engine || "", + modelVersion: result.modelVersion || "", + runtimeMs: result.runtimeMs ?? 
0, + language: result.language || "auto", + }; +} diff --git a/public/core/ocr/ocr-stage.js b/public/core/ocr/ocr-stage.js new file mode 100644 index 0000000..7565989 --- /dev/null +++ b/public/core/ocr/ocr-stage.js @@ -0,0 +1,37 @@ +import { defaultOCRRegistry } from "./ocr-engine.js"; +import { enhanceWithOCR } from "./png-ocr.js"; +import { createOCREngineFailedWarning } from "./ocr-warnings.js"; +import { withWarnings } from "../warnings.js"; + +const DEFAULT_LANGUAGE = "chi_sim"; + +function shouldSkip(ctx) { + return Boolean(ctx?.options?.ocr?.enabled === false); +} + +export async function runOCRStage(model, ctx = {}) { + if (shouldSkip(ctx)) return model; + const registry = ctx.ocrRegistry || defaultOCRRegistry; + const engine = ctx.ocrEngine || registry.pickForTask("ocr-text"); + if (!engine) return model; + try { + const enhanced = await enhanceWithOCR(model, { engine, registry }); + return enhanced; + } catch (error) { + return { + ...model, + metadata: withWarnings(model.metadata || {}, [ + createOCREngineFailedWarning({ + engineId: engine?.id || "unknown", + manifestId: engine?.manifestId || "", + reason: error?.code || "stage-failed", + cause: error?.message || String(error), + }), + ]), + }; + } +} + +export function getDefaultOCRLanguage(ctx = {}) { + return ctx?.options?.ocr?.language || DEFAULT_LANGUAGE; +} diff --git a/public/core/ocr/ocr-storage.js b/public/core/ocr/ocr-storage.js new file mode 100644 index 0000000..bdf423d --- /dev/null +++ b/public/core/ocr/ocr-storage.js @@ -0,0 +1,118 @@ +import { ConversionError } from "../conversion-error.js"; + +function ensureString(value, field) { + if (typeof value !== "string" || value.length === 0) { + throw new ConversionError(`OCR storage ${field} must be a non-empty string.`, { + category: "validate", + code: "OCR_STORAGE_INVALID_KEY", + details: { reason: "missing-field", field }, + }); + } +} + +function ensureBuffer(value, field) { + if (value instanceof ArrayBuffer) return value; + if 
(ArrayBuffer.isView(value)) { + return value.buffer.slice(value.byteOffset, value.byteOffset + value.byteLength); + } + throw new ConversionError(`OCR storage ${field} must be ArrayBuffer or TypedArray.`, { + category: "validate", + code: "OCR_STORAGE_INVALID_VALUE", + details: { reason: "unsupported-input-type", field }, + }); +} + +export class InMemoryStorage { + constructor() { + this._entries = new Map(); + } + + async has(key) { + ensureString(key, "key"); + return this._entries.has(key); + } + + async get(key) { + ensureString(key, "key"); + const entry = this._entries.get(key); + if (!entry) return null; + // Return a copy to prevent caller mutation + return entry.value.slice(0); + } + + async put(key, value, meta = {}) { + ensureString(key, "key"); + const buffer = ensureBuffer(value, "value"); + this._entries.set(key, { + value: buffer, + size: buffer.byteLength, + sha256: typeof meta?.sha256 === "string" ? meta.sha256 : "", + updatedAt: Date.now(), + }); + } + + async delete(key) { + ensureString(key, "key"); + return this._entries.delete(key); + } + + async list() { + return [...this._entries.entries()] + .map(([key, entry]) => ({ + key, + size: entry.size, + sha256: entry.sha256, + updatedAt: entry.updatedAt, + })) + .sort((a, b) => a.key.localeCompare(b.key)); + } + + async clear() { + this._entries.clear(); + } +} + +let _indexedDBStorageCtorPromise = null; + +async function loadIndexedDBStorageCtor() { + if (!_indexedDBStorageCtorPromise) { + _indexedDBStorageCtorPromise = import("./indexeddb-storage.js") + .then((mod) => mod.IndexedDBStorage) + .catch((error) => { + _indexedDBStorageCtorPromise = null; + throw error; + }); + } + return _indexedDBStorageCtorPromise; +} + +class LazyIndexedDBStorage { + constructor(dbName) { + this._dbName = dbName; + this._delegate = null; + } + + async _resolve() { + if (!this._delegate) { + const Ctor = await loadIndexedDBStorageCtor(); + this._delegate = new Ctor(this._dbName); + } + return this._delegate; + } + 
+ async has(key) { return (await this._resolve()).has(key); } + async get(key) { return (await this._resolve()).get(key); } + async put(key, value, meta) { return (await this._resolve()).put(key, value, meta); } + async delete(key) { return (await this._resolve()).delete(key); } + async list() { return (await this._resolve()).list(); } + async clear() { return (await this._resolve()).clear(); } +} + +export function createIndexedDBStorage(dbName = "trans2former-ocr-cache") { + if (typeof globalThis !== "undefined" && globalThis.indexedDB) { + return new LazyIndexedDBStorage(dbName); + } + return new InMemoryStorage(); +} + +export const defaultOCRStorage = createIndexedDBStorage(); diff --git a/public/core/ocr/ocr-to-fixed-layout.js b/public/core/ocr/ocr-to-fixed-layout.js new file mode 100644 index 0000000..944dc46 --- /dev/null +++ b/public/core/ocr/ocr-to-fixed-layout.js @@ -0,0 +1,80 @@ +import { + createFixedLayoutModel, + createPage, + createTextRun, +} from "../models/fixed-layout.js"; + +export const READING_ORDER_HEURISTIC = "heuristic-yx"; + +function sortLinesByReadingOrder(lines) { + return [...lines].sort((a, b) => { + const ay = a?.bbox?.y ?? 0; + const by = b?.bbox?.y ?? 0; + if (ay !== by) return ay - by; + const ax = a?.bbox?.x ?? 0; + const bx = b?.bbox?.x ?? 0; + return ax - bx; + }); +} + +function pageDimensions(page) { + if (!page) return { width: 0, height: 0 }; + return { + width: Number(page.width) || 0, + height: Number(page.height) || 0, + }; +} + +export function ocrResultToFixedLayoutPage(ocrResult, { pageNumber = 1, pageIndex = 0 } = {}) { + const sourcePage = Array.isArray(ocrResult?.pages) ? ocrResult.pages[pageIndex] : null; + const lines = Array.isArray(sourcePage?.lines) ? 
sourcePage.lines : []; + const sorted = sortLinesByReadingOrder(lines); + const { width, height } = pageDimensions(sourcePage); + return createPage({ + pageNumber, + size: { width, height, unit: "px" }, + textRuns: sorted.map((line) => createTextRun({ + text: line.text || "", + bbox: line.bbox || null, + confidence: line.confidence ?? 0, + })), + readingOrderHint: READING_ORDER_HEURISTIC, + }); +} + +export function mergeOCRResultsToFixedLayout(results = [], { language = "auto", engine = "", modelVersion = "" } = {}) { + if (!Array.isArray(results)) results = []; + const pages = []; + let totalRuns = 0; + let totalConfidence = 0; + let totalConfidenceSamples = 0; + let runtimeMsTotal = 0; + for (let i = 0; i < results.length; i += 1) { + const page = ocrResultToFixedLayoutPage(results[i], { pageNumber: i + 1, pageIndex: 0 }); + pages.push(page); + totalRuns += page.textRuns.length; + runtimeMsTotal += Number(results[i]?.runtimeMs) || 0; + if (typeof results[i]?.averageConfidence === "number") { + totalConfidence += results[i].averageConfidence; + totalConfidenceSamples += 1; + } + } + const averageConfidence = totalConfidenceSamples > 0 + ? 
totalConfidence / totalConfidenceSamples + : 0; + return createFixedLayoutModel({ + pages, + metadata: { + readingOrder: READING_ORDER_HEURISTIC, + ocr: { + language: language || results[0]?.language || "auto", + pageCount: pages.length, + textRunCount: totalRuns, + averageConfidence, + runtimeMs: runtimeMsTotal, + engine: engine || results[0]?.engine || "", + modelVersion: modelVersion || results[0]?.modelVersion || "", + }, + }, + }); +} diff --git a/public/core/ocr/ocr-validator.js b/public/core/ocr/ocr-validator.js new file mode 100644 index 0000000..e8700eb --- /dev/null +++ b/public/core/ocr/ocr-validator.js @@ -0,0 +1,51 @@ +import { createRepairAction } from "../repair-actions.js"; + +const LOW_CONFIDENCE_THRESHOLD = 0.55; +const MAX_ACTIONS_PER_PAGE = 8; + +function isPlainObject(value) { + return Boolean(value) && typeof value === "object" && !Array.isArray(value); +} + +function gatherLowConfidenceLines(model) { + const ocr = model?.metadata?.ocr; + if (!isPlainObject(ocr) || !Array.isArray(ocr.lines)) return []; + return ocr.lines.filter((line) => typeof line?.confidence === "number" && line.confidence < LOW_CONFIDENCE_THRESHOLD); +} + +export function detectOCRLowConfidence(model) { + const lowLines = gatherLowConfidenceLines(model); + if (lowLines.length === 0) return []; + const candidates = []; + const seen = new Set(); + for (const line of lowLines) { + if (candidates.length >= MAX_ACTIONS_PER_PAGE) break; + if (typeof line.text !== "string" || line.text.length === 0) continue; + const targetId = typeof line.blockId === "string" && line.blockId.length > 0 + ? line.blockId + : `ocr-line-${line.pageIndex ?? 0}-${line.lineIndex ?? 
0}`; + if (seen.has(targetId)) continue; + seen.add(targetId); + candidates.push(createRepairAction({ + actionType: "replaceTextRun", + targetId, + before: line.text, + after: line.text, + confidence: 1 - line.confidence, + evidence: { + source: "ocr-low-confidence", + ocrConfidence: line.confidence, + threshold: LOW_CONFIDENCE_THRESHOLD, + engineId: model?.metadata?.modelReview?.engine || "", + modelVersion: model?.metadata?.modelReview?.modelVersion || "", + language: model?.metadata?.modelReview?.ocr?.language || "auto", + bbox: line.bbox || null, + pageIndex: line.pageIndex ?? null, + lineIndex: line.lineIndex ?? null, + }, + targetField: "text", + modelVersion: model?.metadata?.modelReview?.modelVersion || "rule-based", + })); + } + return candidates; +} diff --git a/public/core/ocr/ocr-warnings.js b/public/core/ocr/ocr-warnings.js new file mode 100644 index 0000000..40df68e --- /dev/null +++ b/public/core/ocr/ocr-warnings.js @@ -0,0 +1,66 @@ +import { createWarning } from "../warnings.js"; + +export const OCR_UNAVAILABLE = "OCR_UNAVAILABLE"; +export const OCR_LOW_CONFIDENCE = "OCR_LOW_CONFIDENCE"; +export const OCR_ENGINE_FAILED = "OCR_ENGINE_FAILED"; +export const OCR_DEGRADED_ROUTE = "OCR_DEGRADED_ROUTE"; + +export const OCR_WARNING_CODES = Object.freeze([ + OCR_UNAVAILABLE, + OCR_LOW_CONFIDENCE, + OCR_ENGINE_FAILED, + OCR_DEGRADED_ROUTE, +]); + +export function createOCRUnavailableWarning(details = {}) { + const engineId = details.engineId || "placeholder"; + const manifestId = details.manifestId || "ocr-text.placeholder.0.1.0"; + return createWarning( + "info", + OCR_UNAVAILABLE, + "The OCR model is not enabled, so text in the image was not recognized; only the asset reference and a readable fallback are kept.", + { engineId, manifestId, reason: details.reason || "engine-not-enabled", task: details.task || "ocr-text" }, + ); +} + +export function createOCREngineFailedWarning(details = {}) { + return createWarning( + "lossy", + OCR_ENGINE_FAILED, + `OCR engine execution failed: ${details.reason || details.cause || "unknown reason"}. Falling back to the asset-reference route.`, + { + engineId: 
details.engineId || "unknown", + manifestId: details.manifestId || "", + cause: details.cause || details.reason || "unknown", + }, + ); +} + +export function createOCRLowConfidenceWarning(details = {}) { + const confidence = typeof details.averageConfidence === "number" + ? details.averageConfidence.toFixed(2) + : "unknown"; + return createWarning( + "lossy", + OCR_LOW_CONFIDENCE, + `Average OCR confidence is low (${confidence}); treat the result as a reference only.`, + { + averageConfidence: details.averageConfidence ?? null, + threshold: details.threshold ?? null, + engineId: details.engineId || "", + }, + ); +} + +export function createOCRDegradedRouteWarning(details = {}) { + return createWarning( + "info", + OCR_DEGRADED_ROUTE, + "The OCR model is not ready; this route outputs in degraded mode (readable fallback only).", + { + task: details.task || "ocr-text", + manifestId: details.manifestId || "", + reason: details.reason || "engine-not-enabled", + }, + ); +} diff --git a/public/core/ocr/pdf-rasterizer-browser.js b/public/core/ocr/pdf-rasterizer-browser.js new file mode 100644 index 0000000..c038336 --- /dev/null +++ b/public/core/ocr/pdf-rasterizer-browser.js @@ -0,0 +1,136 @@ +import { ConversionError } from "../conversion-error.js"; + +const VENDOR_PDFJS = "/vendor/pdfjs/pdf.min.mjs"; + +function ensureBrowserRuntime() { + if (typeof globalThis === "undefined") { + throw new ConversionError("Browser PDF rasterizer needs a DOM runtime.", { + category: "convert", + code: "OCR_RASTERIZER_UNAVAILABLE", + details: { reason: "missing-globalThis" }, + }); + } + if (typeof globalThis.document?.createElement !== "function") { + throw new ConversionError("Browser PDF rasterizer needs document.createElement.", { + category: "convert", + code: "OCR_RASTERIZER_UNAVAILABLE", + details: { reason: "missing-document" }, + }); + } +} + +function decodePdfContent(content) { + if (content instanceof Uint8Array) return content; + if (content instanceof ArrayBuffer) return new Uint8Array(content); + if (typeof content === "string") { + if (content.startsWith("data:")) { + const 
commaIdx = content.indexOf(","); + const meta = content.slice(5, commaIdx); + const isBase64 = meta.includes(";base64"); + const payload = content.slice(commaIdx + 1); + if (isBase64) { + const decoded = globalThis.atob(payload); + const bytes = new Uint8Array(decoded.length); + for (let i = 0; i < decoded.length; i += 1) bytes[i] = decoded.charCodeAt(i); + return bytes; + } + return new TextEncoder().encode(decodeURIComponent(payload)); + } + const bytes = new Uint8Array(content.length); + for (let i = 0; i < content.length; i += 1) bytes[i] = content.charCodeAt(i) & 0xff; + return bytes; + } + throw new ConversionError("Unsupported PDF content type for browser rasterizer.", { + category: "validate", + code: "OCR_RASTERIZER_FAILED", + details: { reason: "unsupported-content-type" }, + }); +} + +async function loadPdfJs(vendorUrl = VENDOR_PDFJS) { + try { + const mod = await import(/* @vite-ignore */ vendorUrl); + if (typeof mod?.getDocument !== "function") { + throw new Error("vendor pdfjs missing getDocument"); + } + return mod; + } catch (error) { + throw new ConversionError(`Failed to load the vendored PDF.js: ${error?.message || error}`, { + category: "convert", + code: "OCR_RASTERIZER_UNAVAILABLE", + details: { reason: "vendor-pdfjs-load-failed", cause: String(error?.name || error?.message || "unknown"), vendorUrl }, + }); + } +} + +async function openDocument(pdfjs, content) { + const data = decodePdfContent(content); + const loadingTask = pdfjs.getDocument({ data, isEvalSupported: false, disableFontFace: true }); + try { + return await loadingTask.promise; + } catch (error) { + throw new ConversionError(`Failed to parse the PDF document: ${error?.message || error}`, { + category: "convert", + code: "OCR_RASTERIZER_FAILED", + details: { reason: "pdf-document-open-failed", cause: String(error?.name || error?.message || "unknown") }, + }); + } +} + +export function createBrowserPdfPageRasterizer({ vendorUrl = VENDOR_PDFJS } = {}) { + let pdfjsPromise = null; + async function getPdfJs() { + if 
(!pdfjsPromise) pdfjsPromise = loadPdfJs(vendorUrl); + return pdfjsPromise; + } + + return Object.freeze({ + async countPages({ content }) { + ensureBrowserRuntime(); + const pdfjs = await getPdfJs(); + const document = await openDocument(pdfjs, content); + try { + return document.numPages; + } finally { + if (typeof document.destroy === "function") document.destroy(); + } + }, + async rasterize({ content, pageIndex = 0, dpi = 144 }) { + ensureBrowserRuntime(); + const pdfjs = await getPdfJs(); + const document = await openDocument(pdfjs, content); + try { + const page = await document.getPage(pageIndex + 1); + try { + const scale = Math.max(0.5, dpi / 72); + const viewport = page.getViewport({ scale }); + const canvas = globalThis.document.createElement("canvas"); + canvas.width = Math.ceil(viewport.width); + canvas.height = Math.ceil(viewport.height); + const canvasContext = canvas.getContext("2d"); + if (!canvasContext) { + throw new ConversionError("Canvas 2D context is unavailable; cannot rasterize.", { + category: "convert", + code: "OCR_RASTERIZER_FAILED", + details: { reason: "canvas-context-missing" }, + }); + } + await page.render({ canvasContext, viewport }).promise; + const dataUrl = canvas.toDataURL("image/png"); + return { dataUrl, width: canvas.width, height: canvas.height }; + } finally { + if (typeof page.cleanup === "function") page.cleanup(); + } + } catch (error) { + if (error instanceof ConversionError) throw error; + throw new ConversionError(`Failed to rasterize PDF page ${pageIndex}: ${error?.message || error}`, { + category: "convert", + code: "OCR_RASTERIZER_FAILED", + details: { reason: "page-render-failed", pageIndex, cause: String(error?.name || error?.message || "unknown") }, + }); + } finally { + if (typeof document.destroy === "function") document.destroy(); + } + }, + }); +} diff --git a/public/core/ocr/pdf-rasterizer.js b/public/core/ocr/pdf-rasterizer.js new file mode 100644 index 0000000..c9efe19 --- /dev/null +++ b/public/core/ocr/pdf-rasterizer.js 
@@ -0,0 +1,138 @@ +import { ConversionError } from "../conversion-error.js"; +import { expandPdfContentForTextExtraction, readPdf } from "../../formats/pdf.js"; + +export const OCR_RASTERIZER_UNAVAILABLE = "OCR_RASTERIZER_UNAVAILABLE"; +export const OCR_RASTERIZER_FAILED = "OCR_RASTERIZER_FAILED"; + +const DEFAULT_SCAN_PDF_THRESHOLD = 300; +const DEFAULT_DPI = 144; + +function textOfBlock(block) { + if (!block || typeof block !== "object") return ""; + if (typeof block.text === "string") return block.text; + if (Array.isArray(block.items)) return block.items.join(" "); + if (Array.isArray(block.rows)) return block.rows.flat().join(" "); + if (typeof block.code === "string") return block.code; + if (typeof block.content === "string") return block.content; + return ""; +} + +function countModelText(model) { + const blocks = Array.isArray(model?.blocks) ? model.blocks : []; + let total = 0; + for (const block of blocks) { + total += textOfBlock(block).replace(/\s+/g, "").length; + } + return total; +} + +const PDFJS_PAYLOAD_MARKER = "% Trans2Former PDFJS_TEXT_START"; + +export async function isScannedPdf(content, options = {}) { + const threshold = typeof options?.scanPdfThreshold === "number" + ? options.scanPdfThreshold + : DEFAULT_SCAN_PDF_THRESHOLD; + try { + const expanded = await expandPdfContentForTextExtraction(content); + const hasPdfJsPayload = typeof expanded === "string" && expanded.includes(PDFJS_PAYLOAD_MARKER); + if (!hasPdfJsPayload) { + return { + scanned: true, + extractedChars: 0, + pageCount: 0, + threshold, + reason: "no-pdfjs-payload", + }; + } + const model = readPdf({ content: expanded, title: "isScannedPdf-probe", fileName: "" }); + const extractedChars = countModelText(model); + const pageCount = Array.isArray(model?.metadata?.pages) ? model.metadata.pages.length : 0; + return { + scanned: extractedChars < threshold, + extractedChars, + pageCount, + threshold, + reason: extractedChars < threshold ? 
"low-extracted-text" : "text-pdf", + }; + } catch (error) { + return { + scanned: true, + extractedChars: 0, + pageCount: 0, + threshold, + reason: `extraction-failed:${error?.code || error?.message || "unknown"}`, + }; + } +} + +function isBrowserRuntime() { + return typeof globalThis !== "undefined" + && typeof globalThis.document?.createElement === "function"; +} + +let _injectedRasterizer = null; +let _autoBrowserImpl = null; +let _autoBrowserLoadFailed = false; + +async function tryLoadBrowserRasterizer() { + if (_autoBrowserImpl) return _autoBrowserImpl; + if (_autoBrowserLoadFailed) return null; + if (!isBrowserRuntime()) { + _autoBrowserLoadFailed = true; + return null; + } + try { + const mod = await import("./pdf-rasterizer-browser.js"); + _autoBrowserImpl = mod.createBrowserPdfPageRasterizer(); + return _autoBrowserImpl; + } catch (error) { + _autoBrowserLoadFailed = true; + return null; + } +} + +function throwUnavailable(operation) { + throw new ConversionError( + `The default PDF rasterizer is unavailable in the current runtime (${operation}). Inject an implementation with setPdfPageRasterizer, or enable the vendored pdfjs on the browser/Tauri side.`, + { + category: "convert", + code: OCR_RASTERIZER_UNAVAILABLE, + details: { reason: "no-runtime-rasterizer", operation }, + }, + ); +} + +export const defaultPdfPageRasterizer = Object.freeze({ + async rasterize(args) { + if (_injectedRasterizer) return _injectedRasterizer.rasterize(args); + const browserImpl = await tryLoadBrowserRasterizer(); + if (browserImpl) return browserImpl.rasterize(args); + throwUnavailable("rasterize"); + }, + async countPages(args) { + if (_injectedRasterizer) return _injectedRasterizer.countPages(args); + const browserImpl = await tryLoadBrowserRasterizer(); + if (browserImpl) return browserImpl.countPages(args); + throwUnavailable("countPages"); + }, + get DPI_DEFAULT() { + return DEFAULT_DPI; + }, +}); + +export function setPdfPageRasterizer(impl) { + if (!impl || typeof impl.rasterize !== "function" || typeof impl.countPages !== "function") { + throw new 
ConversionError("setPdfPageRasterizer requires { rasterize, countPages } functions.", { + category: "validate", + code: "OCR_RASTERIZER_INVALID", + details: { reason: "missing-methods" }, + }); + } + _injectedRasterizer = impl; +} + +export function resetPdfPageRasterizer() { + _injectedRasterizer = null; + _autoBrowserImpl = null; + _autoBrowserLoadFailed = false; +} diff --git a/public/core/ocr/placeholder-engine.js b/public/core/ocr/placeholder-engine.js new file mode 100644 index 0000000..60303bf --- /dev/null +++ b/public/core/ocr/placeholder-engine.js @@ -0,0 +1,23 @@ +import { ConversionError } from "../conversion-error.js"; +import { OCR_UNAVAILABLE } from "./ocr-warnings.js"; + +export const PLACEHOLDER_OCR_MANIFEST_ID = "ocr-text.placeholder.0.1.0"; + +export const placeholderOCREngine = Object.freeze({ + id: "placeholder", + taskCapabilities: ["ocr-text"], + manifestId: PLACEHOLDER_OCR_MANIFEST_ID, + isAvailable() { + return false; + }, + async recognize() { + throw new ConversionError( + "OCR placeholder engine cannot recognize images; enable a real OCR engine first.", + { + category: "convert", + code: OCR_UNAVAILABLE, + details: { engineId: "placeholder", manifestId: PLACEHOLDER_OCR_MANIFEST_ID, reason: "engine-not-enabled" }, + }, + ); + }, +}); diff --git a/public/core/ocr/png-ocr.js b/public/core/ocr/png-ocr.js new file mode 100644 index 0000000..2c64e1c --- /dev/null +++ b/public/core/ocr/png-ocr.js @@ -0,0 +1,161 @@ +import { createParagraph } from "../document-model.js"; +import { withWarnings } from "../warnings.js"; +import { defaultOCRRegistry } from "./ocr-engine.js"; +import { summarizeOCRResult } from "./ocr-result.js"; +import { + createOCRUnavailableWarning, + createOCREngineFailedWarning, + createOCRLowConfidenceWarning, +} from "./ocr-warnings.js"; + +const LOW_CONFIDENCE_THRESHOLD = 0.6; + +function findFirstImageAsset(model) { + const blocks = Array.isArray(model?.blocks) ? 
model.blocks : []; + for (const block of blocks) { + if (block?.type === "asset" || block?.type === "image") return block; + } + return null; +} + +function resolveAssetData(model, assetBlock) { + if (assetBlock?.src) return assetBlock.src; + const assets = Array.isArray(model?.assets) ? model.assets : []; + const assetId = assetBlock?.assetId; + if (!assetId) return null; + const found = assets.find((entry) => entry?.id === assetId); + return found?.data || null; +} + +function cloneModel(model) { + return { + ...model, + blocks: [...(model.blocks || [])], + assets: [...(model.assets || [])], + metadata: { ...(model.metadata || {}) }, + }; +} + +function paragraphsFromOCR(result) { + const pages = Array.isArray(result?.pages) ? result.pages : []; + const paragraphs = []; + for (const page of pages) { + const lines = Array.isArray(page.lines) ? page.lines : []; + const text = lines.map((line) => line.text).filter(Boolean).join("\n"); + if (text.trim().length > 0) paragraphs.push(createParagraph(text)); + } + if (paragraphs.length === 0 && typeof result?.fullText === "string" && result.fullText.trim().length > 0) { + paragraphs.push(createParagraph(result.fullText)); + } + return paragraphs; +} + +export async function enhanceWithOCR(model, { engine = null, registry = defaultOCRRegistry } = {}) { + const resolvedEngine = engine || registry.pickForTask("ocr-text"); + if (!resolvedEngine || !resolvedEngine.isAvailable()) { + const next = cloneModel(model); + next.metadata = withWarnings(next.metadata, [ + createOCRUnavailableWarning({ + engineId: resolvedEngine?.id || "none", + manifestId: resolvedEngine?.manifestId || "", + reason: resolvedEngine ? 
"engine-not-enabled" : "no-engine-registered", + }), + ]); + return next; + } + + const assetBlock = findFirstImageAsset(model); + if (!assetBlock) { + const next = cloneModel(model); + next.metadata = withWarnings(next.metadata, [ + createOCREngineFailedWarning({ + engineId: resolvedEngine.id, + manifestId: resolvedEngine.manifestId || "", + reason: "no-image-asset", + }), + ]); + return next; + } + + const image = resolveAssetData(model, assetBlock); + if (!image) { + const next = cloneModel(model); + next.metadata = withWarnings(next.metadata, [ + createOCREngineFailedWarning({ + engineId: resolvedEngine.id, + manifestId: resolvedEngine.manifestId || "", + reason: "asset-data-missing", + }), + ]); + return next; + } + + let result; + try { + result = await resolvedEngine.recognize({ image, options: { language: "chi_sim" } }); + } catch (error) { + const next = cloneModel(model); + next.metadata = withWarnings(next.metadata, [ + createOCREngineFailedWarning({ + engineId: resolvedEngine.id, + manifestId: resolvedEngine.manifestId || "", + reason: error?.code || "recognize-threw", + cause: error?.message || String(error), + }), + ]); + return next; + } + + const paragraphs = paragraphsFromOCR(result); + const ocrWarnings = []; + if (typeof result?.averageConfidence === "number" && result.averageConfidence < LOW_CONFIDENCE_THRESHOLD) { + ocrWarnings.push(createOCRLowConfidenceWarning({ + averageConfidence: result.averageConfidence, + threshold: LOW_CONFIDENCE_THRESHOLD, + engineId: resolvedEngine.id, + })); + } + + const enhanced = cloneModel(model); + const appendedStart = enhanced.blocks.length; + enhanced.blocks = [...enhanced.blocks, ...paragraphs]; + enhanced.metadata = withWarnings(enhanced.metadata, ocrWarnings); + enhanced.metadata.modelReview = { + ...(enhanced.metadata.modelReview || {}), + engine: resolvedEngine.id, + modelVersion: result?.modelVersion || "", + tasks: Array.from(new Set([...(enhanced.metadata.modelReview?.tasks || []), 
"ocr-text-recognition"])), + inferenceMode: "local", + ocr: summarizeOCRResult(result), + }; + enhanced.metadata.ocr = collectLineMetadata(result, enhanced.blocks, appendedStart); + return enhanced; +} + +function collectLineMetadata(result, blocks, appendedStart) { + const pages = Array.isArray(result?.pages) ? result.pages : []; + const lines = []; + pages.forEach((page, pageIndex) => { + const pageLines = Array.isArray(page.lines) ? page.lines : []; + pageLines.forEach((line, lineIndex) => { + // Each appended paragraph corresponds to a page; pick the block that + // received the paragraph for this page so repair candidates can refer + // back to it by id. + const block = blocks[appendedStart + pageIndex]; + lines.push({ + pageIndex, + lineIndex, + text: line.text || "", + confidence: typeof line.confidence === "number" ? line.confidence : 0, + bbox: line.bbox || null, + blockId: block?.id || "", + }); + }); + }); + return { + language: result?.language || "auto", + pageCount: pages.length, + lineCount: lines.length, + lines, + }; +} diff --git a/public/core/ocr/scan-pdf-stage.js b/public/core/ocr/scan-pdf-stage.js new file mode 100644 index 0000000..9004695 --- /dev/null +++ b/public/core/ocr/scan-pdf-stage.js @@ -0,0 +1,197 @@ +import { createParagraph } from "../document-model.js"; +import { createWarning, withWarnings } from "../warnings.js"; +import { defaultOCRRegistry } from "./ocr-engine.js"; +import { createOCREngineFailedWarning, createOCRUnavailableWarning, createOCRLowConfidenceWarning } from "./ocr-warnings.js"; +import { defaultPdfPageRasterizer } from "./pdf-rasterizer.js"; +import { mergeOCRResultsToFixedLayout } from "./ocr-to-fixed-layout.js"; +import { fixedLayoutToSemantic } from "../models/mappers.js"; +import { getFixedLayoutSummary } from "../models/fixed-layout.js"; + +export const MODEL_VISUAL_FIDELITY_LOST = "MODEL_VISUAL_FIDELITY_LOST"; +export const MODEL_TEXT_ORDER_HEURISTIC = "MODEL_TEXT_ORDER_HEURISTIC"; + +const 
DEFAULT_MAX_SCAN_PAGES = 5; +const DEFAULT_DPI = 144; +const LOW_CONFIDENCE_THRESHOLD = 0.6; + +function cloneModel(model) { + return { + ...model, + blocks: [...(model.blocks || [])], + assets: [...(model.assets || [])], + metadata: { ...(model.metadata || {}) }, + }; +} + +function paragraphsFromPageResult(result) { + const pages = Array.isArray(result?.pages) ? result.pages : []; + const paragraphs = []; + for (const page of pages) { + const lines = Array.isArray(page.lines) ? page.lines : []; + const text = lines.map((line) => line.text).filter(Boolean).join("\n"); + if (text.trim().length > 0) paragraphs.push(createParagraph(text)); + } + if (paragraphs.length === 0 && typeof result?.fullText === "string" && result.fullText.trim().length > 0) { + paragraphs.push(createParagraph(result.fullText)); + } + return paragraphs; +} + +export async function runScannedPdfOCRStage(model, ctx = {}) { + if (ctx?.options?.ocr?.enabled === false) return model; + const registry = ctx.ocrRegistry || defaultOCRRegistry; + const engine = ctx.ocrEngine || registry.pickForTask("ocr-text"); + if (!engine || !engine.isAvailable()) { + return { + ...model, + metadata: withWarnings(model.metadata || {}, [ + createOCRUnavailableWarning({ + engineId: engine?.id || "none", + manifestId: engine?.manifestId || "", + reason: engine ? "engine-not-enabled" : "no-engine-registered", + task: "ocr-text", + }), + ]), + }; + } + const rasterizer = ctx.rasterizer || defaultPdfPageRasterizer; + const maxPages = typeof ctx?.options?.ocr?.maxScanPages === "number" + ? ctx.options.ocr.maxScanPages + : DEFAULT_MAX_SCAN_PAGES; + const dpi = typeof ctx?.options?.ocr?.dpi === "number" ? 
ctx.options.ocr.dpi : DEFAULT_DPI; + + let pageCount; + try { + pageCount = await rasterizer.countPages({ content: ctx.content }); + } catch (error) { + return { + ...model, + metadata: withWarnings(model.metadata || {}, [ + createOCREngineFailedWarning({ + engineId: engine.id, + manifestId: engine.manifestId || "", + reason: error?.code || "rasterizer-count-pages-failed", + cause: error?.message || String(error), + }), + ]), + }; + } + const effectivePages = Math.min(maxPages, Math.max(1, pageCount || 0)); + if (effectivePages === 0) return model; + + const enhanced = cloneModel(model); + const lines = []; + const aggregateConfidences = []; + const pageResults = []; + let runtimeMsTotal = 0; + let language = ""; + let modelVersion = ""; + + for (let pageIndex = 0; pageIndex < effectivePages; pageIndex += 1) { + let pageResult; + try { + const rendered = await rasterizer.rasterize({ content: ctx.content, pageIndex, dpi }); + pageResult = await engine.recognize({ image: rendered.dataUrl, options: { language: "chi_sim" } }); + } catch (error) { + enhanced.metadata = withWarnings(enhanced.metadata, [ + createOCREngineFailedWarning({ + engineId: engine.id, + manifestId: engine.manifestId || "", + reason: error?.code || "page-stage-failed", + cause: `page=${pageIndex}: ${error?.message || error}`, + }), + ]); + continue; + } + pageResults.push(pageResult); + runtimeMsTotal += pageResult?.runtimeMs || 0; + if (typeof pageResult?.averageConfidence === "number") aggregateConfidences.push(pageResult.averageConfidence); + language = language || pageResult?.language || ""; + modelVersion = modelVersion || pageResult?.modelVersion || ""; + const pageLines = Array.isArray(pageResult?.pages?.[0]?.lines) ? pageResult.pages[0].lines : []; + pageLines.forEach((line, lineIndex) => { + lines.push({ + pageIndex, + lineIndex, + text: line.text || "", + confidence: typeof line.confidence === "number" ? 
line.confidence : 0, + bbox: line.bbox || null, + blockId: "", + }); + }); + } + + const averageConfidence = aggregateConfidences.length > 0 + ? aggregateConfidences.reduce((acc, value) => acc + value, 0) / aggregateConfidences.length + : 0; + + const fixedLayout = mergeOCRResultsToFixedLayout(pageResults, { language, engine: engine.id, modelVersion }); + enhanced.fixedLayout = fixedLayout; + + const appendedStart = enhanced.blocks.length; + if (pageResults.length > 0) { + const semanticFromLayout = fixedLayoutToSemantic(fixedLayout, { + title: enhanced.title || "scan-ocr", + sourceFormat: enhanced.sourceFormat || "pdf", + }); + enhanced.blocks.push(...(semanticFromLayout.blocks || [])); + enhanced.metadata = withWarnings(enhanced.metadata, [ + createWarning( + "info", + MODEL_VISUAL_FIDELITY_LOST, + "扫描 PDF OCR 仅恢复文本,不还原原始版面、字体、图像。质量报告以文本为准。", + { engineId: engine.id, pageCount: pageResults.length }, + ), + createWarning( + "info", + MODEL_TEXT_ORDER_HEURISTIC, + "扫描 PDF 阅读顺序使用 bbox y → x 启发式,未做多栏 / 标题层级推断。", + { engineId: engine.id, readingOrder: "heuristic-yx" }, + ), + ]); + } + + // Re-resolve blockId after the mapper appended blocks; the index never advances, so every OCR line maps to the first appended block + let assignedFromBlockIndex = appendedStart; + for (const ocrLine of lines) { + const blockSlot = enhanced.blocks[assignedFromBlockIndex]; + if (blockSlot) ocrLine.blockId = blockSlot.id || ""; + } + + enhanced.metadata.ocr = { + language: language || "auto", + pageCount: effectivePages, + lineCount: lines.length, + lines, + }; + enhanced.metadata.modelReview = { + ...(enhanced.metadata.modelReview || {}), + engine: engine.id, + modelVersion: modelVersion || "", + tasks: Array.from(new Set([...(enhanced.metadata.modelReview?.tasks || []), "ocr-text-recognition", "scan-pdf-rasterize"])), + inferenceMode: "local", + ocr: { + pageCount: effectivePages, + lineCount: lines.length, + averageConfidence, + runtimeMs: runtimeMsTotal, + engine: engine.id, + modelVersion: modelVersion || "", + language: language || 
"auto", + fullTextLength: lines.reduce((acc, line) => acc + (line.text?.length || 0), 0), + fixedLayout: getFixedLayoutSummary(fixedLayout), + }, + }; + + if (averageConfidence > 0 && averageConfidence < LOW_CONFIDENCE_THRESHOLD) { + enhanced.metadata = withWarnings(enhanced.metadata, [ + createOCRLowConfidenceWarning({ + averageConfidence, + threshold: LOW_CONFIDENCE_THRESHOLD, + engineId: engine.id, + }), + ]); + } + + return enhanced; +} diff --git a/public/core/ocr/tesseract-bootstrap.js b/public/core/ocr/tesseract-bootstrap.js new file mode 100644 index 0000000..386a5d9 --- /dev/null +++ b/public/core/ocr/tesseract-bootstrap.js @@ -0,0 +1,60 @@ +import { + defaultModelCache, + STATUS_NOT_DOWNLOADED, +} from "../model-cache/availability.js"; +import { createModelManifest } from "../model-cache/manifest.js"; +import { defaultOCRRegistry } from "./ocr-engine.js"; +import { ensureOCRBootstrap } from "./ocr-bootstrap.js"; +import { tesseractOCREngine, TESSERACT_MANIFEST_ID } from "./tesseract-engine.js"; + +let bootstrapped = false; + +export function ensureTesseractBootstrap() { + if (bootstrapped) return; + bootstrapped = true; + + // Tesseract bootstrap must run after the placeholder bootstrap, so the + // pickForTask fallback order ends on a tesseract entry (with isAvailable() + // still false in P9-A.2). 
This guarantees that both the placeholder and tesseract entries are in the registry. + ensureOCRBootstrap(); + + if (!defaultOCRRegistry.has(tesseractOCREngine.id)) { + defaultOCRRegistry.register(tesseractOCREngine); + } + + if (!defaultModelCache.has(TESSERACT_MANIFEST_ID)) { + const manifest = createModelManifest({ + manifestId: TESSERACT_MANIFEST_ID, + task: "ocr-text", + engine: "tesseract", + modelVersion: "5.0.0", + bundleSize: 12 * 1024 * 1024, + quantization: "none", + minMemoryMB: 256, + sources: [ + { kind: "vendor-bundle", path: "public/vendor/tesseract/" }, + { kind: "user-provided", path: "tessdata via 安全中心 → 导入" }, + ], + checksums: { + algorithm: "SHA-256", + digest: "f".repeat(64), + perFile: {}, + }, + fallback: { + onFailure: "use-degraded-route", + message: "Tesseract.js 5.x 占位 manifest;tessdata 与运行时接入留给 P9-A.2.b。", + }, + ui: { + label: "Tesseract.js OCR", + description: "本地轻量 OCR runtime;启用前需在安全中心导入 tessdata (.traineddata)。", + enableHint: "首次启用时本地选择 chi_sim.traineddata / eng.traineddata,写入本地缓存后激活。", + }, + }); + defaultModelCache.register(manifest); + defaultModelCache.setStatus(TESSERACT_MANIFEST_ID, STATUS_NOT_DOWNLOADED, { + message: "等待 P9-A.2.b 通过安全中心导入 tessdata;vendor wasm 已就位但 tessdata 尚未提供。", + }); + } +} + +ensureTesseractBootstrap(); diff --git a/public/core/ocr/tesseract-engine.js b/public/core/ocr/tesseract-engine.js new file mode 100644 index 0000000..55a06ee --- /dev/null +++ b/public/core/ocr/tesseract-engine.js @@ -0,0 +1,122 @@ +import { ConversionError } from "../conversion-error.js"; +import { defaultOCRStorage } from "./ocr-storage.js"; +import { + OCR_ENGINE_FAILED, + OCR_UNAVAILABLE, +} from "./ocr-warnings.js"; +import { + createTesseractWorker, + disposeWorker, + loadTesseractRuntime, + runRecognize, +} from "./tesseract-runtime.js"; + +export const TESSERACT_MANIFEST_ID = "ocr-text.tesseract.5.0.0"; + +const TESSDATA_KEY_PREFIX = "tesseract/"; +const DEFAULT_LANGUAGES = ["chi_sim", "eng"]; + +function vendorReady() { + return 
Boolean(globalThis.__t2fTesseractVendorReady); +} + +async function hasAnyTessdata(storage, languages) { + for (const language of languages) { + if (await storage.has(`${TESSDATA_KEY_PREFIX}${language}.traineddata`)) { + return language; + } + } + return null; +} + +export const tesseractOCREngine = Object.seal({ + id: "tesseract-zh-en", + taskCapabilities: ["ocr-text"], + manifestId: TESSERACT_MANIFEST_ID, + + // OCREngineRegistry expects a synchronous isAvailable. We expose a synchronous + // signature backed by a cached probe; the probe is updated by ensureProbe() + // before any recognize() call. The engine is sealed rather than frozen so that + // ensureProbe() can write _tessdataReady (a frozen object would throw in a module). + isAvailable() { + if (!vendorReady()) return false; + return Boolean(tesseractOCREngine._tessdataReady); + }, + + _tessdataReady: false, + _storage: defaultOCRStorage, + + async ensureProbe() { + if (!vendorReady()) { + this._tessdataReady = false; + return false; + } + const language = await hasAnyTessdata(this._storage, DEFAULT_LANGUAGES); + this._tessdataReady = Boolean(language); + return this._tessdataReady; + }, + + async recognize({ image, options } = {}) { + if (!vendorReady()) { + throw new ConversionError( + "Tesseract vendor 资源未就位,无法执行 OCR。请运行 `npm run vendor:tesseract` 同步本地资源。", + { + category: "convert", + code: OCR_UNAVAILABLE, + details: { engineId: "tesseract-zh-en", manifestId: TESSERACT_MANIFEST_ID, reason: "vendor-not-ready" }, + }, + ); + } + const language = await hasAnyTessdata(this._storage, options?.languages || DEFAULT_LANGUAGES); + if (!language) { + throw new ConversionError( + "未在本地缓存中找到 tessdata;请先在安全中心导入 .traineddata 文件。", + { + category: "convert", + code: OCR_UNAVAILABLE, + details: { engineId: "tesseract-zh-en", manifestId: TESSERACT_MANIFEST_ID, reason: "tessdata-missing" }, + }, + ); + } + // Reject a missing image before loading the runtime, so callers get an explicit + // validation error instead of a failure inside the worker. + if (!image) { + throw new ConversionError("OCR 输入图像缺失。", { + category: "validate", 
+ code: OCR_ENGINE_FAILED, + details: { engineId: "tesseract-zh-en", reason: "missing-image" }, + }); + } + const namespace = await loadTesseractRuntime(); + const tessdataBuffer = await this._storage.get(`${TESSDATA_KEY_PREFIX}${language}.traineddata`); + if (!tessdataBuffer) { + throw new ConversionError( + `tessdata for ${language} 在导入流程后未被读取到;请重新导入 .traineddata 文件。`, + { + category: "convert", + code: OCR_UNAVAILABLE, + details: { engineId: "tesseract-zh-en", manifestId: TESSERACT_MANIFEST_ID, reason: "tessdata-read-failed", language }, + }, + ); + } + let worker = null; + try { + worker = await createTesseractWorker({ namespace, language, tessdataBuffer }); + const result = await runRecognize(worker, image, { ...(options || {}), language }); + return result; + } catch (error) { + if (error instanceof ConversionError) throw error; + throw new ConversionError(`Tesseract recognize 失败:${error?.message || error}`, { + category: "convert", + code: OCR_ENGINE_FAILED, + details: { engineId: "tesseract-zh-en", reason: "recognize-failed", cause: String(error?.name || error?.message || "unknown") }, + }); + } finally { + await disposeWorker(worker); + } + }, +}); + +export function markTesseractVendorReady(ready = true) { + globalThis.__t2fTesseractVendorReady = Boolean(ready); +} diff --git a/public/core/ocr/tesseract-runtime.js b/public/core/ocr/tesseract-runtime.js new file mode 100644 index 0000000..975222d --- /dev/null +++ b/public/core/ocr/tesseract-runtime.js @@ -0,0 +1,182 @@ +import { ConversionError } from "../conversion-error.js"; +import { createOCRResult } from "./ocr-result.js"; + +export const OCR_VENDOR_LOAD_FAILED = "OCR_VENDOR_LOAD_FAILED"; +export const TESSERACT_VENDOR_PATHS = Object.freeze({ + corePath: "/vendor/tesseract/core/", + workerPath: "/vendor/tesseract/worker/worker.min.js", + mainBundle: "/vendor/tesseract/core/tesseract.min.js", +}); + +let cachedNamespace = null; + +function resolveGlobalTesseract() { + if (typeof globalThis === 
"undefined") return null; + return globalThis.Tesseract || null; +} + +export async function loadTesseractRuntime() { + if (cachedNamespace) return cachedNamespace; + const existing = resolveGlobalTesseract(); + if (existing && typeof existing.createWorker === "function") { + cachedNamespace = existing; + return cachedNamespace; + } + try { + const mod = await import(/* @vite-ignore */ TESSERACT_VENDOR_PATHS.mainBundle); + if (mod?.createWorker) { + cachedNamespace = mod; + return cachedNamespace; + } + const fallback = resolveGlobalTesseract(); + if (fallback && typeof fallback.createWorker === "function") { + cachedNamespace = fallback; + return cachedNamespace; + } + } catch (error) { + throw new ConversionError( + `Tesseract.js vendor bundle 加载失败:${error?.message || error}`, + { + category: "convert", + code: OCR_VENDOR_LOAD_FAILED, + details: { + path: TESSERACT_VENDOR_PATHS.mainBundle, + cause: String(error?.name || error?.message || "unknown"), + }, + }, + ); + } + throw new ConversionError( + "Tesseract.js vendor 入口未导出 createWorker;请检查 vendor 同步与版本兼容。", + { + category: "convert", + code: OCR_VENDOR_LOAD_FAILED, + details: { path: TESSERACT_VENDOR_PATHS.mainBundle, reason: "missing-createWorker" }, + }, + ); +} + +function bufferToBlobUrl(buffer) { + const blob = new Blob([buffer], { type: "application/octet-stream" }); + return URL.createObjectURL(blob); +} + +export async function createTesseractWorker({ + namespace, + language, + tessdataBuffer, + vendorPaths = TESSERACT_VENDOR_PATHS, +} = {}) { + if (!namespace || typeof namespace.createWorker !== "function") { + throw new ConversionError("Tesseract namespace 未就位,无法创建 worker。", { + category: "convert", + code: OCR_VENDOR_LOAD_FAILED, + details: { reason: "namespace-missing" }, + }); + } + if (!language) { + throw new ConversionError("createTesseractWorker requires a language code.", { + category: "validate", + code: "OCR_ENGINE_INVALID", + details: { reason: "missing-language" }, + }); + } + if 
(!tessdataBuffer) { + throw new ConversionError("createTesseractWorker requires a tessdata buffer.", { + category: "validate", + code: "OCR_ENGINE_INVALID", + details: { reason: "missing-tessdata" }, + }); + } + const tessdataUrl = bufferToBlobUrl(tessdataBuffer); + try { + const worker = await namespace.createWorker(language, undefined, { + corePath: vendorPaths.corePath, + workerPath: vendorPaths.workerPath, + langPath: tessdataUrl.replace(/\/[^/]+$/, "/"), + cacheMethod: "none", + logger: () => {}, + }); + worker.__t2fTessdataUrl = tessdataUrl; + return worker; + } catch (error) { + URL.revokeObjectURL(tessdataUrl); + throw new ConversionError(`Tesseract worker 初始化失败:${error?.message || error}`, { + category: "convert", + code: "OCR_ENGINE_FAILED", + details: { reason: "worker-init-failed", cause: String(error?.name || error?.message || "unknown") }, + }); + } +} + +export async function runRecognize(worker, image, options = {}) { + if (!worker || typeof worker.recognize !== "function") { + throw new ConversionError("Tesseract worker 缺少 recognize 方法。", { + category: "convert", + code: "OCR_ENGINE_FAILED", + details: { reason: "worker-invalid" }, + }); + } + const startedAt = Date.now(); + try { + const response = await worker.recognize(image, options); + const runtimeMs = Date.now() - startedAt; + return mapTesseractResultToOCR(response, runtimeMs, options); + } catch (error) { + throw new ConversionError(`Tesseract recognize 失败:${error?.message || error}`, { + category: "convert", + code: "OCR_ENGINE_FAILED", + details: { reason: "recognize-failed", cause: String(error?.name || error?.message || "unknown") }, + }); + } finally { + if (worker.__t2fTessdataUrl) { + try { URL.revokeObjectURL(worker.__t2fTessdataUrl); } catch (revokeError) { /* ignore */ } + worker.__t2fTessdataUrl = ""; + } + } +} + +function mapTesseractResultToOCR(response, runtimeMs, options) { + const data = response?.data || {}; + const linesRaw = Array.isArray(data.lines) ? 
data.lines : []; + const lines = linesRaw.map((line) => ({ + text: typeof line?.text === "string" ? line.text.trim() : "", + confidence: typeof line?.confidence === "number" ? Math.max(0, Math.min(1, line.confidence / 100)) : 0, + bbox: line?.bbox + ? { + x: line.bbox.x0 ?? 0, + y: line.bbox.y0 ?? 0, + w: (line.bbox.x1 ?? 0) - (line.bbox.x0 ?? 0), + h: (line.bbox.y1 ?? 0) - (line.bbox.y0 ?? 0), + } + : null, + })); + const pageWidth = data?.imageWidth ?? data?.width ?? 0; + const pageHeight = data?.imageHeight ?? data?.height ?? 0; + const averageConfidence = typeof data.confidence === "number" + ? Math.max(0, Math.min(1, data.confidence / 100)) + : 0; + return createOCRResult({ + language: options?.language || "auto", + pages: [ + { + pageIndex: 0, + width: pageWidth, + height: pageHeight, + lines, + }, + ], + fullText: typeof data.text === "string" ? data.text : "", + averageConfidence, + runtimeMs, + engine: "tesseract-zh-en", + modelVersion: "5.x", + warnings: [], + }); +} + +export async function disposeWorker(worker) { + if (worker && typeof worker.terminate === "function") { + try { await worker.terminate(); } catch (error) { /* ignore */ } + } +} diff --git a/public/core/repair-actions.js b/public/core/repair-actions.js new file mode 100644 index 0000000..35387d9 --- /dev/null +++ b/public/core/repair-actions.js @@ -0,0 +1,105 @@ +import { ConversionError } from "./conversion-error.js"; + +export const REPAIR_ACTION_TYPES = Object.freeze([ + "replaceTextRun", + "insertTextRun", + "reorderBlocks", + "restoreTableGrid", + "adjustBoundingBox", + "regeneratePageLayout", + "selectFallbackRoute", +]); + +const REQUIRED_FIELDS = ["actionType", "targetId", "before", "after", "confidence", "evidence"]; + +function isPlainObject(value) { + return Boolean(value) && typeof value === "object" && !Array.isArray(value); +} + +export function createRepairAction({ + actionType, + targetId, + before, + after, + confidence, + evidence, + modelVersion = "", + sourcePage = 
null, + sourceSpan = null, + targetField = null, + fallback = null, +} = {}) { + const action = { + actionType, + targetId, + before, + after, + confidence, + evidence, + modelVersion, + sourcePage, + sourceSpan, + targetField, + fallback, + }; + validateRepairAction(action); + return Object.freeze(action); +} + +export function validateRepairAction(action) { + if (!isPlainObject(action)) { + throw new ConversionError("Repair action must be an object.", { + category: "validate", + code: "REPAIR_ACTION_INVALID", + details: { reason: "not-an-object" }, + }); + } + for (const field of REQUIRED_FIELDS) { + if (action[field] === undefined || action[field] === null) { + throw new ConversionError(`Repair action missing required field: ${field}`, { + category: "validate", + code: "REPAIR_ACTION_INVALID", + details: { reason: "missing-field", field }, + }); + } + } + if (!REPAIR_ACTION_TYPES.includes(action.actionType)) { + throw new ConversionError(`Unknown repair actionType: ${action.actionType}`, { + category: "validate", + code: "REPAIR_ACTION_INVALID", + details: { reason: "unknown-action-type", actionType: action.actionType }, + }); + } + if (typeof action.targetId !== "string" || action.targetId.length === 0) { + throw new ConversionError("Repair action targetId must be a non-empty string.", { + category: "validate", + code: "REPAIR_ACTION_INVALID", + details: { reason: "invalid-target-id" }, + }); + } + if (typeof action.confidence !== "number" || action.confidence < 0 || action.confidence > 1) { + throw new ConversionError("Repair action confidence must be a number in [0, 1].", { + category: "validate", + code: "REPAIR_ACTION_INVALID", + details: { reason: "invalid-confidence", confidence: action.confidence }, + }); + } + if (!isPlainObject(action.evidence)) { + throw new ConversionError("Repair action evidence must be an object.", { + category: "validate", + code: "REPAIR_ACTION_INVALID", + details: { reason: "invalid-evidence" }, + }); + } + return action; +} + 
+export function summarizeAction(action) { + return { + actionType: action.actionType, + targetId: action.targetId, + confidence: action.confidence, + modelVersion: action.modelVersion || "rule-based", + evidenceKeys: Object.keys(action.evidence || {}), + }; +} diff --git a/public/core/repair-engine.js b/public/core/repair-engine.js new file mode 100644 index 0000000..4af2f54 --- /dev/null +++ b/public/core/repair-engine.js @@ -0,0 +1,279 @@ +import { ensureDocumentAudit } from "./document-audit.js"; +import { validateRepairAction, summarizeAction } from "./repair-actions.js"; +import { DEFAULT_HANDLERS } from "./repair-handlers.js"; +import { DEFAULT_VALIDATORS } from "./repair-validators.js"; +import { detectOCRLowConfidence } from "./ocr/ocr-validator.js"; +import { createWarning, withWarnings } from "./warnings.js"; + +export const MIN_CONFIDENCE = 0.6; + +const ROUND_TRIP_FORMATS = new Set(["md", "html", "json", "csv", "txt", "xml"]); + +function isPlainObject(value) { + return Boolean(value) && typeof value === "object" && !Array.isArray(value); +} + +function blockFingerprint(block) { + if (!isPlainObject(block)) return ""; + if (block.type === "heading") return `h${block.level}|${block.text || ""}`; + if (block.type === "paragraph" || block.type === "quote") return `${block.type}|${block.text || ""}`; + if (block.type === "code") return `code|${block.language || ""}|${block.code || ""}`; + if (block.type === "list") return `list|${block.ordered ? 
"ol" : "ul"}|${(block.items || []).join("")}`; + if (block.type === "table") { + return `table|${(block.headers || []).join("")}|${(block.rows || []).map((row) => (row || []).join("")).join("")}`; + } + if (block.type === "image" || block.type === "asset") { + return `${block.type}|${block.src || ""}|${block.alt || ""}|${block.assetId || ""}`; + } + if (block.type === "raw") return `raw|${block.format || ""}|${block.content || ""}`; + return block.type || ""; +} + +function modelFingerprint(model) { + return (model.blocks || []).map(blockFingerprint).join(""); +} + +function summarizeQuality(model) { + const report = model.metadata?.qualityReport || {}; + return { + warningCount: report.warningCount ?? 0, + downgradeCount: report.downgradeCount ?? 0, + structureFidelity: report.structureFidelity ?? "unknown", + }; +} + +function isRoundTripEligible(from, to) { + return ROUND_TRIP_FORMATS.has(from) && ROUND_TRIP_FORMATS.has(to) && from === to; +} + +export class RepairEngine { + constructor() { + this.validators = []; + this.handlers = new Map(); + } + + registerValidator(fn) { + if (typeof fn !== "function") throw new TypeError("RepairEngine validator must be a function."); + this.validators.push(fn); + } + + registerHandler(actionType, fn) { + if (typeof actionType !== "string" || actionType.length === 0) { + throw new TypeError("RepairEngine handler requires a non-empty actionType."); + } + if (typeof fn !== "function") { + throw new TypeError("RepairEngine handler must be a function."); + } + this.handlers.set(actionType, fn); + } + + hasHandler(actionType) { + return this.handlers.has(actionType); + } + + proposeActions(model, ctx) { + const proposed = []; + for (const validator of this.validators) { + let actions = []; + try { + actions = validator(model, ctx) || []; + } catch (error) { + actions = []; + } + for (const action of actions) { + try { + validateRepairAction(action); + proposed.push(action); + } catch (error) { + // Malformed validator output - 
skip without breaking the cycle
+        }
+      }
+    }
+    return proposed;
+  }
+
+  applyActions({ model, actions, output, ctx }) {
+    const applied = [];
+    const rejected = [];
+    const recommendations = [];
+    let currentModel = model;
+    let currentOutput = output;
+    let fallbackTo = null;
+    let fallbackApplied = false;
+
+    for (const action of actions) {
+      if (action.confidence < MIN_CONFIDENCE) {
+        rejected.push({ ...summarizeAction(action), note: "below-min-confidence" });
+        continue;
+      }
+      const handler = this.handlers.get(action.actionType);
+      if (!handler) {
+        rejected.push({ ...summarizeAction(action), note: "no-handler" });
+        continue;
+      }
+      let result;
+      try {
+        result = handler({ model: currentModel, action, context: ctx }) || { ok: false, note: "handler-returned-empty" };
+      } catch (error) {
+        rejected.push({ ...summarizeAction(action), note: `handler-error:${error?.code || error?.message || "unknown"}` });
+        continue;
+      }
+      if (!result.ok) {
+        rejected.push({ ...summarizeAction(action), note: result.note || "handler-rejected" });
+        continue;
+      }
+      if (result.outputOverride !== undefined) {
+        currentOutput = result.outputOverride;
+        fallbackTo = result.fallbackTo || fallbackTo;
+        fallbackApplied = true;
+        currentModel = result.model || currentModel;
+      } else if (result.fallbackRecommended) {
+        recommendations.push({ ...summarizeAction(action), fallbackTo: result.fallbackTo, note: result.note || "fallback-recommended" });
+      } else {
+        currentModel = result.model || currentModel;
+      }
+      applied.push({ ...summarizeAction(action), note: result.note || "applied" });
+    }
+
+    return { model: currentModel, output: currentOutput, applied, rejected, recommendations, fallbackTo, fallbackApplied };
+  }
+
+  reverifyModel({ before, after, ctx }) {
+    const refreshed = ensureDocumentAudit(after, {
+      content: ctx?.content || "",
+      reader: ctx?.from || "",
+      writer: ctx?.to || "",
+      targetFormat: ctx?.to || "",
+      fileName: ctx?.fileName || "",
+      options: ctx?.options || {},
+    });
+    const beforeQuality = summarizeQuality(before);
+    const afterQuality = summarizeQuality(refreshed);
+    const verified = afterQuality.warningCount <= beforeQuality.warningCount
+      && afterQuality.downgradeCount <= beforeQuality.downgradeCount;
+    return { refreshed, beforeQuality, afterQuality, verified };
+  }
+
+  reverifyRoundTrip({ output, model, ctx }) {
+    if (!ctx || typeof ctx.read !== "function") {
+      return { eligible: false, reason: "no-read-hook" };
+    }
+    if (!isRoundTripEligible(ctx.from, ctx.to)) {
+      return { eligible: false, reason: "format-not-round-trip-safe" };
+    }
+    const payload = output?.data;
+    if (typeof payload !== "string") {
+      return { eligible: false, reason: "output-not-string" };
+    }
+    let readBack;
+    try {
+      readBack = ctx.read({ content: payload, from: ctx.to, title: ctx.title || "round-trip" });
+    } catch (error) {
+      return {
+        eligible: true,
+        ok: false,
+        reason: `read-back-failed:${error?.code || error?.message || "unknown"}`,
+      };
+    }
+    const originalFingerprint = modelFingerprint(model);
+    const readBackFingerprint = modelFingerprint(readBack);
+    const ok = originalFingerprint === readBackFingerprint;
+    return {
+      eligible: true,
+      ok,
+      blockCountDelta: (readBack.blocks?.length || 0) - (model.blocks?.length || 0),
+      fingerprintMatch: ok,
+    };
+  }
+
+  runCycle({ model, output, ctx }) {
+    const validatorContext = ctx || {};
+    const proposed = this.proposeActions(model, validatorContext);
+    const modelReview = {
+      engine: "rule-based",
+      modelVersion: "s2-bootstrap",
+      tasks: ["lossy-warning-scan", "route-class-check"],
+      inferenceMode: "local",
+      runtimeMs: 0,
+      device: "cpu",
+    };
+
+    if (proposed.length === 0) {
+      const roundTrip = this.reverifyRoundTrip({ output, model, ctx: validatorContext });
+      return {
+        model,
+        output,
+        autoRepair: {
+          attempted: false,
+          appliedActions: [],
+          rejectedActions: [],
+          fallbackUsed: false,
+          fallbackTo: null,
+          postRepairVerified: true,
+          roundTripDelta: roundTrip.eligible ? { ok: roundTrip.ok, blockCountDelta: roundTrip.blockCountDelta ?? 0 } : { ok: null, skipped: roundTrip.reason },
+          finalDecision: "verified",
+        },
+        modelReview,
+      };
+    }
+
+    const applyResult = this.applyActions({ model, actions: proposed, output, ctx: validatorContext });
+    const fallbackUsed = applyResult.fallbackApplied === true;
+    const verification = this.reverifyModel({ before: model, after: applyResult.model, ctx: validatorContext });
+    const roundTrip = this.reverifyRoundTrip({ output: applyResult.output, model: verification.refreshed, ctx: validatorContext });
+    const postRepairVerified = verification.verified && (!roundTrip.eligible || roundTrip.ok !== false);
+
+    let finalDecision;
+    if (applyResult.applied.length === 0) {
+      finalDecision = "degraded";
+    } else if (postRepairVerified) {
+      finalDecision = "verified";
+    } else {
+      finalDecision = "failed-quality-gate";
+    }
+
+    return {
+      model: verification.refreshed,
+      output: applyResult.output,
+      autoRepair: {
+        attempted: true,
+        appliedActions: applyResult.applied,
+        rejectedActions: applyResult.rejected,
+        recommendations: applyResult.recommendations,
+        fallbackUsed,
+        fallbackTo: applyResult.fallbackTo,
+        postRepairVerified,
+        roundTripDelta: roundTrip.eligible ? { ok: roundTrip.ok, blockCountDelta: roundTrip.blockCountDelta ?? 0 } : { ok: null, skipped: roundTrip.reason },
+        beforeQuality: verification.beforeQuality,
+        afterQuality: verification.afterQuality,
+        finalDecision,
+      },
+      modelReview,
+    };
+  }
+}
+
+export function createDefaultRepairEngine() {
+  const engine = new RepairEngine();
+  for (const validator of DEFAULT_VALIDATORS) {
+    engine.registerValidator(validator);
+  }
+  engine.registerValidator(detectOCRLowConfidence);
+  for (const [actionType, handler] of Object.entries(DEFAULT_HANDLERS)) {
+    engine.registerHandler(actionType, handler);
+  }
+  return engine;
+}
+
+export const defaultRepairEngine = createDefaultRepairEngine();
+
+export function annotateRoundTripSkipped(metadataWarnings, reason) {
+  return withWarnings({ warnings: metadataWarnings || [] }, [
+    createWarning(
+      "info",
+      "ROUND_TRIP_NOT_ENABLED",
+      `Round-trip verification not enabled for this path: ${reason}.`,
+      { reason },
+    ),
+  ]).warnings;
+}
diff --git a/public/core/repair-handlers.js b/public/core/repair-handlers.js
new file mode 100644
index 0000000..ec0c9d8
--- /dev/null
+++ b/public/core/repair-handlers.js
@@ -0,0 +1,127 @@
+import { REPAIR_ACTION_TYPES } from "./repair-actions.js";
+
+function isPlainObject(value) {
+  return Boolean(value) && typeof value === "object" && !Array.isArray(value);
+}
+
+function cloneModel(model) {
+  return JSON.parse(JSON.stringify(model));
+}
+
+function defaultFieldFor(blockType) {
+  if (blockType === "heading" || blockType === "paragraph" || blockType === "quote") return "text";
+  if (blockType === "code") return "code";
+  return null;
+}
+
+function findBlock(model, targetId) {
+  return (model.blocks || []).find((block) => block.id === targetId);
+}
+
+function replaceWithinString(value, before, after) {
+  if (typeof value !== "string" || typeof before !== "string" || before.length === 0 || typeof after !== "string" || !value.includes(before)) return null;
+  return value.split(before).join(after);
+}
+
+function applyReplaceTextRun({ model, action }) {
+  const block = findBlock(model, action.targetId);
+  if (!block) return { ok: false, model, note: "target-block-not-found" };
+
+  const cloned = cloneModel(model);
+  const clonedBlock = findBlock(cloned, action.targetId);
+
+  const listMatch = typeof action.targetField === "string"
+    ? action.targetField.match(/^items\[(\d+)\]$/)
+    : null;
+
+  if (listMatch) {
+    const index = Number(listMatch[1]);
+    if (!Array.isArray(clonedBlock.items) || index >= clonedBlock.items.length) {
+      return { ok: false, model, note: "field-out-of-bounds" };
+    }
+    const replaced = replaceWithinString(clonedBlock.items[index], action.before, action.after);
+    if (replaced === null) return { ok: false, model, note: "before-not-found" };
+    clonedBlock.items[index] = replaced;
+    return { ok: true, model: cloned, note: `replaced-items[${index}]` };
+  }
+
+  const field = action.targetField || defaultFieldFor(clonedBlock.type);
+  if (!field) return { ok: false, model, note: "no-suitable-field" };
+  const replaced = replaceWithinString(clonedBlock[field], action.before, action.after);
+  if (replaced === null) return { ok: false, model, note: "before-not-found" };
+  clonedBlock[field] = replaced;
+  return { ok: true, model: cloned, note: `replaced-${field}` };
+}
+
+function applySelectFallbackRoute({ model, action, context }) {
+  if (!isPlainObject(action.fallback) || typeof action.fallback.to !== "string") {
+    return { ok: false, model, note: "missing-fallback-target" };
+  }
+  const fallbackTo = action.fallback.to;
+  if (!context || typeof context.prepareConversionModel !== "function" || typeof context.write !== "function") {
+    return { ok: false, model, note: "registry-handles-missing" };
+  }
+  if (fallbackTo === context.to) {
+    return { ok: false, model, note: "fallback-equals-original" };
+  }
+  const applyFallback = context.options?.repair?.applyFallback === true;
+  if (!applyFallback) {
+    return {
+      ok: true,
+      model,
+      fallbackTo,
+      fallbackRecommended: true,
+      note: `fallback-recommended:${fallbackTo}`,
+    };
+  }
+  try {
+    const fallbackModel = context.prepareConversionModel({
+      content: context.content,
+      from: context.from,
+      to: fallbackTo,
+      title: context.title,
+      fileName: context.fileName,
+      options: { ...(context.options || {}), repair: false },
+    });
+    const fallbackOutput = context.write({
+      model: fallbackModel,
+      to: fallbackTo,
+      title: context.title,
+      options: { ...(context.options || {}), repair: false },
+    });
+    return {
+      ok: true,
+      model: fallbackModel,
+      outputOverride: fallbackOutput,
+      fallbackTo,
+      fallbackApplied: true,
+      note: `fallback-to-${fallbackTo}`,
+    };
+  } catch (error) {
+    return {
+      ok: false,
+      model,
+      note: `fallback-route-failed:${error?.code || error?.message || "unknown"}`,
+    };
+  }
+}
+
+function placeholderHandler(label) {
+  return ({ model }) => ({ ok: false, model, note: `handler-not-implemented:${label}` });
+}
+
+export const DEFAULT_HANDLERS = Object.freeze({
+  replaceTextRun: applyReplaceTextRun,
+  insertTextRun: placeholderHandler("insertTextRun"),
+  reorderBlocks: placeholderHandler("reorderBlocks"),
+  restoreTableGrid: placeholderHandler("restoreTableGrid"),
+  adjustBoundingBox: placeholderHandler("adjustBoundingBox"),
+  regeneratePageLayout: placeholderHandler("regeneratePageLayout"),
+  selectFallbackRoute: applySelectFallbackRoute,
+});
+
+for (const actionType of REPAIR_ACTION_TYPES) {
+  if (!DEFAULT_HANDLERS[actionType]) {
+    throw new Error(`Repair handler registry missing entry for ${actionType}`);
+  }
+}
diff --git a/public/core/repair-validators.js b/public/core/repair-validators.js
new file mode 100644
index 0000000..15c7590
--- /dev/null
+++ b/public/core/repair-validators.js
@@ -0,0 +1,72 @@
+import { createRepairAction, REPAIR_ACTION_TYPES } from "./repair-actions.js";
+
+function isPlainObject(value) {
+  return Boolean(value) && typeof value === "object" && !Array.isArray(value);
+}
+
+export function detectLossyRepairHints(model) {
+  const warnings = Array.isArray(model.metadata?.warnings) ? model.metadata.warnings : [];
+  const actions = [];
+  for (const warning of warnings) {
+    if (!isPlainObject(warning) || warning.severity !== "lossy") continue;
+    const hint = warning.details?.repairAction;
+    if (!isPlainObject(hint)) continue;
+    if (!REPAIR_ACTION_TYPES.includes(hint.actionType)) continue;
+    try {
+      actions.push(createRepairAction({
+        actionType: hint.actionType,
+        targetId: hint.targetId,
+        before: hint.before,
+        after: hint.after,
+        confidence: typeof hint.confidence === "number" ? hint.confidence : 0.7,
+        evidence: { source: "lossy-warning", warningCode: warning.code, ...(hint.evidence || {}) },
+        modelVersion: hint.modelVersion || "rule-based",
+        sourcePage: hint.sourcePage ?? null,
+        sourceSpan: hint.sourceSpan ?? null,
+        targetField: hint.targetField ?? null,
+        fallback: hint.fallback ?? null,
+      }));
+    } catch (error) {
+      // Malformed hint - skip silently; validator surface is best-effort
+    }
+  }
+  return actions;
+}
+
+export function detectRouteClassDegradation(model, ctx) {
+  const routeClass = model.metadata?.conversion?.routeClass;
+  if (!["generated", "restricted"].includes(routeClass)) return [];
+  if (!ctx
+    || typeof ctx.from !== "string"
+    || typeof ctx.to !== "string"
+    || typeof ctx.getAllowedOutputFormats !== "function"
+    || typeof ctx.getRouteDetails !== "function") {
+    return [];
+  }
+  const candidates = ctx.getAllowedOutputFormats(ctx.from).filter((target) => target !== ctx.to);
+  const safer = candidates.find((target) => {
+    const details = ctx.getRouteDetails(ctx.from, target);
+    return details && !["generated", "restricted"].includes(details.routeClass);
+  });
+  if (!safer) return [];
+  return [createRepairAction({
+    actionType: "selectFallbackRoute",
+    targetId: `route:${ctx.from}->${ctx.to}`,
+    before: { to: ctx.to, routeClass },
+    after: { to: safer, routeClass: ctx.getRouteDetails(ctx.from, safer)?.routeClass || "recommended" },
+    confidence: 0.8,
+    evidence: {
+      source: "route-class-degradation",
+      from: ctx.from,
+      originalTo: ctx.to,
+      saferTo: safer,
+      originalRouteClass: routeClass,
+    },
+    fallback: { to: safer },
+  })];
+}
+
+export const DEFAULT_VALIDATORS = Object.freeze([
+  detectLossyRepairHints,
+  detectRouteClassDegradation,
+]);
diff --git a/public/formats/png.js b/public/formats/png.js
index ca2c59c..64d6875 100644
--- a/public/formats/png.js
+++ b/public/formats/png.js
@@ -4,6 +4,9 @@ import {
   createHeading,
 } from "../core/document-model.js";
 import { createAssetStore } from "../core/asset-store.js";
+import { withWarnings } from "../core/warnings.js";
+import { defaultOCRRegistry } from "../core/ocr/ocr-engine.js";
+import { createOCRUnavailableWarning } from "../core/ocr/ocr-warnings.js";
 
 export function readPng({ content, title = "image", fileName = "", format = "png" }) {
   const data = String(content ?? "");
@@ -21,6 +24,16 @@ export function readPng({ content, title = "image", fileName = "", format = "png"
     role: "image",
   });
 
+  const ocrEngine = defaultOCRRegistry.pickForTask("ocr-text");
+  const ocrWarnings = [];
+  if (!ocrEngine || !ocrEngine.isAvailable()) {
+    ocrWarnings.push(createOCRUnavailableWarning({
+      engineId: ocrEngine?.id || "none",
+      manifestId: ocrEngine?.manifestId || "",
+      reason: ocrEngine ? "engine-not-enabled" : "no-engine-registered",
+    }));
+  }
+
   return createDocumentModel({
     title,
     sourceFormat: format,
@@ -29,5 +42,6 @@ export function readPng({ content, title = "image", fileName = "", format = "png"
       createAssetReference(asset.id, { alt: title, title }),
     ],
     assets: assetStore.toJSON(),
+    metadata: withWarnings({}, ocrWarnings),
   });
 }
diff --git a/public/index.html b/public/index.html
index b0ce129..5d1f07a 100644
--- a/public/index.html
+++ b/public/index.html
@@ -6,16 +6,19 @@
 Trans2Former + - +
        - -
        -

        Trans2Former

        - 本地优先 · 格式转换 -
        + + +
        +

        Trans2Former

        + 本地优先 · 格式转换 +
        +
        @@ -55,6 +58,8 @@

        Trans2Former

        下载结果 + +
        更多
        @@ -65,7 +70,9 @@

        Trans2Former

        -
        + + +