diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index 5ca6549..3415781 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -47,9 +47,9 @@ src/
 └── worker/ # Web Worker dispatch + self-contained worker entry
 fonts/ # Pre-built font data modules (.js/.d.ts) — 16 scripts + TTF source files
 tools/ # CLI tool (build-font-data.cjs) for converting TTF → importable data modules
-scripts/ # Modular sample PDF generation (24 generators, 140+ PDFs; extreme-shaping.ts added in v1.0.3)
-test-output/extreme/ # Visual regression baselines for extreme scripts (extreme-bidi.pdf, extreme-tamil.pdf, extreme-bengali-devanagari.pdf, extreme-arabic-harakat.pdf)
-tests/ # 1606+ tests (41 files: unit/integration/fuzz/parser) mirroring src/ structure
+scripts/ # Modular sample PDF generation (26 generators, 150+ PDFs; emoji-showcase.ts and pdfa-latin-embedding.ts added in v1.1.0)
+test-output/extreme/ # Visual regression baselines for extreme scripts (extreme-bidi.pdf, extreme-tamil.pdf, extreme-bengali-devanagari.pdf, extreme-arabic-harakat.pdf, extreme-bidi-isolates.pdf)
+tests/ # 1726+ tests (48 files: unit/integration/fuzz/parser) mirroring src/ structure
 bench/ # Performance benchmarks (vitest bench)
 docs/ # GitHub Pages landing site (pdfnative.dev) — pure HTML/CSS/JS, zero build deps
 └── playgrounds/ # Interactive browser playgrounds (extreme-scripts.html, medical-800.html)
@@ -78,7 +78,7 @@ npm run build # tsup → dist/ (ESM + CJS + .d.ts)
 npm run test # vitest run (1588+ tests, 40 files)
 npm run test:watch # vitest (watch mode)
 npm run test:coverage # vitest with v8 coverage (thresholds: 90/80/85/90)
-npm run test:generate # Generate 140+ sample PDFs → test-output/ (incl. extreme/ baselines)
+npm run test:generate # Generate 150+ sample PDFs → test-output/ (incl. extreme/, emoji/, pdfa-latin/ baselines)
 npm run typecheck # tsc --noEmit
 npm run typecheck:tests # tsc --project tsconfig.test.json --noEmit
 npm run typecheck:scripts # tsc --project tsconfig.scripts.json --noEmit
@@ -90,7 +90,7 @@ npm run lint # eslint src/ (ESLint 9 + typescript-eslint strict)
 - Test runner: **vitest** (fast, native ESM, watch mode, v8 coverage)
 - CI: GitHub Actions — lint/typecheck/test/build on Node 22/24
 - Publish: GitHub Actions OIDC with `npm publish --provenance`
-- All new code must have tests. Current: ~95% statement coverage, 1588+ tests (40 files)
+- All new code must have tests. Current: ~95% statement coverage, 1726+ tests (48 files)

 ## Conventions

@@ -133,6 +133,7 @@ npm run lint # eslint src/ (ESLint 9 + typescript-eslint strict)
 - URL validation: only `http:`, `https:`, `mailto:` schemes allowed; `javascript:`, `file:`, `data:` blocked; control characters (U+0000–U+001F, U+007F–U+009F) rejected
 - Color safety: `parseColor()` validates/normalizes hex, tuple, PDF string → safe `"R G B"` output; `normalizeColors()` at layout boundary
 - Color types: `PdfColor = PdfRgbString | PdfRgbTuple | (string & {})` — union preserves autocomplete for template literals
+- BiDi: UAX #9 isolates (LRI U+2066 / RLI U+2067 / FSI U+2068 / PDI U+2069) classified as `BN` and recursed via three-tier dispatcher: public `resolveBidiRuns(text)` finds outermost isolate pairs, internal `resolveBidiRunsForced(text, forcedLevel)` recurses, internal `resolveBidiCore(text, codePoints, cpToStr, forcedLevel?)` runs the W1–W7 / N1–N2 / L2 pipeline. Embeddings (LRE/RLE/LRO/RLO/PDF) deferred to v1.2.
 - BiDi: simplified UAX #9 — paragraph level detection, weak/neutral type resolution, level assignment, L2 paragraph-level run reordering
 - BiDi: General Punctuation (U+2010–U+2027, U+2030–U+205E) classified as ON — covers dashes, quotes, ellipsis, primes
 - BiDi: `resolveBidiRuns()` returns runs in visual order — for RTL paragraphs (paraLevel=1), runs are reversed so LTR text comes first (leftmost) and RTL text last (rightmost)
@@ -197,7 +198,10 @@ npm run lint # eslint src/ (ESLint 9 + typescript-eslint strict)
 - Bengali shaping: `shapeBengaliText()` — GSUB conjunct formation + GPOS mark positioning via `bengali-shaper.ts`
 - Tamil shaping: `shapeTamilText()` — GSUB substitution + split vowel decomposition via `tamil-shaper.ts`
 - Devanagari shaping: `shapeDevanagariText()` — cluster building, reph detection, matra reordering, split vowels, GSUB ligature conjuncts, GPOS mark positioning via `devanagari-shaper.ts`
-- GSUB LookupType 4 (LigatureSubst): `fontData.ligatures` — `Record` mapping first-glyph GID → arrays of `[resultGID, ...componentGIDs]`; `tryLigature()` pattern used by Bengali, Tamil, and Devanagari shapers
+- GSUB LookupType 4 (LigatureSubst): `fontData.ligatures` — `Record` mapping first-glyph GID → arrays of `[resultGID, ...componentsAfterKey]` (the first GID is the implicit lookup key, NOT included in the components array). Shared `tryLigature(gids, ligatures)` lives in `src/shaping/gsub-driver.ts` and is used by Bengali, Tamil, Devanagari, and Arabic shapers. Each shaper exposes a thin `tryLig(gids)` closure that forwards to the shared driver.
+- GPOS MarkBasePos: shared helpers in `src/shaping/gpos-positioner.ts` (`getBaseAnchor`, `getMarkAnchor`, `getMark2MarkAnchor`, `positionMarkOnBase(markAnchors, markGid, baseGid, baseAdv)`). Used by Devanagari and Arabic shapers. Arabic tracks `lastBaseGid` through the shaping pipeline (including lam-alef ligatures) and applies the anchor offset to transparent (joining type 'T') marks; falls back to (0, 0) when font lacks anchors.
+- Emoji: monochrome via Noto Emoji (OFL-1.1) under lang `'emoji'`. Detection in `src/shaping/script-registry.ts` (`EMOJI_RANGES`, `isEmojiCodepoint`, `containsEmoji`, `FITZPATRICK_START/END`, `ZWJ`, `VS15`, `VS16`). `detectCharLang(cp)` returns `'emoji'` for emoji codepoints; `splitTextByFont()` routes them to the registered `'emoji'` font automatically. Opt-in via `registerFont('emoji', () => import('pdfnative/fonts/noto-emoji-data.js'))`. COLRv1 colour emoji deferred to v1.2.
+- Latin VF (PDF/A): Noto Sans VF (OFL-1.1) bundled as `fonts/noto-sans-data.{js,d.ts}` under lang `'latin'`. Activates automatically for PDF/A documents containing non-WinAnsi Latin (curly quotes, em-dash, ellipsis…). Opt-in via `registerFont('latin', () => import('pdfnative/fonts/noto-sans-data.js'))`.

 ### API Design

@@ -231,7 +235,7 @@ npm run lint # eslint src/ (ESLint 9 + typescript-eslint strict)
 - **PDF /Info metadata** — Title, Producer (pdfnative), CreationDate in D:YYYYMMDDHHmmss format
 - **Input validation** — at `buildPDF()` boundary: null/undefined/type checks, 100K row limit
 - **URL validation** — at `validateURL()`: blocks javascript:, file:, data: schemes
-- **95%+ test coverage** — 1606+ tests (41 files), 48 fuzz edge-cases (including recursion/zip-bomb/xref-chain hardening), performance benchmarks
+- **95%+ test coverage** — 1726+ tests (48 files), 48 fuzz edge-cases (including recursion/zip-bomb/xref-chain hardening), performance benchmarks
 - **NPM provenance** — signed builds via GitHub Actions OIDC
 - Security: no `eval()`, no `Function()`, no dynamic code execution
 - No `console.log` in library code (only in tools/ and scripts/)
diff --git a/.github/workflows/verapdf.yml b/.github/workflows/verapdf.yml
index 951a67b..6db4918 100644
--- a/.github/workflows/verapdf.yml
+++ b/.github/workflows/verapdf.yml
@@ -106,9 +106,9 @@ jobs:
       - name: Generate sample PDFs
         run: npm run test:generate

-      # Pre-release mode for v1.0.4: keep veraPDF visible in CI but do not block merges yet.
-      # Reason: full PDF/A conformance is completed in v1.0.5 (Latin font embedding work).
-      # Switch this back to blocking in v1.0.5 by removing `continue-on-error`.
-      - name: Validate PDF/A claims (non-blocking pre-v1.0.5)
-        continue-on-error: true
+      # v1.1.0+: veraPDF is blocking. Every PDF/A-claiming sample (detected
+      # automatically via pdfaid:part in XMP) must validate against the
+      # corresponding PDF/A profile (1b, 2b, 2u, 3b). Non-PDF/A files are
+      # skipped by scripts/validate-pdfa.ts and never trigger failures.
+      - name: Validate PDF/A claims (blocking)
         run: npm run validate:pdfa
diff --git a/CHANGELOG.md b/CHANGELOG.md
index a84ee1a..f6a0f1f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,321 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+_No unreleased changes._
+
+## [1.1.0] – 2026-04-30
+
+Maximalist stable cut. Closes issues
+[#28](https://github.com/Nizoka/pdfnative/issues/28) (PDF/A Latin font
+embedding) and [#25](https://github.com/Nizoka/pdfnative/issues/25)
+(UAX #9 isolates + GPOS MarkBasePos for Arabic harakat), and adds
+monochrome emoji support. Folds the alpha.1 / alpha.2 medium-term items
+into a single stable release. 100% backward-compatible — all new
+features are opt-in. **1726 tests / 48 files green.** See full notes in
+[release-notes/v1.1.0.md](release-notes/v1.1.0.md).
+
+### Fixed
+
+- **core(pdfa):** PDF/A samples no longer reference unembedded
+  `Helvetica` / `Helvetica-Bold` standard-14 fonts when a Latin font
+  entry is registered. Object 3 and Object 4 are now emitted as Type0
+  redirector dictionaries pointing to the primary embedded font's
+  `CIDFontType2` / `FontFile2` chain — making `/F1` and `/F2` valid
+  embedded references for veraPDF (ISO 19005-1 §6.3.4 / ISO 19005-2
+  §6.2.11.4.1). Bold renders identical to regular under PDF/A in v1.1.0
+  (a future release will add Noto Sans Bold as a separate font module).
+- **core(xmp):** XMP metadata streams are now UTF-8 encoded via the new
+  `utf8EncodeBinaryString()` helper before passing through `toBytes()`.
+  Previously, `toBytes()` masked each char to `0xFF`, truncating
+  characters above U+00FF (em-dash, ellipsis, smart quotes, CJK) to
+  control bytes — which broke ISO 19005-1 §6.7.3 dc:title parity. Now
+  `<dc:title>` matches `/Info /Title` byte-for-byte.
+- **core(xmp):** `buildXMPMetadata()` now emits `<dc:description>` and
+  `<pdf:Keywords>` whenever `/Info /Subject` and `/Info /Keywords` are
+  set in the document metadata, satisfying ISO 19005-1 §6.7.3 t4 / t5
+  parity rules. Previously, PDF/A-1b validation failed with veraPDF
+  rules 6.7.3-4 and 6.7.3-5 on any document carrying `subject` or
+  `keywords` metadata. ISO 19005-2/3 was lenient on this and still
+  passed; v1.1.0 closes both gaps.
+- **core(encoding):** `createEncodingContext(fontEntries, pdfA)` accepts
+  an optional `pdfA` flag. When `true` and `fontEntries` is non-empty,
+  the WinAnsi/Helvetica fallback in mixed-content runs is disabled —
+  characters not covered by the primary CIDFont's cmap render as
+  `.notdef` (gid 0) instead of being routed to the unembedded Helvetica
+  Type1 font. Required for strict PDF/A conformance.
+- **scripts(samples):** `scripts/generators/pdfa-variants.ts` now
+  registers a `latin` font entry so `tagged-pdfa{1b,2b,2u,3b}.pdf` are
+  fully embedded (zero `Helvetica` references in the output).
+  `scripts/generators/pdfa-latin-embedding.ts` math operators paragraph
+  trimmed to characters covered by Noto Sans VF (number sets ℝ ℂ ℕ ℤ,
+  basic ops × ÷ ±) — Noto Sans Math support deferred.
+- **scripts(samples):** Five additional PDF/A-claiming sample
+  generators now register a `latin` font entry — `barcode-tagged.pdf`,
+  `compressed-tagged-pdfa2b.pdf`, `header-footer-tagged.pdf`,
+  `tagged-accessibility-complex.pdf`, `toc-tagged.pdf`. Closes the
+  remaining veraPDF rule 6.2.11.4.1-1 (font embedding) failures
+  reported by CI.
+- **core(annot):** Link annotations (`/Subtype /Link`, both `/URI` and
+  `/GoTo`) and form widget annotations (`/Subtype /Widget`) now emit
+  `/F 4` (Print flag set, NoView/Hidden/Invisible cleared) per ISO
+  19005-2 §6.5.3 / veraPDF rule 6.3.2-1. Required on every annotation
+  in PDF/A-2 / PDF/A-3.
+- **ci(verapdf):** veraPDF validation is now **blocking** on PRs and
+  pushes to `main` (the previous `continue-on-error: true` was a
+  pre-v1.0.5 placeholder). `scripts/validate-pdfa.ts` already
+  auto-detects PDF/A-claiming files via XMP `pdfaid:part`, so non-PDF/A
+  samples never trigger CI failures.
+
+### Notes
+
+- `Helvetica` / `Helvetica-Bold` standard-14 fonts are still emitted in
+  non-PDF/A mode and in the Latin-only path (no font entries) for
+  backward compatibility. To produce a strictly veraPDF-compliant
+  PDF/A, register Noto Sans VF: `registerFont('latin', () =>
+  import('pdfnative/fonts/noto-sans-data.js'))`.
+- Noto Emoji uses `defaultWidth=2600` over `unitsPerEm=2048` (≈1.27 em
+  per glyph), per the font's authoritative metrics. This produces wider
+  advance than typical Latin fonts in mixed-script paragraphs — visually
+  correct per the font designer's intent but may look spacious.
+
+### Added
+
+- **fonts(latin):** `fonts/noto-sans-data.{js,d.ts}` — Noto Sans VF
+  (OFL-1.1), 4515 glyphs / 3094 cmap entries. Opt-in via
+  `registerFont('latin', () => import('pdfnative/fonts/noto-sans-data.js'))`.
+  Activates automatically for PDF/A documents containing non-WinAnsi
+  Latin (curly quotes, em-dash, ellipsis…). Closes
+  [#28](https://github.com/Nizoka/pdfnative/issues/28).
+- **fonts(emoji):** `fonts/noto-emoji-data.{js,d.ts}` — Noto Emoji
+  monochrome (OFL-1.1), 1891 glyphs / 1489 cmap entries. Opt-in via
+  `registerFont('emoji', () => import('pdfnative/fonts/noto-emoji-data.js'))`.
+- **shaping(bidi):** UAX #9 isolate handling — LRI / RLI / FSI / PDI
+  (U+2066–U+2069) classified as `BN`, recursed via three-tier
+  dispatcher (`resolveBidiRuns` → `resolveBidiRunsForced` →
+  `resolveBidiCore`). Nested and unmatched isolates supported.
+  Closes the syntactic half of [#25](https://github.com/Nizoka/pdfnative/issues/25).
+- **shaping(arabic):** GPOS MarkBasePos applied to transparent marks
+  (harakat: fatha, kasra, damma, sukun, shadda, …). Marks now anchor
+  on the preceding base glyph. Closes the visual half of
+  [#25](https://github.com/Nizoka/pdfnative/issues/25).
+  ([src/shaping/arabic-shaper.ts](src/shaping/arabic-shaper.ts))
+- **shaping(drivers):** new shared `src/shaping/gsub-driver.ts`
+  (`tryLigature(gids, ligatures)`) and
+  `src/shaping/gpos-positioner.ts` (`getBaseAnchor`, `getMarkAnchor`,
+  `getMark2MarkAnchor`, `positionMarkOnBase`). Bengali / Tamil /
+  Devanagari / Arabic shapers route through these instead of three
+  duplicated implementations.
+- **shaping(emoji):** `EMOJI_RANGES`, `isEmojiCodepoint`,
+  `containsEmoji`, `FITZPATRICK_START/END`, `ZWJ`, `VS15`, `VS16` in
+  [src/shaping/script-registry.ts](src/shaping/script-registry.ts).
+  `detectCharLang()` returns `'emoji'` for emoji codepoints;
+  `detectFallbackLangs()` adds `'emoji'` to the set automatically.
+
+### Changed
+
+- **shaping(bidi):** `resolveBidiRuns()` rewritten as a recursive
+  isolate-aware dispatcher. Output byte-identical for inputs without
+  isolate characters.
+- **shaping(types):** `fixPunctuationAffinity` and `fixBracketPairing`
+  parameter types widened to `readonly number[]`. No public API impact.
+- **shaping(bengali, tamil, devanagari):** local `tryLigature`
+  removed; thin `tryLig(gids)` closure forwards to shared driver.
+  Output bytes unchanged.
+
+### Tests
+
+- 24 new tests in
+  [tests/shaping/phase2-shaping.test.ts](tests/shaping/phase2-shaping.test.ts)
+  (GSUB driver, GPOS positioner, BiDi isolates, Arabic MarkBasePos).
+- 15 new tests in [tests/shaping/emoji.test.ts](tests/shaping/emoji.test.ts)
+  (ranges, predicates, script-detect integration, baked module shape).
+- New PDF/A Latin embedding integration in
+  [tests/fonts/pdfa-latin-embedding.test.ts](tests/fonts/pdfa-latin-embedding.test.ts).
+- Total: **1726 / 1726 green** (48 files), up from 1674.
+
+### Deferred to v1.2.0
+
+- Full UAX #9 embeddings (LRE / RLE / LRO / RLO / PDF) —
+  isolates ship now; embeddings remain rare in practice.
+- True page-by-page constant-memory streaming
+  (`buildDocumentPDFStreamPageByPage()`).
+- COLRv1 colour emoji (v1.1.0 ships monochrome only).
+
+## [1.1.0-alpha.2] – 2026-04-29
+
+This iteration extends alpha.1 with two contained, fully-tested table-layout
+features that were on the v1.1.0 medium-term list, plus a small UX polish to
+the documentation site. The remaining epics (issue
+[#28](https://github.com/Nizoka/pdfnative/issues/28) PDF/A Latin font
+embedding, issue [#25](https://github.com/Nizoka/pdfnative/issues/25) full
+UAX #9 + multi-pass GSUB + GPOS MarkBasePos) and emoji support stay scheduled
+for v1.1.0 stable. True page-by-page constant-memory streaming is deferred
+to v1.2.0 because it requires an architectural refactor of `pdf-document.ts`
+that we don't want to ship under alpha-velocity.
+
+### Added
+
+- **core(table):** `TableBlock.clipCells?: boolean` (default `true`) —
+  every header and data cell is now wrapped in `q re W n ... Q` so
+  variable-width content cannot escape its column rectangle visually. The
+  existing character-cap (`ColumnDef.mx` / `mxH`) and clipping operate
+  in tandem; opt out with `clipCells: false` for byte-identical v1.0.x
+  output. ([src/core/pdf-renderers.ts](src/core/pdf-renderers.ts))
+- **core(table):** `TableBlock.autoFitColumns?: boolean` — when `true`,
+  column-width fractions are derived from actual measured content widths
+  (header at `fs.th`, cells at `fs.td`, plus 6 pt cell padding). The
+  resulting fractions are forwarded to `computeColumnPositions()` which
+  still honours per-column `minWidth` / `maxWidth` clamping. Defaults
+  to `false` for byte-stability. ([src/core/pdf-column-fit.ts](src/core/pdf-column-fit.ts))
+- **docs(site):** added live `pdfnative-mcp` npm version badge in the
+  hero badge strip, mirroring the existing `pdfnative-cli` badge.
+- **docs(site):** new compact one-line **live version strip** mounted
+  directly under the main `
+
+
+
@@ -99,6 +102,7 @@

Pure Native PDF Generation

npm provenance signed MIT License pdfnative-cli npm version + pdfnative-mcp npm version
npm install pdfnative @@ -647,6 +651,7 @@

Born from Production Needs