release: v0.3.53 — Java binding (8th) + OCR parity + markdown-extraction quality pass #533

Merged
yfedoseev merged 1 commit into main from release/v0.3.53 on May 23, 2026

Owner

@yfedoseev yfedoseev commented May 21, 2026

[0.3.53] — 2026-05-22

Java is the 8th binding, plus a markdown-extraction quality pass
and OCR parity across every prebuilt. Native Maven-Central artifact
on jni-rs 0.22 (JDK 11+, five-arch fat JAR), full v0.3.52 surface
parity. Published Python wheels and the Java JAR now ship OCR
(parity with Node / Go / C#). Markdown fixes: table-cell bold/italic
preserved, CamelCase brand names no longer split, spatial cell words
no longer fragment into columns, centered titles read in order.

Java binding (fyi.oxide:pdf-oxide:0.3.53)

Native JNI binding on jni-rs 0.22, JDK 11 LTS floor, five-arch fat JAR
(linux x86_64/aarch64, macOS x86_64/aarch64, windows x86_64). Full
v0.3.52 surface parity across text / markdown / AutoExtractor / forms
/ render / PAdES B-B+B-T+B-LT / destructive redaction /
split-by-bookmarks / compliance / crypto-policy. Free Kotlin interop.
New pdf_oxide_jni crate; CI java + fips-java jobs; release
build-java-native + package-java-jar + publish-maven
(autoPublish=false per the release gate). 52 JNI symbols, 9 classes,
82 JUnit tests.

OCR parity across all prebuilts

Published Python wheels (glibc + musl) and the Java JAR now build with
OCR — previously CI tested OCR but release.yml shipped wheels
without it. FIPS variants deliberately exclude OCR.

Markdown-extraction quality pass (root-cause fixes + TDD)

Every reading-order/table change was gated by a 70-PDF
baseline-vs-HEAD sweep:

  • Table cells preserve bold/italic (table_extractor populates
    cell.spans).
  • CamelCase brand names no longer split ("SalesForce", not
    "SalesF orce"): a TJ-kerning gap misread as a word space is
    repaired for the ASCII lower→UPPER signature only; all-caps
    words and acronyms are untouched.
  • Spatial cell words no longer fragment into per-word columns
    (row-coverage filter, gated so it never fabricates tables).
  • Centered titles read in order (XY-cut centered-block guard).
  • Fewer fragmented headings; KPI numeric runs collapse; stray pipes
    escaped.
  • Content-preservation policy: post-processing never drops/rewrites
    legitimate text (band-aids that did were removed after the sweep
    caught real-document damage).
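
To illustrate the "stray pipes escaped" item, here is a minimal Rust sketch of escaping literal pipe characters so that extracted text is not misread as a Markdown table delimiter. The function name `escape_pipes` and the rule for already-escaped pipes are assumptions made for this example; the crate's actual implementation may differ.

```rust
/// Hypothetical helper: escape every `|` that is not already preceded by
/// an unescaped backslash, so Markdown renderers treat it as literal text.
fn escape_pipes(cell: &str) -> String {
    let mut out = String::with_capacity(cell.len());
    // True when the previous character was a backslash that itself
    // was not escaped (i.e. it escapes the current character).
    let mut prev_backslash = false;
    for c in cell.chars() {
        if c == '|' && !prev_backslash {
            out.push('\\');
        }
        out.push(c);
        prev_backslash = c == '\\' && !prev_backslash;
    }
    out
}
```

Calling `escape_pipes("a | b")` yields `a \| b`, while input that already contains `\|` is returned unchanged.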

Review nits (this PR)

Doc accuracy (Cleaner backstop, arch counts), PageClass↔PageKind
parity, dead-code, stale Javadoc.

CI / Release hygiene

Composite free-disk-space action (fixes the Code Coverage OOM);
macOS FIPS Java deferred (documented).

Known issue

Tight two-column prose bodies can still interleave in reading
order (#534) — deferred; needs a table-vs-prose classifier. Two fix
attempts were reverted after the sweep caught table-data corruption.


Test plan

  • Full Rust lib suite green (5,386 pass / 0 fail / 2 ignored)
  • spatial_table_detector + xycut + extractors::text suites green
  • 70-PDF baseline(0.3.52)-vs-HEAD sweep: text/md/html diffs all
    classified improvement or pre-existing-flaky; no regressions
  • Reporter issue PDFs validated on the HEAD build
  • CI green on the squashed commit
  • Maven Central upload reaches VALIDATED (maintainer flips Publish)

Contributor

Copilot AI left a comment


Pull request overview

This PR prepares the v0.3.53 release by adding a new first-class Java/JNI binding (plus tests and CI integration) and bumping all published package versions across the workspace and other language artifacts.

Changes:

  • Introduces the pdf_oxide_jni Rust workspace crate (cdylib) and a full Java API surface (fyi.oxide.pdf.*) with JUnit coverage.
  • Adds/extends GitHub Actions jobs to build/test the Java binding (including a FIPS-mode build of the JNI shim).
  • Bumps versions to 0.3.53 across Cargo workspace crates, Python, Node, and C# packaging, and updates top-level README language list.

Reviewed changes

Copilot reviewed 114 out of 115 changed files in this pull request and generated 11 comments.

File Description
README.md Adds Java to the bindings list and provides Maven/Gradle install snippet.
pyproject.toml Bumps Python package version to 0.3.53.
Cargo.toml Adds pdf_oxide_jni to workspace members; bumps crate version to 0.3.53.
Cargo.lock Locks new JNI-related dependencies and bumps workspace crate versions.
pdf_oxide_cli/Cargo.toml Bumps CLI crate + dependency version to 0.3.53.
pdf_oxide_mcp/Cargo.toml Bumps MCP crate + dependency version to 0.3.53.
js/package.json Bumps Node package version to 0.3.53.
csharp/PdfOxide/PdfOxide.csproj Bumps .NET package version to 0.3.53.
.github/workflows/ci.yml Adds build/test flow for Java binding; adds java-jni build-lib variant.
.github/workflows/ci-fips.yml Adds FIPS build/test job for Java JNI shim.
java/.gitignore Ignores Maven output and staged native resource libs.
java/pom.xml Maven build + test configuration for the Java binding artifact.
java/README.md Java/Kotlin binding documentation and usage examples.
java/src/main/java/fyi/oxide/pdf/internal/NativeLoader.java Loads embedded native libraries from the jar at runtime.
java/src/main/java/fyi/oxide/pdf/PdfDocument.java Core Java document API (open/extract/render/search/etc.).
java/src/main/java/fyi/oxide/pdf/PdfPage.java Per-page Java API delegating to document handle.
java/src/main/java/fyi/oxide/pdf/MarkdownConverter.java Markdown/HTML conversion façade.
java/src/main/java/fyi/oxide/pdf/Pdf.java PDF creation + split-by-bookmarks API.
java/src/main/java/fyi/oxide/pdf/DocumentEditor.java Write/edit surface (forms/redaction/save).
java/src/main/java/fyi/oxide/pdf/AutoExtractor.java Auto-extraction/classification surface and JSON escape hatch.
java/src/main/java/fyi/oxide/pdf/PdfSigner.java Java signature API (PAdES).
java/src/main/java/fyi/oxide/pdf/PdfValidator.java Java compliance validator façade (PDF/A, PDF/UA).
java/src/main/java/fyi/oxide/pdf/PdfPolicy.java Java crypto policy façade (set-once).
java/src/main/java/fyi/oxide/pdf/annotation/Annotation.java Annotation value type.
java/src/main/java/fyi/oxide/pdf/annotation/AnnotationType.java Annotation subtype enum.
java/src/main/java/fyi/oxide/pdf/auto/AutoResult.java Auto-extraction result value type.
java/src/main/java/fyi/oxide/pdf/auto/ClassifyResult.java Auto-extractor classification result value type.
java/src/main/java/fyi/oxide/pdf/auto/ExtractMode.java Auto-extractor mode enum.
java/src/main/java/fyi/oxide/pdf/auto/ExtractReason.java Degradation reason enum.
java/src/main/java/fyi/oxide/pdf/auto/PageClass.java Page classification enum for classify APIs.
java/src/main/java/fyi/oxide/pdf/auto/RegionResult.java Per-region auto-extraction value type.
java/src/main/java/fyi/oxide/pdf/compliance/PdfALevel.java PDF/A level enum.
java/src/main/java/fyi/oxide/pdf/compliance/PdfUaLevel.java PDF/UA level enum.
java/src/main/java/fyi/oxide/pdf/compliance/PdfXLevel.java PDF/X level enum (placeholder surface).
java/src/main/java/fyi/oxide/pdf/compliance/ValidationResult.java Validation result value type.
java/src/main/java/fyi/oxide/pdf/compliance/ValidationViolation.java Validation violation value type.
java/src/main/java/fyi/oxide/pdf/exception/PdfException.java Base unchecked exception type.
java/src/main/java/fyi/oxide/pdf/exception/PdfErrorKind.java Error-kind enum for dispatch.
java/src/main/java/fyi/oxide/pdf/exception/PdfEncryptedException.java Typed exception subtype.
java/src/main/java/fyi/oxide/pdf/exception/PdfInvalidStateException.java Typed exception subtype.
java/src/main/java/fyi/oxide/pdf/exception/PdfIoException.java Typed exception subtype.
java/src/main/java/fyi/oxide/pdf/exception/PdfOcrUnavailableException.java Typed exception subtype.
java/src/main/java/fyi/oxide/pdf/exception/PdfParseException.java Typed exception subtype.
java/src/main/java/fyi/oxide/pdf/exception/PdfPermissionException.java Typed exception subtype.
java/src/main/java/fyi/oxide/pdf/exception/PdfSignatureException.java Typed exception subtype.
java/src/main/java/fyi/oxide/pdf/exception/PdfUnsupportedException.java Typed exception subtype.
java/src/main/java/fyi/oxide/pdf/form/FormField.java Form-field value type.
java/src/main/java/fyi/oxide/pdf/form/FormFieldType.java Form-field type enum.
java/src/main/java/fyi/oxide/pdf/geometry/BBox.java Geometry value type.
java/src/main/java/fyi/oxide/pdf/geometry/Color.java Geometry value type.
java/src/main/java/fyi/oxide/pdf/geometry/Point.java Geometry value type.
java/src/main/java/fyi/oxide/pdf/geometry/Rect.java Geometry value type.
java/src/main/java/fyi/oxide/pdf/image/ExtractedImage.java Extracted image value type.
java/src/main/java/fyi/oxide/pdf/image/ImageFormat.java Image format enum.
java/src/main/java/fyi/oxide/pdf/metadata/DocumentInfo.java Info-dict value type.
java/src/main/java/fyi/oxide/pdf/metadata/XmpMetadata.java XMP metadata value type.
java/src/main/java/fyi/oxide/pdf/policy/PolicyMode.java Policy-mode enum.
java/src/main/java/fyi/oxide/pdf/policy/SecurityPolicy.java Policy value type + builder.
java/src/main/java/fyi/oxide/pdf/redaction/RedactResult.java Redaction result value type.
java/src/main/java/fyi/oxide/pdf/render/PixelFormat.java Rendering pixel format enum.
java/src/main/java/fyi/oxide/pdf/search/SearchMatch.java Search match value type.
java/src/main/java/fyi/oxide/pdf/search/SearchOptions.java Search options value type + builder.
java/src/main/java/fyi/oxide/pdf/search/SearchResult.java Search result wrapper value type.
java/src/main/java/fyi/oxide/pdf/signature/SignOptions.java Signing options value type + builder.
java/src/main/java/fyi/oxide/pdf/signature/SignatureLevel.java PAdES level enum.
java/src/main/java/fyi/oxide/pdf/split/BookmarkSegment.java Split segment metadata value type.
java/src/main/java/fyi/oxide/pdf/split/SplitByBookmarksOptions.java Split-by-bookmarks options + builder.
java/src/main/java/fyi/oxide/pdf/table/Table.java Table value type.
java/src/main/java/fyi/oxide/pdf/table/TableCell.java Table cell value type.
java/src/main/java/fyi/oxide/pdf/text/TextChar.java Text value type.
java/src/main/java/fyi/oxide/pdf/text/TextLine.java Text value type.
java/src/main/java/fyi/oxide/pdf/text/TextSpan.java Text value type.
java/src/main/java/fyi/oxide/pdf/text/TextStyle.java Text value type.
java/src/main/java/fyi/oxide/pdf/text/TextWord.java Text value type.
java/src/test/java/fyi/oxide/pdf/PdfDocumentTest.java JUnit coverage for core document behaviors.
java/src/test/java/fyi/oxide/pdf/PdfPageTest.java JUnit coverage for page APIs.
java/src/test/java/fyi/oxide/pdf/MarkdownConverterTest.java JUnit coverage for markdown/html conversion.
java/src/test/java/fyi/oxide/pdf/PdfCreationTest.java JUnit coverage for PDF creation APIs.
java/src/test/java/fyi/oxide/pdf/SplitTest.java JUnit coverage for split-by-bookmarks.
java/src/test/java/fyi/oxide/pdf/RenderTest.java JUnit coverage for rendering APIs.
java/src/test/java/fyi/oxide/pdf/DocumentEditorTest.java JUnit coverage for editor APIs (save/redaction/forms).
java/src/test/java/fyi/oxide/pdf/PdfPolicyTest.java JUnit coverage for set-once crypto policy.
java/src/test/java/fyi/oxide/pdf/PdfValidatorTest.java JUnit coverage for compliance checks.
java/src/test/java/fyi/oxide/pdf/PdfSignerTest.java JUnit coverage for signature classification behavior.
java/src/test/java/fyi/oxide/pdf/PdfSignerSignIntegrationTest.java Integration tests for signing (incl. TSA-gated).
java/src/test/java/fyi/oxide/pdf/geometry/GeometryTest.java Pure-Java tests for geometry types.
java/src/test/java/fyi/oxide/pdf/exception/ExceptionHierarchyTest.java Pure-Java tests for exception taxonomy.
pdf_oxide_jni/Cargo.toml New JNI shim crate definition + feature mirroring.
pdf_oxide_jni/README.md JNI shim build/packaging documentation.
pdf_oxide_jni/src/lib.rs JNI shim module layout + JNI_OnLoad/OnUnload.
pdf_oxide_jni/src/error.rs Rust→Java exception mapping + throw helpers.
pdf_oxide_jni/src/pdf_document.rs JNI entrypoints backing PdfDocument.
pdf_oxide_jni/src/pdf_page.rs JNI entrypoints backing PdfPage.
pdf_oxide_jni/src/markdown.rs JNI entrypoints for MarkdownConverter.
pdf_oxide_jni/src/search.rs JNI entrypoints for text search.
pdf_oxide_jni/src/forms.rs JNI entrypoints for form field extraction.
pdf_oxide_jni/src/annotations.rs JNI entrypoints for annotations extraction.
pdf_oxide_jni/src/auto_extractor.rs JNI entrypoints for classification + JSON extraction.
pdf_oxide_jni/src/pdf.rs JNI entrypoints for PDF creation/fromImages + save/close.
pdf_oxide_jni/src/split.rs JNI entrypoints for split-by-bookmarks.
pdf_oxide_jni/src/policy.rs JNI entrypoints for crypto policy.
pdf_oxide_jni/src/validator.rs JNI entrypoints for PDF/A + PDF/UA boolean checks.
pdf_oxide_jni/src/signatures_pades.rs JNI entrypoints for signing/classification (PAdES).
pdf_oxide_jni/src/editor.rs JNI entrypoints for DocumentEditor.
pdf_oxide_jni/src/render.rs JNI entrypoints for rendering (feature-gated).
pdf_oxide_jni/src/text.rs Placeholder module for future text marshalling consolidation.
pdf_oxide_jni/src/images.rs Placeholder module for future image marshalling consolidation.
pdf_oxide_jni/src/metadata.rs Placeholder module for future metadata marshalling consolidation.
pdf_oxide_jni/src/attachments.rs Placeholder module for future attachments marshalling.
pdf_oxide_jni/src/dom.rs Placeholder module for future DOM surface marshalling.
pdf_oxide_jni/src/compliance.rs Placeholder module for future full ValidationResult marshalling.
pdf_oxide_jni/src/redaction.rs Placeholder module for future redaction marshalling consolidation.

Comment thread java/src/main/java/fyi/oxide/pdf/DocumentEditor.java Outdated
Comment thread java/src/main/java/fyi/oxide/pdf/DocumentEditor.java Outdated
Comment thread java/README.md
Comment thread pdf_oxide_jni/src/auto_extractor.rs
Comment thread java/src/main/java/fyi/oxide/pdf/auto/PageClass.java
Comment thread .github/workflows/ci.yml Outdated
Comment thread .github/workflows/ci-fips.yml Outdated
Comment thread pdf_oxide_jni/src/pdf.rs
Comment thread pdf_oxide_jni/src/annotations.rs
Comment thread java/src/main/java/fyi/oxide/pdf/PdfPage.java Outdated
@yfedoseev yfedoseev force-pushed the release/v0.3.53 branch 8 times, most recently from de3dfc3 to 5557d5a on May 21, 2026 07:10
yfedoseev added a commit that referenced this pull request May 21, 2026
Address Copilot review nits on PR #533 that are safe to land pre-Maven-
publish:

- pdf_oxide_jni/src/annotations.rs: remove dead `(x0, y0, x1, y1)`
  destructure left over from a borrow-checker workaround.
- java/PdfPage.java: drop stale "stubbed to UnsupportedOperation"
  Javadoc — nativeWords/Lines/Chars are implemented in pdf_oxide_jni
  since #519.
- java/auto/PageClass: trim CHART/ENCRYPTED enum values that the Rust
  PageKind never produces. Chart/encrypted-permission states already
  surface through ExtractReason.CHART_NOT_TRANSCRIBED and
  ENCRYPTED_NO_EXTRACT_PERMISSION. Lands now to avoid an ordinal-
  breaking change after Maven Central publish.
- pdf_oxide_jni/src/auto_extractor.rs: tighten the ordinal-mapping
  comment to match the trimmed Java enum.
- java/DocumentEditor + java/README "Lifecycle": stop claiming a
  Cleaner backstop exists for DocumentEditor / Pdf — only PdfDocument
  registers a Cleaner. Doc updated to require explicit close() for
  the other two.
- pdf_oxide_jni/README + .github/workflows/ci.yml: correct the
  "six arches" claim — the JAR bundles five (linux x86_64/aarch64,
  macOS x86_64/aarch64, windows x86_64); no musl native ships.
- .github/workflows/ci-fips.yml: refine the macos-latest deferral
  comment with what the previous CI run actually showed (dylib
  present at correct path, bare UnsatisfiedLinkError from
  System.load on JDK 11 macos-15 aarch64 — likely aws-lc-fips
  runtime symbol or Hardened-Runtime issue, not a path/extension
  bug). Tracks the follow-up investigation rather than a hardware-
  specific guess.
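
The PageClass item above turns on the fact that the JNI shim and the Java enum agree on ordinals, so removing or reordering a variant after publication would break callers. A minimal sketch of such a mapping follows. The `PageKind` variants shown here are hypothetical placeholders (the real variant list is not given in this PR text), and both function names are invented for illustration.

```rust
/// Hypothetical stand-in for the Rust-side page classification.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PageKind {
    Text,
    Scanned,
    Mixed,
    Empty,
}

/// Ordinal that the Java side would use to index `PageClass.values()`.
/// The explicit match means that reordering the Rust enum cannot
/// silently change the value handed across the JNI boundary.
fn page_kind_to_java_ordinal(kind: PageKind) -> i32 {
    match kind {
        PageKind::Text => 0,
        PageKind::Scanned => 1,
        PageKind::Mixed => 2,
        PageKind::Empty => 3,
    }
}

/// Inverse mapping; returns `None` for an ordinal the Java enum does not define.
fn java_ordinal_to_page_kind(ordinal: i32) -> Option<PageKind> {
    match ordinal {
        0 => Some(PageKind::Text),
        1 => Some(PageKind::Scanned),
        2 => Some(PageKind::Mixed),
        3 => Some(PageKind::Empty),
        _ => None,
    }
}
```

Under this scheme, a Java enum constant that the Rust side never produces is dead weight, and trimming it later would shift the ordinals of every constant after it, which is presumably why the trim lands before the Maven Central publish.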
yfedoseev added a commit that referenced this pull request May 21, 2026
Root-cause of the v0.3.53 PR #533 Code Coverage job failure: the
job called jlumbroso/free-disk-space with default settings, which
includes `swap-storage: true` — i.e. removes the runner's 4 GB
swapfile. cargo-llvm-cov's instrumented build (3× larger than the
normal release build) then OOM-killed mid-link, and the runner's
"Generate coverage" step died with no completed status. Every other
free-disk-space callsite in the repo set `swap-storage: false`
explicitly, with a comment warning about exactly this failure mode
("removing swap causes OOM-induced SIGBUS in the linker"). The
Coverage callsite missed the override — copy-paste drift.

Fix holistically rather than patching one line:

- New composite action `.github/actions/free-disk-space/`:
  - `swap-storage: false` is locked in (not exposed as an input)
  - `aggressive` input controls large-packages (default true)
  - `tool-cache` input (default true) — set to false in jobs that
    need setup-python / setup-node hits
  - `df -h` diagnostics before/after so the next disk-pressure
    regression is visible in the run log
  - jlumbroso/free-disk-space pinned to its current SHA (was @main)

- All 6 callsites converted to `uses: ./.github/actions/free-disk-space`:
  - ci.yml Test job (was: tool-cache:true, large-packages:true)
  - ci.yml Python wheel job (tool-cache: 'false')
  - ci.yml WASM Build (tool-cache: 'false')
  - ci.yml Code Coverage — bug fix lands here (was missing
    swap-storage:false; composite now locks it in)
  - python.yml Python tests (defaults)
  - release.yml wheel build (tool-cache: 'false')

Net change: -42 lines of duplicated YAML, one source of truth for
runner-disk policy, OOM lesson encoded in the action so future
callsites can't drift back. df -h is now printed at every reclaim
so the next disk regression is debuggable in the run log without
needing local repro.
yfedoseev added a commit that referenced this pull request May 22, 2026
…n quality pass

Released 2026-05-22.

## Java is the 8th binding (fyi.oxide:pdf-oxide:0.3.53)
Native Maven-Central JNI binding on jni-rs 0.22, JDK 11 LTS floor,
five-arch fat JAR (linux x86_64/aarch64, macOS x86_64/aarch64,
windows x86_64). Full v0.3.52 surface parity across text / markdown /
AutoExtractor / forms / render / PAdES B-B+B-T+B-LT / destructive
redaction / split-by-bookmarks / compliance / crypto-policy. Free
Kotlin interop via the same JAR. New `pdf_oxide_jni` workspace crate;
CI `java` + `fips-java` jobs; release `build-java-native` +
`package-java-jar` + `publish-maven` (autoPublish=false per the
release gate). 52 JNI symbols, 9 wired classes, 82 JUnit tests.

## OCR parity across all prebuilts
The published Python wheels (glibc + musl) and the Java JAR now build
with OCR — previously CI tested `--features python,ocr,barcodes` but
release.yml shipped `--features python`, so PyPI users got no OCR.
Java JNI now builds the full ocr,rendering,signatures,barcodes,
tsa-client,system-fonts set, matching the Node/Go/C# native cdylib.
FIPS variants deliberately exclude OCR.

## Markdown-extraction quality pass
Root-cause fixes (with regression tests + a 70-PDF baseline-vs-HEAD
sweep gating every reading-order/table change):
- Table cells preserve bold/italic — tagged-PDF table_extractor now
  populates cell.spans instead of joined-text-only.
- CamelCase brand names no longer split ("SalesForce", not
  "SalesF orce") — repairs TJ-kerning misread as a word space, ASCII
  lower→UPPER signature in sparse-width spans only; all-caps and
  acronyms untouched.
- Spatial cell words no longer fragment into per-word columns —
  row-coverage phantom-column filter, gated so it only refines an
  already-detected table and never fabricates one from prose.
- Centered titles read in document order — centered-block guard in
  XY-cut keeps small centered blocks single-column.
- Fewer fragmented headings (word-per-heading + wrapped); KPI
  numeric-only heading runs collapse to a list; stray pipes escaped.
- Content-preservation policy: post-processing never drops/rewrites
  legitimate text. Band-aids that filtered page-numbers, rewrote
  bullet codepoints, flattened sparse-real tables, or deduped
  repeated content were removed after the sweep proved they damaged
  real documents.
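
The CamelCase repair described in the list above can be sketched roughly as follows. This is a simplified illustration, not the crate's code: it checks only the ASCII lower→UPPER signature at the end of the left token and omits the sparse-width-span gate, so on its own it would also join legitimate pairs such as "pH level". The function names are hypothetical.

```rust
/// True when `left` ends with an ASCII lowercase letter followed by an
/// ASCII uppercase letter (e.g. "SalesF") and `right` starts with an
/// ASCII lowercase letter (e.g. "orce").
fn looks_like_split_camel(left: &str, right: &str) -> bool {
    let lb = left.as_bytes();
    let rb = right.as_bytes();
    if lb.len() < 2 || rb.is_empty() {
        return false;
    }
    let prev = lb[lb.len() - 2];
    let last = lb[lb.len() - 1];
    prev.is_ascii_lowercase() && last.is_ascii_uppercase() && rb[0].is_ascii_lowercase()
}

/// Rejoin space-separated tokens that match the split-CamelCase signature.
/// All-caps tokens and acronyms never match, because the character before
/// the final uppercase letter is not lowercase.
fn repair_camel_splits(text: &str) -> String {
    let mut out = String::new();
    let mut prev: Option<&str> = None;
    for word in text.split(' ') {
        if let Some(p) = prev {
            if !looks_like_split_camel(p, word) {
                out.push(' ');
            }
        }
        out.push_str(word);
        prev = Some(word);
    }
    out
}
```

With this sketch, `repair_camel_splits("SalesF orce CRM")` returns `SalesForce CRM`, and `NASA launch` passes through unchanged.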

## Review nits (PR #533)
Doc accuracy (DocumentEditor/Pdf Cleaner backstop, arch counts),
PageClass enum parity with Rust PageKind, annotations.rs dead-code,
PdfPage stale Javadoc.

## CI / Release hygiene
- Composite action .github/actions/free-disk-space (single source of
  truth; swap-storage:false locked in; df -h diagnostics) replaces 6
  drifted callsites — fixes the Code Coverage OOM on PR #533.
- macOS FIPS Java deferred (documented UnsatisfiedLinkError).

## Known issue
Tight two-column PROSE bodies can still interleave in reading order
(#534). A safe fix needs a table-vs-prose classifier; two attempts
(valley-threshold + structural detector) were reverted after the
sweep caught table-data corruption — both documented inline in
xycut.rs.
@yfedoseev yfedoseev changed the title from "release: v0.3.53 — Java is the 8th binding (jni-rs 0.22, JDK 11+, fyi.oxide:pdf-oxide)" to "release: v0.3.53 — Java binding (8th) + OCR parity + markdown-extraction quality pass" on May 22, 2026
@yfedoseev yfedoseev requested a review from Copilot May 22, 2026 03:31
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 122 out of 123 changed files in this pull request and generated 5 comments.

Comment thread java/src/main/java/fyi/oxide/pdf/MarkdownConverter.java Outdated
Comment thread pdf_oxide_jni/src/auto_extractor.rs
Comment thread pdf_oxide_jni/src/split.rs Outdated
Comment thread java/src/main/java/fyi/oxide/pdf/geometry/Color.java
Comment thread pdf_oxide_jni/src/auto_extractor.rs
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 122 out of 123 changed files in this pull request and generated 4 comments.

Comment thread src/structure/table_extractor.rs
Comment thread pdf_oxide_jni/src/auto_extractor.rs Outdated
Comment thread pdf_oxide_jni/src/auto_extractor.rs Outdated
Comment thread pdf_oxide_jni/src/render.rs
yfedoseev added a commit that referenced this pull request May 22, 2026
…n quality pass

Released 2026-05-22.

## Java is the 8th binding (fyi.oxide:pdf-oxide:0.3.53)
Native Maven-Central JNI binding on jni-rs 0.22, JDK 11 LTS floor,
five-arch fat JAR (linux x86_64/aarch64, macOS x86_64/aarch64,
windows x86_64). Full v0.3.52 surface parity across text / markdown /
AutoExtractor / forms / render / PAdES B-B+B-T+B-LT / destructive
redaction / split-by-bookmarks / compliance / crypto-policy. Free
Kotlin interop via the same JAR. New `pdf_oxide_jni` workspace crate;
CI `java` + `fips-java` jobs; release `build-java-native` +
`package-java-jar` + `publish-maven` (autoPublish=false per the
release gate). 52 JNI symbols, 9 wired classes, 82 JUnit tests.

## OCR parity across all prebuilts
The published Python wheels (glibc + musl) and the Java JAR now build
with OCR — previously CI tested `--features python,ocr,barcodes` but
release.yml shipped `--features python`, so PyPI users got no OCR.
Java JNI now builds the full ocr,rendering,signatures,barcodes,
tsa-client,system-fonts set, matching the Node/Go/C# native cdylib.
FIPS variants deliberately exclude OCR.

## Markdown-extraction quality pass
Root-cause fixes (with regression tests + a 70-PDF baseline-vs-HEAD
sweep gating every reading-order/table change):
- Table cells preserve bold/italic — tagged-PDF table_extractor now
  populates cell.spans instead of joined-text-only.
- CamelCase brand names no longer split ("SalesForce", not
  "SalesF orce") — repairs TJ-kerning misread as a word space, ASCII
  lower→UPPER signature in sparse-width spans only; all-caps and
  acronyms untouched.
- Spatial cell words no longer fragment into per-word columns —
  row-coverage phantom-column filter, gated so it only refines an
  already-detected table and never fabricates one from prose.
- Centered titles read in document order — centered-block guard in
  XY-cut keeps small centered blocks single-column.
- Fewer fragmented headings (word-per-heading + wrapped); KPI
  numeric-only heading runs collapse to a list; stray pipes escaped.
- Content-preservation policy: post-processing never drops/rewrites
  legitimate text. Band-aids that filtered page-numbers, rewrote
  bullet codepoints, flattened sparse-real tables, or deduped
  repeated content were removed after the sweep proved they damaged
  real documents.

## Review nits (PR #533)
Doc accuracy (DocumentEditor/Pdf Cleaner backstop, arch counts),
PageClass enum parity with Rust PageKind, annotations.rs dead-code,
PdfPage stale Javadoc.

## CI / Release hygiene
- Composite action .github/actions/free-disk-space (single source of
  truth; swap-storage:false locked in; df -h diagnostics) replaces 6
  drifted callsites — fixes the Code Coverage OOM on PR #533.
- macOS FIPS Java deferred (documented UnsatisfiedLinkError).

## Known issue
Tight two-column PROSE bodies can still interleave in reading order
(#534). A safe fix needs a table-vs-prose classifier; two attempts
(valley-threshold + structural detector) were reverted after the
sweep caught table-data corruption — both documented inline in
xycut.rs.
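
To see why a plain valley threshold struggles here, consider a generic projection-profile split of the kind XY-cut performs. The sketch below is an illustration under that assumption, not the code in xycut.rs; `find_column_cuts` and `min_gap` are hypothetical names.

```rust
/// Generic XY-cut style column split (illustrative only).
/// Each block is an (x_min, x_max) extent. The function returns the midpoint
/// of every horizontal gap that is at least `min_gap` wide.
fn find_column_cuts(blocks: &[(f32, f32)], min_gap: f32) -> Vec<f32> {
    let mut spans: Vec<(f32, f32)> = blocks.to_vec();
    // Sweep the extents left to right by their starting x coordinate.
    spans.sort_by(|a, b| a.0.total_cmp(&b.0));
    let mut cuts = Vec::new();
    let mut covered_to = match spans.first() {
        Some(first) => first.1,
        None => return cuts,
    };
    for &(x0, x1) in &spans[1..] {
        if x0 - covered_to >= min_gap {
            // A wide enough valley: cut in the middle of the gap.
            cuts.push((covered_to + x0) / 2.0);
        }
        covered_to = covered_to.max(x1);
    }
    cuts
}
```

With `min_gap = 15.0`, two columns separated by a 20 pt gutter produce one cut, while a tight 8 pt gutter produces none, so the two columns are read as a single block and their lines interleave. Lowering the threshold enough to catch the tight gutter would also start cutting through narrow gaps between table columns, which is one plausible reading of the table-data corruption the sweep caught.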
@yfedoseev yfedoseev requested a review from Copilot May 22, 2026 05:47

Copilot AI left a comment

Pull request overview

Copilot reviewed 122 out of 123 changed files in this pull request and generated 3 comments.

Comment thread java/src/main/java/fyi/oxide/pdf/MarkdownConverter.java Outdated
Comment thread src/structure/table_extractor.rs
Comment thread pdf_oxide_jni/src/split.rs Outdated
yfedoseev added a commit that referenced this pull request May 22, 2026
- table_extractor: carry the inter-block space into the synthesized
  cell spans so the markdown/HTML table renderers (which reconstruct
  spacing from spans, not cell_text) don't glue tokens across wrapped
  lines; add a test asserting bold/italic + spacing propagation on the
  tagged-PDF MCID->TextBlock path.
- auto_extractor JNI: build the serde-error JSON fallback via
  serde_json so a failure message can't emit invalid JSON.
- split JNI: drop the unused jbyteArray import and the dead _UNUSED
  const that only kept it alive.
- MarkdownConverter: remove the unused PdfInvalidStateException import.
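
The table_extractor item above can be sketched as follows. The `Span` type and both function names are hypothetical and do not mirror the crate's real types; the sketch only assumes that a renderer rebuilds cell text by concatenating span texts, so the separating space has to live inside the spans themselves.

```rust
/// Hypothetical styled span; the real pdf_oxide types are not shown here.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    text: String,
    bold: bool,
    italic: bool,
}

/// Build cell spans from consecutive text blocks, carrying one separating
/// space into each following span so that concatenation does not glue
/// tokens across wrapped lines. Style flags are copied through unchanged.
fn synthesize_cell_spans(blocks: &[Span]) -> Vec<Span> {
    let mut spans: Vec<Span> = Vec::with_capacity(blocks.len());
    for block in blocks {
        if block.text.is_empty() {
            continue; // nothing to render for an empty block
        }
        let mut span = block.clone();
        if let Some(prev) = spans.last() {
            let needs_space = !prev.text.ends_with(char::is_whitespace)
                && !span.text.starts_with(char::is_whitespace);
            if needs_space {
                span.text.insert(0, ' ');
            }
        }
        spans.push(span);
    }
    spans
}

/// Rebuild plain cell text the way a span-based renderer would.
fn cell_text_from_spans(spans: &[Span]) -> String {
    spans.iter().map(|s| s.text.as_str()).collect()
}
```

Given a bold block "Net" followed by an italic block "revenue", the rebuilt cell text keeps the space between the two words and each span keeps its own style flag.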
@yfedoseev yfedoseev merged commit cc77438 into main May 23, 2026
187 checks passed
@yfedoseev yfedoseev deleted the release/v0.3.53 branch May 23, 2026 02:25