release: v0.3.53 — Java binding (8th) + OCR parity + markdown-extraction quality pass by yfedoseev · Pull Request #533 · yfedoseev/pdf_oxide

yfedoseev · 2026-05-21T02:41:16Z

[0.3.53] — 2026-05-22

Java is the 8th binding, plus a markdown-extraction quality pass
and OCR parity across every prebuilt. Native Maven-Central artifact
on jni-rs 0.22 (JDK 11+, five-arch fat JAR), full v0.3.52 surface
parity. Published Python wheels and the Java JAR now ship OCR
(parity with Node / Go / C#). Markdown fixes: table-cell bold/italic
preserved, CamelCase brand names no longer split, spatial cell words
no longer fragment into columns, centered titles read in order.

Java binding (`fyi.oxide:pdf-oxide:0.3.53`)

Native JNI binding on jni-rs 0.22, JDK 11 LTS floor, five-arch fat JAR
(linux x86_64/aarch64, macOS x86_64/aarch64, windows x86_64). Full
v0.3.52 surface parity across text / markdown / AutoExtractor / forms
/ render / PAdES B-B+B-T+B-LT / destructive redaction /
split-by-bookmarks / compliance / crypto-policy. Free Kotlin interop.
New pdf_oxide_jni crate; CI java + fips-java jobs; release
build-java-native + package-java-jar + publish-maven
(autoPublish=false per the release gate). 52 JNI symbols, 9 classes,
82 JUnit tests.

OCR parity across all prebuilts

Published Python wheels (glibc + musl) and the Java JAR now build with
OCR — previously CI tested OCR but release.yml shipped wheels
without it. FIPS variants deliberately exclude OCR.

Markdown-extraction quality pass (root-cause fixes + TDD)

Every reading-order/table change was gated by a 70-PDF
baseline-vs-HEAD sweep:

- Table cells preserve bold/italic (table_extractor populates cell.spans).
- CamelCase brand names no longer split ("SalesForce", not "SalesF orce"): TJ-kerning misread repaired, ASCII signature only; all-caps/acronyms untouched.
- Spatial cell words no longer fragment into per-word columns (row-coverage filter, gated so it never fabricates tables).
- Centered titles read in order (XY-cut centered-block guard).
- Fewer fragmented headings; KPI numeric runs collapse; stray pipes escaped.
- Content-preservation policy: post-processing never drops/rewrites legitimate text (band-aids that did were removed after the sweep caught real-document damage).
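
To make the CamelCase item concrete, here is a minimal sketch of a lower→UPPER rejoin rule. It is not the code from this PR: the function names and the exact signature (left token ends in an ASCII lowercase letter plus one uppercase letter, right token starts with an ASCII lowercase letter) are assumptions for illustration, and the sparse-width span check mentioned in the changelog is not modeled at all.

```rust
/// True when `tok` ends in an ASCII lowercase letter followed by one
/// ASCII uppercase letter, e.g. "SalesF". All-caps tokens never match.
fn ends_lower_upper(tok: &str) -> bool {
    let b = tok.as_bytes();
    b.len() >= 2
        && b[b.len() - 2].is_ascii_lowercase()
        && b[b.len() - 1].is_ascii_uppercase()
}

/// True when `tok` starts with an ASCII lowercase letter, e.g. "orce".
fn starts_lower(tok: &str) -> bool {
    tok.as_bytes().first().map_or(false, |c| c.is_ascii_lowercase())
}

/// Hypothetical repair pass: glue a token onto the previous one when the
/// pair matches the assumed lower→UPPER split signature.
pub fn repair_camel_split(text: &str) -> String {
    let mut out: Vec<String> = Vec::new();
    for tok in text.split(' ') {
        let join = out
            .last()
            .map_or(false, |prev| ends_lower_upper(prev) && starts_lower(tok));
        if join {
            // Append to the previous token instead of starting a new word.
            if let Some(prev) = out.last_mut() {
                prev.push_str(tok);
            }
        } else {
            out.push(tok.to_string());
        }
    }
    out.join(" ")
}
```

Under these assumptions `repair_camel_split("SalesF orce")` yields `"SalesForce"`, while `"NASA orbit"` and `"the API docs"` pass through unchanged because neither left token has a lowercase letter before its final capital. A text-only rule like this would also glue legitimate pairs such as "planB works", which is presumably why the real fix is restricted to spans with a kerning signature.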

Review nits (this PR)

Doc accuracy (Cleaner backstop, arch counts), PageClass↔PageKind
parity, dead-code, stale Javadoc.

CI / Release hygiene

Composite free-disk-space action (fixes the Code Coverage OOM);
macOS FIPS Java deferred (documented).

Known issue

Tight two-column prose bodies can still interleave in reading
order (#534) — deferred; needs a table-vs-prose classifier. Two fix
attempts were reverted after the sweep caught table-data corruption.
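
The difficulty behind this known issue can be illustrated with a small sketch. XY-cut style segmentation generally splits a page at whitespace valleys in a projection of the block extents. The `XRange` type, the `widest_valley` function and the threshold values below are hypothetical and are not taken from xycut.rs; the sketch only shows why a gap-width threshold alone cannot tell a prose gutter from a table column gap.

```rust
/// Horizontal extent of one text block (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct XRange {
    pub x0: f32,
    pub x1: f32,
}

/// Find the widest horizontal gap not covered by any block.
/// Returns `(gap_start, gap_end)` when the gap is at least `min_gap` wide.
pub fn widest_valley(blocks: &[XRange], min_gap: f32) -> Option<(f32, f32)> {
    let mut sorted: Vec<XRange> = blocks.to_vec();
    sorted.sort_by(|a, b| a.x0.total_cmp(&b.x0));

    // Right edge of everything seen so far, sweeping left to right.
    let mut covered_to = match sorted.first() {
        Some(b) => b.x1,
        None => return None,
    };
    let mut best: Option<(f32, f32)> = None;
    for b in sorted.iter().skip(1) {
        if b.x0 > covered_to {
            let gap = (covered_to, b.x0);
            // Keep the first gap on ties (strictly wider replaces).
            let wider = best.map_or(true, |(s, e)| gap.1 - gap.0 > e - s);
            if wider {
                best = Some(gap);
            }
        }
        if b.x1 > covered_to {
            covered_to = b.x1;
        }
    }
    best.filter(|(s, e)| e - s >= min_gap)
}
```

With a 20-unit gutter between two prose columns and 20-unit gaps between three table columns, both layouts clear a 15-unit threshold and look the same to this test. That is consistent with the description above that a safe fix needs a table-vs-prose classifier rather than a tuned threshold.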

Test plan

- Full Rust lib suite green (5,386 pass / 0 fail / 2 ignored)
- spatial_table_detector + xycut + extractors::text suites green
- 70-PDF baseline(0.3.52)-vs-HEAD sweep: text/md/html diffs all classified improvement or pre-existing-flaky; no regressions
- Reporter issue PDFs validated on the HEAD build
- CI green on the squashed commit
- Maven Central upload reaches VALIDATED (maintainer flips Publish)

Copilot

Pull request overview

This PR prepares the v0.3.53 release by adding a new first-class Java/JNI binding (plus tests and CI integration) and bumping all published package versions across the workspace and other language artifacts.

Changes:

- Introduces the pdf_oxide_jni Rust workspace crate (cdylib) and a full Java API surface (fyi.oxide.pdf.*) with JUnit coverage.
- Adds/extends GitHub Actions jobs to build/test the Java binding (including a FIPS-mode build of the JNI shim).
- Bumps versions to 0.3.53 across Cargo workspace crates, Python, Node, and C# packaging, and updates the top-level README language list.

Reviewed changes

Copilot reviewed 114 out of 115 changed files in this pull request and generated 11 comments.


File	Description
README.md	Adds Java to the bindings list and provides Maven/Gradle install snippet.
pyproject.toml	Bumps Python package version to 0.3.53.
Cargo.toml	Adds `pdf_oxide_jni` to workspace members; bumps crate version to 0.3.53.
Cargo.lock	Locks new JNI-related dependencies and bumps workspace crate versions.
pdf_oxide_cli/Cargo.toml	Bumps CLI crate + dependency version to 0.3.53.
pdf_oxide_mcp/Cargo.toml	Bumps MCP crate + dependency version to 0.3.53.
js/package.json	Bumps Node package version to 0.3.53.
csharp/PdfOxide/PdfOxide.csproj	Bumps .NET package version to 0.3.53.
.github/workflows/ci.yml	Adds build/test flow for Java binding; adds `java-jni` build-lib variant.
.github/workflows/ci-fips.yml	Adds FIPS build/test job for Java JNI shim.
java/.gitignore	Ignores Maven output and staged native resource libs.
java/pom.xml	Maven build + test configuration for the Java binding artifact.
java/README.md	Java/Kotlin binding documentation and usage examples.
java/src/main/java/fyi/oxide/pdf/internal/NativeLoader.java	Loads embedded native libraries from the jar at runtime.
java/src/main/java/fyi/oxide/pdf/PdfDocument.java	Core Java document API (open/extract/render/search/etc.).
java/src/main/java/fyi/oxide/pdf/PdfPage.java	Per-page Java API delegating to document handle.
java/src/main/java/fyi/oxide/pdf/MarkdownConverter.java	Markdown/HTML conversion façade.
java/src/main/java/fyi/oxide/pdf/Pdf.java	PDF creation + split-by-bookmarks API.
java/src/main/java/fyi/oxide/pdf/DocumentEditor.java	Write/edit surface (forms/redaction/save).
java/src/main/java/fyi/oxide/pdf/AutoExtractor.java	Auto-extraction/classification surface and JSON escape hatch.
java/src/main/java/fyi/oxide/pdf/PdfSigner.java	Java signature API (PAdES).
java/src/main/java/fyi/oxide/pdf/PdfValidator.java	Java compliance validator façade (PDF/A, PDF/UA).
java/src/main/java/fyi/oxide/pdf/PdfPolicy.java	Java crypto policy façade (set-once).
java/src/main/java/fyi/oxide/pdf/annotation/Annotation.java	Annotation value type.
java/src/main/java/fyi/oxide/pdf/annotation/AnnotationType.java	Annotation subtype enum.
java/src/main/java/fyi/oxide/pdf/auto/AutoResult.java	Auto-extraction result value type.
java/src/main/java/fyi/oxide/pdf/auto/ClassifyResult.java	Auto-extractor classification result value type.
java/src/main/java/fyi/oxide/pdf/auto/ExtractMode.java	Auto-extractor mode enum.
java/src/main/java/fyi/oxide/pdf/auto/ExtractReason.java	Degradation reason enum.
java/src/main/java/fyi/oxide/pdf/auto/PageClass.java	Page classification enum for classify APIs.
java/src/main/java/fyi/oxide/pdf/auto/RegionResult.java	Per-region auto-extraction value type.
java/src/main/java/fyi/oxide/pdf/compliance/PdfALevel.java	PDF/A level enum.
java/src/main/java/fyi/oxide/pdf/compliance/PdfUaLevel.java	PDF/UA level enum.
java/src/main/java/fyi/oxide/pdf/compliance/PdfXLevel.java	PDF/X level enum (placeholder surface).
java/src/main/java/fyi/oxide/pdf/compliance/ValidationResult.java	Validation result value type.
java/src/main/java/fyi/oxide/pdf/compliance/ValidationViolation.java	Validation violation value type.
java/src/main/java/fyi/oxide/pdf/exception/PdfException.java	Base unchecked exception type.
java/src/main/java/fyi/oxide/pdf/exception/PdfErrorKind.java	Error-kind enum for dispatch.
java/src/main/java/fyi/oxide/pdf/exception/PdfEncryptedException.java	Typed exception subtype.
java/src/main/java/fyi/oxide/pdf/exception/PdfInvalidStateException.java	Typed exception subtype.
java/src/main/java/fyi/oxide/pdf/exception/PdfIoException.java	Typed exception subtype.
java/src/main/java/fyi/oxide/pdf/exception/PdfOcrUnavailableException.java	Typed exception subtype.
java/src/main/java/fyi/oxide/pdf/exception/PdfParseException.java	Typed exception subtype.
java/src/main/java/fyi/oxide/pdf/exception/PdfPermissionException.java	Typed exception subtype.
java/src/main/java/fyi/oxide/pdf/exception/PdfSignatureException.java	Typed exception subtype.
java/src/main/java/fyi/oxide/pdf/exception/PdfUnsupportedException.java	Typed exception subtype.
java/src/main/java/fyi/oxide/pdf/form/FormField.java	Form-field value type.
java/src/main/java/fyi/oxide/pdf/form/FormFieldType.java	Form-field type enum.
java/src/main/java/fyi/oxide/pdf/geometry/BBox.java	Geometry value type.
java/src/main/java/fyi/oxide/pdf/geometry/Color.java	Geometry value type.
java/src/main/java/fyi/oxide/pdf/geometry/Point.java	Geometry value type.
java/src/main/java/fyi/oxide/pdf/geometry/Rect.java	Geometry value type.
java/src/main/java/fyi/oxide/pdf/image/ExtractedImage.java	Extracted image value type.
java/src/main/java/fyi/oxide/pdf/image/ImageFormat.java	Image format enum.
java/src/main/java/fyi/oxide/pdf/metadata/DocumentInfo.java	Info-dict value type.
java/src/main/java/fyi/oxide/pdf/metadata/XmpMetadata.java	XMP metadata value type.
java/src/main/java/fyi/oxide/pdf/policy/PolicyMode.java	Policy-mode enum.
java/src/main/java/fyi/oxide/pdf/policy/SecurityPolicy.java	Policy value type + builder.
java/src/main/java/fyi/oxide/pdf/redaction/RedactResult.java	Redaction result value type.
java/src/main/java/fyi/oxide/pdf/render/PixelFormat.java	Rendering pixel format enum.
java/src/main/java/fyi/oxide/pdf/search/SearchMatch.java	Search match value type.
java/src/main/java/fyi/oxide/pdf/search/SearchOptions.java	Search options value type + builder.
java/src/main/java/fyi/oxide/pdf/search/SearchResult.java	Search result wrapper value type.
java/src/main/java/fyi/oxide/pdf/signature/SignOptions.java	Signing options value type + builder.
java/src/main/java/fyi/oxide/pdf/signature/SignatureLevel.java	PAdES level enum.
java/src/main/java/fyi/oxide/pdf/split/BookmarkSegment.java	Split segment metadata value type.
java/src/main/java/fyi/oxide/pdf/split/SplitByBookmarksOptions.java	Split-by-bookmarks options + builder.
java/src/main/java/fyi/oxide/pdf/table/Table.java	Table value type.
java/src/main/java/fyi/oxide/pdf/table/TableCell.java	Table cell value type.
java/src/main/java/fyi/oxide/pdf/text/TextChar.java	Text value type.
java/src/main/java/fyi/oxide/pdf/text/TextLine.java	Text value type.
java/src/main/java/fyi/oxide/pdf/text/TextSpan.java	Text value type.
java/src/main/java/fyi/oxide/pdf/text/TextStyle.java	Text value type.
java/src/main/java/fyi/oxide/pdf/text/TextWord.java	Text value type.
java/src/test/java/fyi/oxide/pdf/PdfDocumentTest.java	JUnit coverage for core document behaviors.
java/src/test/java/fyi/oxide/pdf/PdfPageTest.java	JUnit coverage for page APIs.
java/src/test/java/fyi/oxide/pdf/MarkdownConverterTest.java	JUnit coverage for markdown/html conversion.
java/src/test/java/fyi/oxide/pdf/PdfCreationTest.java	JUnit coverage for PDF creation APIs.
java/src/test/java/fyi/oxide/pdf/SplitTest.java	JUnit coverage for split-by-bookmarks.
java/src/test/java/fyi/oxide/pdf/RenderTest.java	JUnit coverage for rendering APIs.
java/src/test/java/fyi/oxide/pdf/DocumentEditorTest.java	JUnit coverage for editor APIs (save/redaction/forms).
java/src/test/java/fyi/oxide/pdf/PdfPolicyTest.java	JUnit coverage for set-once crypto policy.
java/src/test/java/fyi/oxide/pdf/PdfValidatorTest.java	JUnit coverage for compliance checks.
java/src/test/java/fyi/oxide/pdf/PdfSignerTest.java	JUnit coverage for signature classification behavior.
java/src/test/java/fyi/oxide/pdf/PdfSignerSignIntegrationTest.java	Integration tests for signing (incl. TSA-gated).
java/src/test/java/fyi/oxide/pdf/geometry/GeometryTest.java	Pure-Java tests for geometry types.
java/src/test/java/fyi/oxide/pdf/exception/ExceptionHierarchyTest.java	Pure-Java tests for exception taxonomy.
pdf_oxide_jni/Cargo.toml	New JNI shim crate definition + feature mirroring.
pdf_oxide_jni/README.md	JNI shim build/packaging documentation.
pdf_oxide_jni/src/lib.rs	JNI shim module layout + JNI_OnLoad/OnUnload.
pdf_oxide_jni/src/error.rs	Rust→Java exception mapping + throw helpers.
pdf_oxide_jni/src/pdf_document.rs	JNI entrypoints backing `PdfDocument`.
pdf_oxide_jni/src/pdf_page.rs	JNI entrypoints backing `PdfPage`.
pdf_oxide_jni/src/markdown.rs	JNI entrypoints for MarkdownConverter.
pdf_oxide_jni/src/search.rs	JNI entrypoints for text search.
pdf_oxide_jni/src/forms.rs	JNI entrypoints for form field extraction.
pdf_oxide_jni/src/annotations.rs	JNI entrypoints for annotations extraction.
pdf_oxide_jni/src/auto_extractor.rs	JNI entrypoints for classification + JSON extraction.
pdf_oxide_jni/src/pdf.rs	JNI entrypoints for PDF creation/fromImages + save/close.
pdf_oxide_jni/src/split.rs	JNI entrypoints for split-by-bookmarks.
pdf_oxide_jni/src/policy.rs	JNI entrypoints for crypto policy.
pdf_oxide_jni/src/validator.rs	JNI entrypoints for PDF/A + PDF/UA boolean checks.
pdf_oxide_jni/src/signatures_pades.rs	JNI entrypoints for signing/classification (PAdES).
pdf_oxide_jni/src/editor.rs	JNI entrypoints for `DocumentEditor`.
pdf_oxide_jni/src/render.rs	JNI entrypoints for rendering (feature-gated).
pdf_oxide_jni/src/text.rs	Placeholder module for future text marshalling consolidation.
pdf_oxide_jni/src/images.rs	Placeholder module for future image marshalling consolidation.
pdf_oxide_jni/src/metadata.rs	Placeholder module for future metadata marshalling consolidation.
pdf_oxide_jni/src/attachments.rs	Placeholder module for future attachments marshalling.
pdf_oxide_jni/src/dom.rs	Placeholder module for future DOM surface marshalling.
pdf_oxide_jni/src/compliance.rs	Placeholder module for future full ValidationResult marshalling.
pdf_oxide_jni/src/redaction.rs	Placeholder module for future redaction marshalling consolidation.


Address Copilot review nits on PR #533 that are safe to land pre-Maven-publish:

- pdf_oxide_jni/src/annotations.rs: remove dead `(x0, y0, x1, y1)` destructure left over from a borrow-checker workaround.
- java/PdfPage.java: drop stale "stubbed to UnsupportedOperation" Javadoc; nativeWords/Lines/Chars are implemented in pdf_oxide_jni since #519.
- java/auto/PageClass: trim CHART/ENCRYPTED enum values that the Rust PageKind never produces. Chart/encrypted-permission states already surface through ExtractReason.CHART_NOT_TRANSCRIBED and ENCRYPTED_NO_EXTRACT_PERMISSION. Lands now to avoid an ordinal-breaking change after Maven Central publish.
- pdf_oxide_jni/src/auto_extractor.rs: tighten the ordinal-mapping comment to match the trimmed Java enum.
- java/DocumentEditor + java/README "Lifecycle": stop claiming a Cleaner backstop exists for DocumentEditor / Pdf; only PdfDocument registers a Cleaner. Doc updated to require explicit close() for the other two.
- pdf_oxide_jni/README + .github/workflows/ci.yml: correct the "six arches" claim. The JAR bundles five (linux x86_64/aarch64, macOS x86_64/aarch64, windows x86_64); no musl native ships.
- .github/workflows/ci-fips.yml: refine the macos-latest deferral comment with what the previous CI run actually showed (dylib present at correct path, bare UnsatisfiedLinkError from System.load on JDK 11 macos-15 aarch64; likely an aws-lc-fips runtime symbol or Hardened-Runtime issue, not a path/extension bug). Tracks the follow-up investigation rather than a hardware-specific guess.
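
The PageClass item is about ordinal stability: Java resolves an enum constant from an integer by declaration order, so removing a constant after publication shifts every later ordinal. A minimal sketch of how the Rust side of such a mapping can be kept honest follows. The `PageKind` variants and the Java constant names here are placeholders, since the real lists are not shown on this page.

```rust
/// Hypothetical stand-in for the Rust-side page kind; the real
/// `PageKind` variants are not listed in this PR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PageKind {
    Text,
    Scanned,
    Mixed,
    Empty,
}

/// Java-side constant names in declaration order (placeholder names).
pub const JAVA_PAGE_CLASS: [&str; 4] = ["TEXT", "SCANNED", "MIXED", "EMPTY"];

/// Every Rust variant, in the order the Java enum is assumed to declare them.
pub const ALL_KINDS: [PageKind; 4] = [
    PageKind::Text,
    PageKind::Scanned,
    PageKind::Mixed,
    PageKind::Empty,
];

/// Ordinal handed across the JNI boundary. The match has no wildcard arm,
/// so adding a Rust variant fails to compile until it is mapped here.
pub fn to_java_ordinal(kind: PageKind) -> i32 {
    match kind {
        PageKind::Text => 0,
        PageKind::Scanned => 1,
        PageKind::Mixed => 2,
        PageKind::Empty => 3,
    }
}

/// Name of the Java constant that `kind` resolves to under this mapping.
pub fn java_name(kind: PageKind) -> &'static str {
    JAVA_PAGE_CLASS[to_java_ordinal(kind) as usize]
}
```

A parity test can then walk `ALL_KINDS` and assert that each ordinal equals its position and stays below `JAVA_PAGE_CLASS.len()`. With a one-to-one mapping like this there are no Java constants that Rust never produces, which is the state the trim above aims for.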

@main

Root cause of the v0.3.53 PR #533 Code Coverage job failure: the job called jlumbroso/free-disk-space with default settings, which include `swap-storage: true`, i.e. they remove the runner's 4 GB swapfile. cargo-llvm-cov's instrumented build (3× larger than the normal release build) was then OOM-killed mid-link, and the runner's "Generate coverage" step died with no completed status.

Every other free-disk-space callsite in the repo set `swap-storage: false` explicitly, with a comment warning about exactly this failure mode ("removing swap causes OOM-induced SIGBUS in the linker"). The Coverage callsite missed the override: copy-paste drift.

Fix holistically rather than patching one line:

- New composite action `.github/actions/free-disk-space/`:
  - `swap-storage: false` is locked in (not exposed as an input)
  - `aggressive` input controls large-packages (default true)
  - `tool-cache` input (default true); set to false in jobs that need setup-python / setup-node hits
  - `df -h` diagnostics before/after so the next disk-pressure regression is visible in the run log
  - jlumbroso/free-disk-space pinned to its current SHA (was @main)
- All 6 callsites converted to `uses: ./.github/actions/free-disk-space`:
  - ci.yml Test job (was: tool-cache:true, large-packages:true)
  - ci.yml Python wheel job (tool-cache: 'false')
  - ci.yml WASM Build (tool-cache: 'false')
  - ci.yml Code Coverage: the bug fix lands here (was missing swap-storage:false; composite now locks it in)
  - python.yml Python tests (defaults)
  - release.yml wheel build (tool-cache: 'false')

Net change: -42 lines of duplicated YAML, one source of truth for runner-disk policy, and the OOM lesson encoded in the action so future callsites can't drift back. `df -h` is now printed at every reclaim, so the next disk regression is debuggable from the run log without a local repro.

…n quality pass

Released 2026-05-22.

## Java is the 8th binding (fyi.oxide:pdf-oxide:0.3.53)

Native Maven-Central JNI binding on jni-rs 0.22, JDK 11 LTS floor, five-arch fat JAR (linux x86_64/aarch64, macOS x86_64/aarch64, windows x86_64). Full v0.3.52 surface parity across text / markdown / AutoExtractor / forms / render / PAdES B-B+B-T+B-LT / destructive redaction / split-by-bookmarks / compliance / crypto-policy. Free Kotlin interop via the same JAR. New `pdf_oxide_jni` workspace crate; CI `java` + `fips-java` jobs; release `build-java-native` + `package-java-jar` + `publish-maven` (autoPublish=false per the release gate). 52 JNI symbols, 9 wired classes, 82 JUnit tests.

## OCR parity across all prebuilts

The published Python wheels (glibc + musl) and the Java JAR now build with OCR. Previously CI tested `--features python,ocr,barcodes` but release.yml shipped `--features python`, so PyPI users got no OCR. Java JNI now builds the full ocr,rendering,signatures,barcodes,tsa-client,system-fonts set, matching the Node/Go/C# native cdylib. FIPS variants deliberately exclude OCR.

## Markdown-extraction quality pass

Root-cause fixes (with regression tests + a 70-PDF baseline-vs-HEAD sweep gating every reading-order/table change):

- Table cells preserve bold/italic: tagged-PDF table_extractor now populates cell.spans instead of joined-text-only.
- CamelCase brand names no longer split ("SalesForce", not "SalesF orce"): repairs TJ-kerning misread as a word space, ASCII lower→UPPER signature in sparse-width spans only; all-caps and acronyms untouched.
- Spatial cell words no longer fragment into per-word columns: row-coverage phantom-column filter, gated so it only refines an already-detected table and never fabricates one from prose.
- Centered titles read in document order: centered-block guard in XY-cut keeps small centered blocks single-column.
- Fewer fragmented headings (word-per-heading + wrapped); KPI numeric-only heading runs collapse to a list; stray pipes escaped.
- Content-preservation policy: post-processing never drops/rewrites legitimate text. Band-aids that filtered page-numbers, rewrote bullet codepoints, flattened sparse-real tables, or deduped repeated content were removed after the sweep proved they damaged real documents.

## Review nits (PR #533)

Doc accuracy (DocumentEditor/Pdf Cleaner backstop, arch counts), PageClass enum parity with Rust PageKind, annotations.rs dead-code, PdfPage stale Javadoc.

## CI / Release hygiene

- Composite action .github/actions/free-disk-space (single source of truth; swap-storage:false locked in; df -h diagnostics) replaces 6 drifted callsites; fixes the Code Coverage OOM on PR #533.
- macOS FIPS Java deferred (documented UnsatisfiedLinkError).

## Known issue

Tight two-column PROSE bodies can still interleave in reading order (#534). A safe fix needs a table-vs-prose classifier; two attempts (valley-threshold + structural detector) were reverted after the sweep caught table-data corruption. Both are documented inline in xycut.rs.
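
The row-coverage item in the commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not the filter from this PR: the grid representation (`Vec<Vec<String>>`), the `min_coverage` threshold, the merge-into-left-neighbor policy and all function names are hypothetical. The one property taken from the text is the gate: the pass only refines a grid that was already detected as a table.

```rust
/// Build one grid row from string slices (helper for examples).
pub fn row(cells: &[&str]) -> Vec<String> {
    cells.iter().map(|c| c.to_string()).collect()
}

/// Fraction of rows in which column `col` holds non-blank text.
fn column_coverage(rows: &[Vec<String>], col: usize) -> f32 {
    if rows.is_empty() {
        return 0.0;
    }
    let filled = rows
        .iter()
        .filter(|r| r.get(col).map_or(false, |c| !c.trim().is_empty()))
        .count();
    filled as f32 / rows.len() as f32
}

/// Fold low-coverage ("phantom") columns into the kept column on their left.
/// When `detected` is false the grid is returned untouched, so this pass
/// can never create a table out of prose.
pub fn merge_phantom_columns(
    rows: &[Vec<String>],
    detected: bool,
    min_coverage: f32,
) -> Vec<Vec<String>> {
    if !detected {
        return rows.to_vec();
    }
    let width = rows.iter().map(|r| r.len()).max().unwrap_or(0);
    // Column 0 is always kept so every phantom column has a left neighbor.
    let keep: Vec<bool> = (0..width)
        .map(|c| c == 0 || column_coverage(rows, c) >= min_coverage)
        .collect();

    rows.iter()
        .map(|r| {
            let mut out: Vec<String> = Vec::new();
            for c in 0..width {
                let cell = r.get(c).map(|s| s.trim()).unwrap_or("");
                if keep[c] {
                    out.push(cell.to_string());
                } else if !cell.is_empty() {
                    // Append the stray word to the previous kept cell.
                    if let Some(last) = out.last_mut() {
                        if !last.is_empty() {
                            last.push(' ');
                        }
                        last.push_str(cell);
                    }
                }
            }
            out
        })
        .collect()
}
```

For a four-row grid where only one row has text in the middle column ("Acme", "Corp", "10"), a threshold of 0.5 folds that column away and the row becomes ("Acme Corp", "10"), while the other rows simply lose their empty middle cell.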

Copilot

Pull request overview

Copilot reviewed 122 out of 123 changed files in this pull request and generated 5 comments.

Copilot

Pull request overview

Copilot reviewed 122 out of 123 changed files in this pull request and generated 4 comments.


Copilot

Pull request overview

Copilot reviewed 122 out of 123 changed files in this pull request and generated 3 comments.

- table_extractor: carry the inter-block space into the synthesized cell spans so the markdown/HTML table renderers (which reconstruct spacing from spans, not cell_text) don't glue tokens across wrapped lines; add a test asserting bold/italic + spacing propagation on the tagged-PDF MCID->TextBlock path.
- auto_extractor JNI: build the serde-error JSON fallback via serde_json so a failure message can't emit invalid JSON.
- split JNI: drop the unused jbyteArray import and the dead _UNUSED const that only kept it alive.
- MarkdownConverter: remove the unused PdfInvalidStateException import.
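
The auto_extractor item is easiest to see with a failing input. The sketch below is std-only so it stays self-contained: a small hand-written escaper stands in for serde_json, which is what the commit actually uses. The `error` field name and the function names are assumptions for illustration.

```rust
/// Escape `s` for use inside a JSON string literal (RFC 8259 rules:
/// quote, backslash and control characters below U+0020).
pub fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    for ch in s.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Fallback built by plain string interpolation: any quote or backslash
/// in `msg` leaks into the output and breaks the document.
pub fn naive_error_json(msg: &str) -> String {
    format!("{{\"error\":\"{}\"}}", msg)
}

/// Fallback built with escaping: the message is always a valid JSON string.
pub fn safe_error_json(msg: &str) -> String {
    format!("{{\"error\":\"{}\"}}", json_escape(msg))
}
```

Serializer error messages often quote the offending token, so a message like `bad "x"` turns the naive fallback into `{"error":"bad "x""}`, which no JSON parser accepts. In a crate that already depends on serde_json, `serde_json::json!({ "error": msg }).to_string()` should give the same guarantee without a hand-written escaper.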


yfedoseev requested a review from Copilot May 21, 2026 02:44

Copilot started reviewing on behalf of yfedoseev May 21, 2026 02:44

Copilot AI reviewed May 21, 2026


yfedoseev force-pushed the release/v0.3.53 branch 8 times, most recently from de3dfc3 to 5557d5a Compare May 21, 2026 07:10

yfedoseev force-pushed the release/v0.3.53 branch from 927ef33 to 685b74b Compare May 22, 2026 03:28

yfedoseev changed the title ~~release: v0.3.53 — Java is the 8th binding (jni-rs 0.22, JDK 11+, fyi.oxide:pdf-oxide)~~ release: v0.3.53 — Java binding (8th) + OCR parity + markdown-extraction quality pass May 22, 2026

yfedoseev requested a review from Copilot May 22, 2026 03:31

Copilot started reviewing on behalf of yfedoseev May 22, 2026 03:31

yfedoseev force-pushed the release/v0.3.53 branch from 685b74b to 44786d3 Compare May 22, 2026 03:33

Copilot AI reviewed May 22, 2026


yfedoseev requested a review from Copilot May 22, 2026 04:22

Copilot started reviewing on behalf of yfedoseev May 22, 2026 04:22

Copilot AI reviewed May 22, 2026


Comment thread src/structure/table_extractor.rs

Comment thread pdf_oxide_jni/src/auto_extractor.rs Outdated

Comment thread pdf_oxide_jni/src/auto_extractor.rs Outdated

Comment thread pdf_oxide_jni/src/render.rs

yfedoseev force-pushed the release/v0.3.53 branch from 44786d3 to ec50586 Compare May 22, 2026 04:42

yfedoseev force-pushed the release/v0.3.53 branch from ec50586 to 6d594cb Compare May 22, 2026 05:08

yfedoseev requested a review from Copilot May 22, 2026 05:47

Copilot started reviewing on behalf of yfedoseev May 22, 2026 05:47

Copilot AI reviewed May 22, 2026


Comment thread java/src/main/java/fyi/oxide/pdf/MarkdownConverter.java Outdated

Comment thread src/structure/table_extractor.rs

Comment thread pdf_oxide_jni/src/split.rs Outdated

yfedoseev force-pushed the release/v0.3.53 branch from 48e0051 to 34253e3 Compare May 23, 2026 00:36

yfedoseev merged commit cc77438 into main May 23, 2026
187 checks passed

yfedoseev deleted the release/v0.3.53 branch May 23, 2026 02:25
