release: v0.3.53 - Java binding (8th) + OCR parity + markdown-extraction quality pass #533
Merged
Pull request overview
This PR prepares the v0.3.53 release by adding a new first-class Java/JNI binding (plus tests and CI integration) and bumping all published package versions across the workspace and other language artifacts.
Changes:
- Introduces the `pdf_oxide_jni` Rust workspace crate (cdylib) and a full Java API surface (`fyi.oxide.pdf.*`) with JUnit coverage.
- Adds/extends GitHub Actions jobs to build/test the Java binding (including a FIPS-mode build of the JNI shim).
- Bumps versions to `0.3.53` across Cargo workspace crates, Python, Node, and C# packaging, and updates the top-level README language list.
Reviewed changes
Copilot reviewed 114 out of 115 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| README.md | Adds Java to the bindings list and provides Maven/Gradle install snippet. |
| pyproject.toml | Bumps Python package version to 0.3.53. |
| Cargo.toml | Adds pdf_oxide_jni to workspace members; bumps crate version to 0.3.53. |
| Cargo.lock | Locks new JNI-related dependencies and bumps workspace crate versions. |
| pdf_oxide_cli/Cargo.toml | Bumps CLI crate + dependency version to 0.3.53. |
| pdf_oxide_mcp/Cargo.toml | Bumps MCP crate + dependency version to 0.3.53. |
| js/package.json | Bumps Node package version to 0.3.53. |
| csharp/PdfOxide/PdfOxide.csproj | Bumps .NET package version to 0.3.53. |
| .github/workflows/ci.yml | Adds build/test flow for Java binding; adds java-jni build-lib variant. |
| .github/workflows/ci-fips.yml | Adds FIPS build/test job for Java JNI shim. |
| java/.gitignore | Ignores Maven output and staged native resource libs. |
| java/pom.xml | Maven build + test configuration for the Java binding artifact. |
| java/README.md | Java/Kotlin binding documentation and usage examples. |
| java/src/main/java/fyi/oxide/pdf/internal/NativeLoader.java | Loads embedded native libraries from the jar at runtime. |
| java/src/main/java/fyi/oxide/pdf/PdfDocument.java | Core Java document API (open/extract/render/search/etc.). |
| java/src/main/java/fyi/oxide/pdf/PdfPage.java | Per-page Java API delegating to document handle. |
| java/src/main/java/fyi/oxide/pdf/MarkdownConverter.java | Markdown/HTML conversion façade. |
| java/src/main/java/fyi/oxide/pdf/Pdf.java | PDF creation + split-by-bookmarks API. |
| java/src/main/java/fyi/oxide/pdf/DocumentEditor.java | Write/edit surface (forms/redaction/save). |
| java/src/main/java/fyi/oxide/pdf/AutoExtractor.java | Auto-extraction/classification surface and JSON escape hatch. |
| java/src/main/java/fyi/oxide/pdf/PdfSigner.java | Java signature API (PAdES). |
| java/src/main/java/fyi/oxide/pdf/PdfValidator.java | Java compliance validator façade (PDF/A, PDF/UA). |
| java/src/main/java/fyi/oxide/pdf/PdfPolicy.java | Java crypto policy façade (set-once). |
| java/src/main/java/fyi/oxide/pdf/annotation/Annotation.java | Annotation value type. |
| java/src/main/java/fyi/oxide/pdf/annotation/AnnotationType.java | Annotation subtype enum. |
| java/src/main/java/fyi/oxide/pdf/auto/AutoResult.java | Auto-extraction result value type. |
| java/src/main/java/fyi/oxide/pdf/auto/ClassifyResult.java | Auto-extractor classification result value type. |
| java/src/main/java/fyi/oxide/pdf/auto/ExtractMode.java | Auto-extractor mode enum. |
| java/src/main/java/fyi/oxide/pdf/auto/ExtractReason.java | Degradation reason enum. |
| java/src/main/java/fyi/oxide/pdf/auto/PageClass.java | Page classification enum for classify APIs. |
| java/src/main/java/fyi/oxide/pdf/auto/RegionResult.java | Per-region auto-extraction value type. |
| java/src/main/java/fyi/oxide/pdf/compliance/PdfALevel.java | PDF/A level enum. |
| java/src/main/java/fyi/oxide/pdf/compliance/PdfUaLevel.java | PDF/UA level enum. |
| java/src/main/java/fyi/oxide/pdf/compliance/PdfXLevel.java | PDF/X level enum (placeholder surface). |
| java/src/main/java/fyi/oxide/pdf/compliance/ValidationResult.java | Validation result value type. |
| java/src/main/java/fyi/oxide/pdf/compliance/ValidationViolation.java | Validation violation value type. |
| java/src/main/java/fyi/oxide/pdf/exception/PdfException.java | Base unchecked exception type. |
| java/src/main/java/fyi/oxide/pdf/exception/PdfErrorKind.java | Error-kind enum for dispatch. |
| java/src/main/java/fyi/oxide/pdf/exception/PdfEncryptedException.java | Typed exception subtype. |
| java/src/main/java/fyi/oxide/pdf/exception/PdfInvalidStateException.java | Typed exception subtype. |
| java/src/main/java/fyi/oxide/pdf/exception/PdfIoException.java | Typed exception subtype. |
| java/src/main/java/fyi/oxide/pdf/exception/PdfOcrUnavailableException.java | Typed exception subtype. |
| java/src/main/java/fyi/oxide/pdf/exception/PdfParseException.java | Typed exception subtype. |
| java/src/main/java/fyi/oxide/pdf/exception/PdfPermissionException.java | Typed exception subtype. |
| java/src/main/java/fyi/oxide/pdf/exception/PdfSignatureException.java | Typed exception subtype. |
| java/src/main/java/fyi/oxide/pdf/exception/PdfUnsupportedException.java | Typed exception subtype. |
| java/src/main/java/fyi/oxide/pdf/form/FormField.java | Form-field value type. |
| java/src/main/java/fyi/oxide/pdf/form/FormFieldType.java | Form-field type enum. |
| java/src/main/java/fyi/oxide/pdf/geometry/BBox.java | Geometry value type. |
| java/src/main/java/fyi/oxide/pdf/geometry/Color.java | Geometry value type. |
| java/src/main/java/fyi/oxide/pdf/geometry/Point.java | Geometry value type. |
| java/src/main/java/fyi/oxide/pdf/geometry/Rect.java | Geometry value type. |
| java/src/main/java/fyi/oxide/pdf/image/ExtractedImage.java | Extracted image value type. |
| java/src/main/java/fyi/oxide/pdf/image/ImageFormat.java | Image format enum. |
| java/src/main/java/fyi/oxide/pdf/metadata/DocumentInfo.java | Info-dict value type. |
| java/src/main/java/fyi/oxide/pdf/metadata/XmpMetadata.java | XMP metadata value type. |
| java/src/main/java/fyi/oxide/pdf/policy/PolicyMode.java | Policy-mode enum. |
| java/src/main/java/fyi/oxide/pdf/policy/SecurityPolicy.java | Policy value type + builder. |
| java/src/main/java/fyi/oxide/pdf/redaction/RedactResult.java | Redaction result value type. |
| java/src/main/java/fyi/oxide/pdf/render/PixelFormat.java | Rendering pixel format enum. |
| java/src/main/java/fyi/oxide/pdf/search/SearchMatch.java | Search match value type. |
| java/src/main/java/fyi/oxide/pdf/search/SearchOptions.java | Search options value type + builder. |
| java/src/main/java/fyi/oxide/pdf/search/SearchResult.java | Search result wrapper value type. |
| java/src/main/java/fyi/oxide/pdf/signature/SignOptions.java | Signing options value type + builder. |
| java/src/main/java/fyi/oxide/pdf/signature/SignatureLevel.java | PAdES level enum. |
| java/src/main/java/fyi/oxide/pdf/split/BookmarkSegment.java | Split segment metadata value type. |
| java/src/main/java/fyi/oxide/pdf/split/SplitByBookmarksOptions.java | Split-by-bookmarks options + builder. |
| java/src/main/java/fyi/oxide/pdf/table/Table.java | Table value type. |
| java/src/main/java/fyi/oxide/pdf/table/TableCell.java | Table cell value type. |
| java/src/main/java/fyi/oxide/pdf/text/TextChar.java | Text value type. |
| java/src/main/java/fyi/oxide/pdf/text/TextLine.java | Text value type. |
| java/src/main/java/fyi/oxide/pdf/text/TextSpan.java | Text value type. |
| java/src/main/java/fyi/oxide/pdf/text/TextStyle.java | Text value type. |
| java/src/main/java/fyi/oxide/pdf/text/TextWord.java | Text value type. |
| java/src/test/java/fyi/oxide/pdf/PdfDocumentTest.java | JUnit coverage for core document behaviors. |
| java/src/test/java/fyi/oxide/pdf/PdfPageTest.java | JUnit coverage for page APIs. |
| java/src/test/java/fyi/oxide/pdf/MarkdownConverterTest.java | JUnit coverage for markdown/html conversion. |
| java/src/test/java/fyi/oxide/pdf/PdfCreationTest.java | JUnit coverage for PDF creation APIs. |
| java/src/test/java/fyi/oxide/pdf/SplitTest.java | JUnit coverage for split-by-bookmarks. |
| java/src/test/java/fyi/oxide/pdf/RenderTest.java | JUnit coverage for rendering APIs. |
| java/src/test/java/fyi/oxide/pdf/DocumentEditorTest.java | JUnit coverage for editor APIs (save/redaction/forms). |
| java/src/test/java/fyi/oxide/pdf/PdfPolicyTest.java | JUnit coverage for set-once crypto policy. |
| java/src/test/java/fyi/oxide/pdf/PdfValidatorTest.java | JUnit coverage for compliance checks. |
| java/src/test/java/fyi/oxide/pdf/PdfSignerTest.java | JUnit coverage for signature classification behavior. |
| java/src/test/java/fyi/oxide/pdf/PdfSignerSignIntegrationTest.java | Integration tests for signing (incl. TSA-gated). |
| java/src/test/java/fyi/oxide/pdf/geometry/GeometryTest.java | Pure-Java tests for geometry types. |
| java/src/test/java/fyi/oxide/pdf/exception/ExceptionHierarchyTest.java | Pure-Java tests for exception taxonomy. |
| pdf_oxide_jni/Cargo.toml | New JNI shim crate definition + feature mirroring. |
| pdf_oxide_jni/README.md | JNI shim build/packaging documentation. |
| pdf_oxide_jni/src/lib.rs | JNI shim module layout + JNI_OnLoad/OnUnload. |
| pdf_oxide_jni/src/error.rs | Rust→Java exception mapping + throw helpers. |
| pdf_oxide_jni/src/pdf_document.rs | JNI entrypoints backing PdfDocument. |
| pdf_oxide_jni/src/pdf_page.rs | JNI entrypoints backing PdfPage. |
| pdf_oxide_jni/src/markdown.rs | JNI entrypoints for MarkdownConverter. |
| pdf_oxide_jni/src/search.rs | JNI entrypoints for text search. |
| pdf_oxide_jni/src/forms.rs | JNI entrypoints for form field extraction. |
| pdf_oxide_jni/src/annotations.rs | JNI entrypoints for annotations extraction. |
| pdf_oxide_jni/src/auto_extractor.rs | JNI entrypoints for classification + JSON extraction. |
| pdf_oxide_jni/src/pdf.rs | JNI entrypoints for PDF creation/fromImages + save/close. |
| pdf_oxide_jni/src/split.rs | JNI entrypoints for split-by-bookmarks. |
| pdf_oxide_jni/src/policy.rs | JNI entrypoints for crypto policy. |
| pdf_oxide_jni/src/validator.rs | JNI entrypoints for PDF/A + PDF/UA boolean checks. |
| pdf_oxide_jni/src/signatures_pades.rs | JNI entrypoints for signing/classification (PAdES). |
| pdf_oxide_jni/src/editor.rs | JNI entrypoints for DocumentEditor. |
| pdf_oxide_jni/src/render.rs | JNI entrypoints for rendering (feature-gated). |
| pdf_oxide_jni/src/text.rs | Placeholder module for future text marshalling consolidation. |
| pdf_oxide_jni/src/images.rs | Placeholder module for future image marshalling consolidation. |
| pdf_oxide_jni/src/metadata.rs | Placeholder module for future metadata marshalling consolidation. |
| pdf_oxide_jni/src/attachments.rs | Placeholder module for future attachments marshalling. |
| pdf_oxide_jni/src/dom.rs | Placeholder module for future DOM surface marshalling. |
| pdf_oxide_jni/src/compliance.rs | Placeholder module for future full ValidationResult marshalling. |
| pdf_oxide_jni/src/redaction.rs | Placeholder module for future redaction marshalling consolidation. |
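The exception rows in the table above describe a base unchecked `PdfException`, a `PdfErrorKind` enum "for dispatch", and typed subtypes, with `error.rs` mapping Rust errors to Java exceptions. A minimal sketch of how such a dispatch could look on the Java side follows. All class names, enum constants, and numeric codes here are hypothetical stand-ins, not the binding's actual API.

```java
/** Hypothetical sketch; the real fyi.oxide.pdf.exception types may differ. */
final class ErrorDispatchSketch {

    /** Assumed error kinds with assumed numeric codes sent by the native side. */
    enum ErrorKind {
        PARSE(1), ENCRYPTED(2), OTHER(0);

        final int code;

        ErrorKind(int code) { this.code = code; }

        /** Unknown codes fall back to OTHER instead of failing. */
        static ErrorKind fromCode(int code) {
            for (ErrorKind k : values()) {
                if (k.code == code) return k;
            }
            return OTHER;
        }
    }

    /** Base unchecked exception, playing the role of PdfException. */
    static class SketchPdfException extends RuntimeException {
        final ErrorKind kind;

        SketchPdfException(ErrorKind kind, String message) {
            super(message);
            this.kind = kind;
        }
    }

    static final class SketchParseException extends SketchPdfException {
        SketchParseException(String message) { super(ErrorKind.PARSE, message); }
    }

    static final class SketchEncryptedException extends SketchPdfException {
        SketchEncryptedException(String message) { super(ErrorKind.ENCRYPTED, message); }
    }

    /** Turns a native (code, message) pair into the most specific exception type. */
    static SketchPdfException fromNative(int code, String message) {
        ErrorKind kind = ErrorKind.fromCode(code);
        switch (kind) {
            case PARSE: return new SketchParseException(message);
            case ENCRYPTED: return new SketchEncryptedException(message);
            default: return new SketchPdfException(kind, message);
        }
    }

    public static void main(String[] args) {
        try {
            throw fromNative(2, "password required");
        } catch (SketchEncryptedException e) {
            System.out.println("typed catch: " + e.getMessage());
        }
    }
}
```

Callers can then catch a specific subtype, or catch the base type and branch on `kind`.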
de3dfc3 to 5557d5a
yfedoseev added a commit that referenced this pull request on May 21, 2026
Address Copilot review nits on PR #533 that are safe to land pre-Maven-publish:
- pdf_oxide_jni/src/annotations.rs: remove dead `(x0, y0, x1, y1)` destructure left over from a borrow-checker workaround.
- java/PdfPage.java: drop stale "stubbed to UnsupportedOperation" Javadoc; nativeWords/Lines/Chars are implemented in pdf_oxide_jni since #519.
- java/auto/PageClass: trim CHART/ENCRYPTED enum values that the Rust PageKind never produces. Chart/encrypted-permission states already surface through ExtractReason.CHART_NOT_TRANSCRIBED and ENCRYPTED_NO_EXTRACT_PERMISSION. Lands now to avoid an ordinal-breaking change after Maven Central publish.
- pdf_oxide_jni/src/auto_extractor.rs: tighten the ordinal-mapping comment to match the trimmed Java enum.
- java/DocumentEditor + java/README "Lifecycle": stop claiming a Cleaner backstop exists for DocumentEditor / Pdf; only PdfDocument registers a Cleaner. Doc updated to require explicit close() for the other two.
- pdf_oxide_jni/README + .github/workflows/ci.yml: correct the "six arches" claim; the JAR bundles five (linux x86_64/aarch64, macOS x86_64/aarch64, windows x86_64); no musl native ships.
- .github/workflows/ci-fips.yml: refine the macos-latest deferral comment with what the previous CI run actually showed (dylib present at correct path, bare UnsatisfiedLinkError from System.load on JDK 11 macos-15 aarch64; likely an aws-lc-fips runtime symbol or Hardened-Runtime issue, not a path/extension bug). Tracks the follow-up investigation rather than a hardware-specific guess.
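The "Lifecycle" correction above distinguishes a handle that has a `Cleaner` backstop from one that is released only by an explicit `close()`. The sketch below illustrates that difference with `java.lang.ref.Cleaner` (JDK 9+). The class names are hypothetical, and the "native handle" is simulated with a flag rather than a real JNI pointer.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch; not the binding's actual PdfDocument / DocumentEditor code. */
final class LifecycleSketch {
    private static final Cleaner CLEANER = Cleaner.create();

    /** Cleanup state. It must not reference its owner, or the owner never becomes unreachable. */
    static final class NativeHandle implements Runnable {
        private final AtomicBoolean released = new AtomicBoolean(false);

        @Override
        public void run() {
            // A real binding would free the native pointer here (assumed).
            released.set(true);
        }

        boolean isReleased() { return released.get(); }
    }

    /** Cleaner-backed: close() is preferred, but a forgotten instance is released after GC. */
    static final class BackstoppedResource implements AutoCloseable {
        final NativeHandle handle = new NativeHandle();
        private final Cleaner.Cleanable cleanable = CLEANER.register(this, handle);

        @Override
        public void close() {
            cleanable.clean(); // runs the cleanup action at most once
        }
    }

    /** No backstop: the handle is released only when close() is called. */
    static final class ExplicitOnlyResource implements AutoCloseable {
        final NativeHandle handle = new NativeHandle();

        @Override
        public void close() {
            handle.run();
        }
    }

    public static void main(String[] args) {
        NativeHandle h;
        try (ExplicitOnlyResource editor = new ExplicitOnlyResource()) {
            h = editor.handle;
        }
        System.out.println("released after try-with-resources: " + h.isReleased());
    }
}
```

For the explicit-only case, try-with-resources is the simplest way to guarantee the release even when an exception is thrown in the body.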
yfedoseev added a commit that referenced this pull request on May 21, 2026
Root-cause of the v0.3.53 PR #533 Code Coverage job failure: the job called jlumbroso/free-disk-space with default settings, which include `swap-storage: true`, i.e. they remove the runner's 4 GB swapfile. cargo-llvm-cov's instrumented build (3× larger than the normal release build) was then OOM-killed mid-link, and the runner's "Generate coverage" step died with no completed status.
Every other free-disk-space callsite in the repo set `swap-storage: false` explicitly, with a comment warning about exactly this failure mode ("removing swap causes OOM-induced SIGBUS in the linker"). The Coverage callsite missed the override through copy-paste drift.
Fix holistically rather than patching one line:
- New composite action `.github/actions/free-disk-space/`:
  - `swap-storage: false` is locked in (not exposed as an input)
  - `aggressive` input controls large-packages (default true)
  - `tool-cache` input (default true); set to false in jobs that need setup-python / setup-node hits
  - `df -h` diagnostics before/after so the next disk-pressure regression is visible in the run log
  - jlumbroso/free-disk-space pinned to its current SHA (was @main)
- All 6 callsites converted to `uses: ./.github/actions/free-disk-space`:
  - ci.yml Test job (was: tool-cache:true, large-packages:true)
  - ci.yml Python wheel job (tool-cache: 'false')
  - ci.yml WASM Build (tool-cache: 'false')
  - ci.yml Code Coverage: bug fix lands here (was missing swap-storage:false; composite now locks it in)
  - python.yml Python tests (defaults)
  - release.yml wheel build (tool-cache: 'false')
Net change: -42 lines of duplicated YAML, one source of truth for runner-disk policy, and the OOM lesson encoded in the action so future callsites can't drift back. `df -h` is now printed at every reclaim so the next disk regression is debuggable in the run log without needing a local repro.
yfedoseev added a commit that referenced this pull request on May 22, 2026
…n quality pass
Released 2026-05-22.
## Java is the 8th binding (fyi.oxide:pdf-oxide:0.3.53)
Native Maven-Central JNI binding on jni-rs 0.22, JDK 11 LTS floor,
five-arch fat JAR (linux x86_64/aarch64, macOS x86_64/aarch64,
windows x86_64). Full v0.3.52 surface parity across text / markdown /
AutoExtractor / forms / render / PAdES B-B+B-T+B-LT / destructive
redaction / split-by-bookmarks / compliance / crypto-policy. Free
Kotlin interop via the same JAR. New `pdf_oxide_jni` workspace crate;
CI `java` + `fips-java` jobs; release `build-java-native` +
`package-java-jar` + `publish-maven` (autoPublish=false per the
release gate). 52 JNI symbols, 9 wired classes, 82 JUnit tests.
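A fat JAR that bundles five native targets has to pick the right one at load time. The sketch below shows one way to map the standard `os.name` / `os.arch` system properties onto the five targets listed above. The `<os>-<arch>` classifier strings and the class name are assumptions for illustration; the binding's real `NativeLoader` layout is not shown in this PR text.

```java
import java.util.Locale;

/** Hypothetical sketch of target selection for a five-arch fat JAR. */
final class NativeTargetSketch {

    /** Maps os.name / os.arch values onto one of the five bundled targets. */
    static String classifier(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        String normArch;
        if (arch.equals("amd64") || arch.equals("x86_64")) {
            normArch = "x86_64";
        } else if (arch.equals("aarch64") || arch.equals("arm64")) {
            normArch = "aarch64";
        } else {
            throw new UnsupportedOperationException("unsupported arch: " + osArch);
        }

        String normOs;
        if (os.startsWith("linux")) {
            normOs = "linux";
        } else if (os.startsWith("mac")) {
            normOs = "macos";
        } else if (os.startsWith("windows")) {
            normOs = "windows";
        } else {
            throw new UnsupportedOperationException("unsupported OS: " + osName);
        }

        // Only x86_64 is listed for Windows among the five bundled targets.
        if (normOs.equals("windows") && !normArch.equals("x86_64")) {
            throw new UnsupportedOperationException("no bundled native for windows-" + normArch);
        }
        return normOs + "-" + normArch;
    }

    public static void main(String[] args) {
        System.out.println(classifier(System.getProperty("os.name"), System.getProperty("os.arch")));
    }
}
```

System properties alone cannot tell glibc from musl, so a loader along these lines would still pick the glibc Linux native on a musl system, for which no native ships.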
## OCR parity across all prebuilts
The published Python wheels (glibc + musl) and the Java JAR now build
with OCR — previously CI tested `--features python,ocr,barcodes` but
release.yml shipped `--features python`, so PyPI users got no OCR.
Java JNI now builds the full ocr,rendering,signatures,barcodes,
tsa-client,system-fonts set, matching the Node/Go/C# native cdylib.
FIPS variants deliberately exclude OCR.
## Markdown-extraction quality pass
Root-cause fixes (with regression tests + a 70-PDF baseline-vs-HEAD
sweep gating every reading-order/table change):
- Table cells preserve bold/italic — tagged-PDF table_extractor now
populates cell.spans instead of joined-text-only.
- CamelCase brand names no longer split ("SalesForce", not
"SalesF orce") — repairs TJ-kerning misread as a word space, ASCII
lower→UPPER signature in sparse-width spans only; all-caps and
acronyms untouched.
- Spatial cell words no longer fragment into per-word columns —
row-coverage phantom-column filter, gated so it only refines an
already-detected table and never fabricates one from prose.
- Centered titles read in document order — centered-block guard in
XY-cut keeps small centered blocks single-column.
- Fewer fragmented headings (word-per-heading + wrapped); KPI
numeric-only heading runs collapse to a list; stray pipes escaped.
- Content-preservation policy: post-processing never drops/rewrites
legitimate text. Band-aids that filtered page-numbers, rewrote
bullet codepoints, flattened sparse-real tables, or deduped
repeated content were removed after the sweep proved they damaged
real documents.
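The CamelCase repair in the list above can be pictured as a narrow token-join rule. The sketch below is one reading of the "ASCII lower→UPPER signature": join two tokens only when the left one ends in a lowercase letter followed by a single uppercase letter and the right one starts lowercase. That reading, the `sparseWidthSpan` flag standing in for the sparse-width gate, and all names are assumptions; the actual Rust implementation is not shown in this PR text.

```java
/** Hypothetical sketch of a gated CamelCase re-join; not the library's code. */
final class CamelCaseRepairSketch {

    static boolean isLower(char c) { return c >= 'a' && c <= 'z'; }

    static boolean isUpper(char c) { return c >= 'A' && c <= 'Z'; }

    /** True when left ends in lower-then-UPPER and right starts with a lowercase letter. */
    static boolean looksLikeSplitBrand(String left, String right) {
        int n = left.length();
        if (n < 2 || right.isEmpty()) return false;
        return isLower(left.charAt(n - 2))
                && isUpper(left.charAt(n - 1))
                && isLower(right.charAt(0));
    }

    /** Re-joins split brand names, but only inside spans flagged as sparse-width. */
    static String repair(String text, boolean sparseWidthSpan) {
        if (!sparseWidthSpan) return text; // gate: leave ordinary spans untouched
        StringBuilder out = new StringBuilder();
        String prev = null;
        for (String tok : text.split(" ", -1)) {
            if (prev != null && looksLikeSplitBrand(prev, tok)) {
                out.append(tok);   // drop the spurious space
                prev = prev + tok;
            } else {
                if (prev != null) out.append(' ');
                out.append(tok);
                prev = tok;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(repair("SalesF orce CRM", true));
        System.out.println(repair("IBM cloud", true));
    }
}
```

All-caps tokens such as "IBM" never match because no lowercase letter precedes the final capital. The rule can still misfire on ordinary text (for example a word ending in a capital followed by a lowercase word), which is why a gate like the sparse-width check matters.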
## Review nits (PR #533)
Doc accuracy (DocumentEditor/Pdf Cleaner backstop, arch counts),
PageClass enum parity with Rust PageKind, annotations.rs dead-code,
PdfPage stale Javadoc.
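The PageClass parity fix matters because a mapping by ordinal depends on the position of each enum constant. The sketch below shows how removing a constant shifts every later ordinal. The `Before` and `After` enums are hypothetical and do not list the real PageClass values.

```java
/** Hypothetical sketch of why trimming enum constants is ordinal-breaking. */
final class OrdinalMappingSketch {

    /** Assumed shape before trimming (constants are illustrative). */
    enum Before { TEXT, CHART, SCANNED }

    /** Assumed shape after trimming the constant the native side never produces. */
    enum After { TEXT, SCANNED }

    /** Ordinal mapping: the native side sends an index, Java looks it up by position. */
    static <E extends Enum<E>> E fromOrdinal(Class<E> type, int ordinal) {
        E[] all = type.getEnumConstants();
        if (ordinal < 0 || ordinal >= all.length) {
            throw new IllegalArgumentException("unknown ordinal " + ordinal + " for " + type.getSimpleName());
        }
        return all[ordinal];
    }

    public static void main(String[] args) {
        // The same index resolves to different constants before and after the trim.
        System.out.println(fromOrdinal(Before.class, 1)); // CHART
        System.out.println(fromOrdinal(After.class, 1));  // SCANNED
    }
}
```

Because both sides must agree on the positions, changing the enum after a public release would silently remap values for anyone on an older JAR, which is the reason given for landing the trim before the Maven Central publish.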
## CI / Release hygiene
- Composite action .github/actions/free-disk-space (single source of
truth; swap-storage:false locked in; df -h diagnostics) replaces 6
drifted callsites — fixes the Code Coverage OOM on PR #533.
- macOS FIPS Java deferred (documented UnsatisfiedLinkError).
## Known issue
Tight two-column PROSE bodies can still interleave in reading order
(#534). A safe fix needs a table-vs-prose classifier; two attempts
(valley-threshold + structural detector) were reverted after the
sweep caught table-data corruption — both documented inline in
xycut.rs.
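To make the known issue concrete, here is a generic sketch of the valley test that XY-cut style layout analysis relies on: project block extents onto the x-axis and look for the widest uncovered gap. It is not the code in xycut.rs; the names, the interval representation, and the `minGap` threshold are assumptions.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch of a vertical-valley search over block x-extents. */
final class ValleySketch {

    /**
     * Each interval is {x0, x1}. Returns the midpoint of the widest uncovered gap
     * that is at least minGap wide, or NaN when no such gap exists.
     */
    static double widestValley(double[][] xIntervals, double minGap) {
        double[][] sorted = xIntervals.clone();
        Arrays.sort(sorted, Comparator.comparingDouble((double[] iv) -> iv[0]));

        double bestGap = 0.0;
        double bestMid = Double.NaN;
        double covered = sorted.length == 0 ? 0.0 : sorted[0][1]; // rightmost edge seen so far
        for (int i = 1; i < sorted.length; i++) {
            double gap = sorted[i][0] - covered;
            if (gap >= minGap && gap > bestGap) {
                bestGap = gap;
                bestMid = covered + gap / 2.0;
            }
            covered = Math.max(covered, sorted[i][1]);
        }
        return bestMid;
    }

    public static void main(String[] args) {
        double[][] blocks = { {0, 100}, {120, 220}, {0, 95}, {125, 220} };
        System.out.println(widestValley(blocks, 10)); // gutter found
        System.out.println(widestValley(blocks, 30)); // gutter narrower than threshold
    }
}
```

With a 20-unit gutter, a threshold of 10 finds the cut while a threshold of 30 misses it. A gap of the same width can plausibly separate two table columns as well, so the threshold alone cannot say which case it is looking at, consistent with the note above that a safe fix needs a table-vs-prose classifier.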
927ef33 to 685b74b
685b74b to 44786d3
44786d3 to ec50586
ec50586 to 6d594cb
yfedoseev added a commit that referenced this pull request on May 22, 2026
- table_extractor: carry the inter-block space into the synthesized cell spans so the markdown/HTML table renderers (which reconstruct spacing from spans, not cell_text) don't glue tokens across wrapped lines; add a test asserting bold/italic + spacing propagation on the tagged-PDF MCID->TextBlock path.
- auto_extractor JNI: build the serde-error JSON fallback via serde_json so a failure message can't emit invalid JSON.
- split JNI: drop the unused jbyteArray import and the dead _UNUSED const that only kept it alive.
- MarkdownConverter: remove the unused PdfInvalidStateException import.
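The auto_extractor item above is about a common failure mode: an error message pasted into a JSON string by concatenation breaks the document as soon as it contains a quote or newline. The commit fixes this in Rust with serde_json; the sketch below shows the same idea in Java with a small hand-written escaper, since the JDK has no built-in JSON writer. The `error` field name and all class and method names are assumptions.

```java
/** Hypothetical sketch: escaping an error message before embedding it in JSON. */
final class JsonErrorSketch {

    /** Quotes a string as a JSON string literal, escaping quotes, backslashes and control chars. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** Naive concatenation: invalid JSON whenever the message contains a quote. */
    static String naive(String message) {
        return "{\"error\":\"" + message + "\"}";
    }

    /** Escaped variant: stays valid for any message content. */
    static String safe(String message) {
        return "{\"error\":" + quote(message) + "}";
    }

    public static void main(String[] args) {
        String msg = "expected \"}\" at line 1";
        System.out.println(naive(msg));
        System.out.println(safe(msg));
    }
}
```

In production code a real JSON library is the better choice, which is what the commit does on the Rust side.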
48e0051 to 34253e3
[0.3.53] — 2026-05-22
Java binding (`fyi.oxide:pdf-oxide:0.3.53`)
Native JNI binding on jni-rs 0.22, JDK 11 LTS floor, five-arch fat JAR
(linux x86_64/aarch64, macOS x86_64/aarch64, windows x86_64). Full
v0.3.52 surface parity across text / markdown / AutoExtractor / forms
/ render / PAdES B-B+B-T+B-LT / destructive redaction /
split-by-bookmarks / compliance / crypto-policy. Free Kotlin interop.
New `pdf_oxide_jni` crate; CI `java` + `fips-java` jobs; release
`build-java-native` + `package-java-jar` + `publish-maven`
(autoPublish=false per the release gate). 52 JNI symbols, 9 classes,
82 JUnit tests.
OCR parity across all prebuilts
Published Python wheels (glibc + musl) and the Java JAR now build with
OCR; previously CI tested OCR but `release.yml` shipped wheels
without it. FIPS variants deliberately exclude OCR.
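Markdown-extraction quality pass (root-cause fixes + TDD)
Every reading-order/table change was gated by a 70-PDF
baseline-vs-HEAD sweep:
- Table cells preserve bold/italic (tagged-PDF `table_extractor` populates `cell.spans`).
- CamelCase brand names no longer split ("SalesForce", not "SalesF orce"): TJ-kerning misread repaired, ASCII signature only; all-caps/acronyms untouched.
- Spatial cell words no longer fragment into per-word columns (row-coverage filter, gated so it never fabricates tables).
- Centered titles read in document order; fewer fragmented headings; stray pipes escaped.
- Content-preservation policy: post-processing never drops/rewrites legitimate text (band-aids that did were removed after the sweep caught real-document damage).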
Review nits (this PR)
Doc accuracy (Cleaner backstop, arch counts), PageClass↔PageKind
parity, dead-code, stale Javadoc.
CI / Release hygiene
Composite `free-disk-space` action (fixes the Code Coverage OOM);
macOS FIPS Java deferred (documented).
Known issue
Tight two-column prose bodies can still interleave in reading
order (#534) — deferred; needs a table-vs-prose classifier. Two fix
attempts were reverted after the sweep caught table-data corruption.
Test plan
- `spatial_table_detector` + `xycut` + `extractors::text` suites green
- … classified improvement or pre-existing-flaky; no regressions