Decode Kindle FONT container records for AZW3/MOBI font extraction#21

Open
Imaclean74 wants to merge 1 commit into
zacharydenton:masterfrom
Imaclean74:feat/azw3-mobi-font-extraction
@Imaclean74

Contributor

Closes #20.

Summary

AZW3 and MOBI ebooks can embed fonts wrapped in a Kindle-specific FONT
FourCC container with optional XOR obfuscation (first 1040 bytes masked
with a per-record key) and optional zlib compression. The importer
previously classified FONT records as metadata and skipped them, so
embedded fonts never reached the asset list and books with custom
typography rendered with fallback system fonts.

Surveyed our 864-book AZW3/MOBI test corpus: 109 books carry FONT
records, 2,121 in total. With this patch applied, the importer surfaces
them as fonts/font_NNNN.otf assets and load_image_record returns the
unwrapped font bytes.

Related work

#13 added KFX font extraction by surfacing bcRawFont entities as
fonts/font_NNNN.otf assets. KFX and AZW3/MOBI use different on-disk
mechanisms for embedded fonts (Ion entity table vs. FONT FourCC
container), so this PR is the AZW3/MOBI counterpart to that earlier
work — it does not duplicate or overlap with KFX font handling.

Changes

  • src/mobi/parser.rs:
    • Add decode_font_record — parses the 24-byte header, reverses the
      XOR mask over the first 1040 bytes when bit 1 is set, then zlib-
      decompresses when bit 0 is set.
    • Recognise the FONT magic in detect_font_type (defaults to the
      "otf" extension).
    • Drop FONT from is_metadata_record.
  • src/mobi/mod.rs: re-export decode_font_record.
  • src/import/azw3.rs and src/import/mobi.rs:
    • Detect font records in discover_assets (plus the standalone
      discover_assets_from_source in mobi.rs) and emit
      fonts/font_NNNN.<ext> paths.
    • Accept the fonts/font_NNNN.ext prefix in load_asset (image and
      font records share the same index space; the prefix just selects
      naming).
    • Dispatch to decode_font_record in load_image_record when the
      record starts with FONT.
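
The header parse and XOR step described above can be sketched roughly as follows. This is not the code from this PR: the field layout (magic, uncompressed size, flags, data offset, XOR key length, XOR key offset, each a big-endian u32) is an assumption based on how community tools commonly describe the FONT container, and the names `FontHeader`, `parse_font_header` and `unmask_font_payload` are hypothetical. The zlib step is left out so the sketch needs only the standard library.

```rust
// ASSUMPTION: six big-endian u32 fields make up the 24-byte header.
const HEADER_LEN: usize = 24;
const FLAG_ZLIB: u32 = 0b01; // bit 0: payload is zlib-compressed
const FLAG_XOR: u32 = 0b10; // bit 1: leading payload bytes are XOR-masked
const XOR_EXTENT: usize = 1040; // only the first 1040 bytes are masked

#[derive(Debug)]
struct FontHeader {
    uncompressed_size: u32,
    flags: u32,
    data_offset: usize,
    xor_key_len: usize,
    xor_key_offset: usize,
}

impl FontHeader {
    /// True when the caller still has to zlib-inflate the payload.
    fn is_zlib_compressed(&self) -> bool {
        self.flags & FLAG_ZLIB != 0
    }
}

fn read_u32_be(buf: &[u8], at: usize) -> u32 {
    u32::from_be_bytes([buf[at], buf[at + 1], buf[at + 2], buf[at + 3]])
}

/// Validates the magic and bounds, then reads the header fields.
fn parse_font_header(record: &[u8]) -> Result<FontHeader, String> {
    if !record.starts_with(b"FONT") {
        return Err("missing FONT magic".into());
    }
    if record.len() < HEADER_LEN {
        return Err("record shorter than 24-byte header".into());
    }
    let header = FontHeader {
        uncompressed_size: read_u32_be(record, 4),
        flags: read_u32_be(record, 8),
        data_offset: read_u32_be(record, 12) as usize,
        xor_key_len: read_u32_be(record, 16) as usize,
        xor_key_offset: read_u32_be(record, 20) as usize,
    };
    if header.data_offset > record.len() {
        return Err("data offset beyond record".into());
    }
    Ok(header)
}

/// Returns the header and the payload with the XOR mask reversed.
/// Inflating a zlib-compressed payload is NOT done here.
fn unmask_font_payload(record: &[u8]) -> Result<(FontHeader, Vec<u8>), String> {
    let header = parse_font_header(record)?;
    let mut payload = record[header.data_offset..].to_vec();
    if header.flags & FLAG_XOR != 0 {
        let key_end = header
            .xor_key_offset
            .checked_add(header.xor_key_len)
            .filter(|&end| end <= record.len())
            .ok_or("XOR key out of range")?;
        let key = &record[header.xor_key_offset..key_end];
        if key.is_empty() {
            return Err("empty XOR key".into());
        }
        let extent = payload.len().min(XOR_EXTENT);
        for (i, byte) in payload[..extent].iter_mut().enumerate() {
            *byte ^= key[i % key.len()];
        }
    }
    Ok((header, payload))
}
```

Ordering matters here: the mask is removed first, and only then would the result be handed to a zlib decoder when `is_zlib_compressed()` is true, matching the XOR-unmask then zlib-inflate order described in this PR.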

Tests

Seven new unit tests in mobi::parser::tests:

  • test_decode_font_record_uncompressed_plain — bare payload, no flags
  • test_decode_font_record_zlib_compressed — zlib only
  • test_decode_font_record_xor_obfuscated — XOR mask only
  • test_decode_font_record_xor_and_zlib — both flags (real-world case)
  • test_decode_font_record_rejects_wrong_magic
  • test_decode_font_record_rejects_truncated
  • test_decode_font_record_rejects_offset_beyond_record

Existing test_detect_font_type and test_is_metadata_record updated for
the new FONT classification.

Verification

```
cargo fmt -- --check
cargo clippy --lib
cargo test --lib # 555 passed
```

End-to-end check on a real AZW3 with one embedded font: the exported EPUB
contains OEBPS/fonts/font_0000.otf whose first bytes are the TrueType
magic 00 01 00 00 followed by the expected font tables — confirming the
full decode pipeline (XOR-unmask → zlib-inflate → write-to-EPUB) works.
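
A check like the one above can be automated by sniffing the leading signature of the decoded bytes. The sketch below is illustrative only and is not part of this PR; `sniff_font_extension` is a hypothetical helper. The signatures themselves are the standard ones: `00 01 00 00` or `true` for TrueType outlines, `OTTO` for CFF-flavoured OpenType, and `wOFF` / `wOF2` for WOFF and WOFF2.

```rust
/// Guesses a file extension from the first bytes of a decoded font.
/// Returns None when the signature is not recognised.
fn sniff_font_extension(bytes: &[u8]) -> Option<&'static str> {
    if bytes.starts_with(&[0x00, 0x01, 0x00, 0x00]) || bytes.starts_with(b"true") {
        Some("ttf") // TrueType outlines
    } else if bytes.starts_with(b"OTTO") {
        Some("otf") // OpenType with CFF outlines
    } else if bytes.starts_with(b"wOFF") {
        Some("woff")
    } else if bytes.starts_with(b"wOF2") {
        Some("woff2")
    } else {
        None
    }
}
```

Because the signature is only visible after decoding, a helper like this could run on the output of the decode step rather than on the raw FONT record.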

References

AZW3 and MOBI ebooks can embed fonts wrapped in a Kindle-specific
`FONT` FourCC container with optional XOR obfuscation (first 1040 bytes
masked with a per-record key) and optional zlib compression. Previously
the importer classified `FONT` records as metadata and skipped them
entirely, so embedded fonts never reached the asset list and books with
custom typography rendered with fallback system fonts.

This patch:
- Adds `decode_font_record` to parse the `FONT` container header,
  reverse the XOR mask, and zlib-decompress the payload, returning the
  raw font bytes (typically OTF / TTF / WOFF).
- Recognises the `FONT` magic in `detect_font_type` (defaults to the
  `.otf` extension; the actual format is known only after decoding, but
  e-readers identify fonts via the `@font-face src:` MIME, not the
  filename extension).
- Removes `FONT` from `is_metadata_record` so the records flow
  through `discover_assets`.
- Updates AZW3 and MOBI `discover_assets` (plus the standalone
  `discover_assets_from_source`) to emit `fonts/font_NNNN.<ext>`
  paths when a record sniffs as a font.
- Extends AZW3 and MOBI `load_asset` to accept the
  `fonts/font_NNNN.ext` prefix (image and font records share the
  same index space, so the prefix just selects naming).
- Updates AZW3 and MOBI `load_image_record` to dispatch to
  `decode_font_record` when the record starts with `FONT`.
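
The path convention in the list above can be illustrated with a small round-trip sketch. The helper names `font_asset_path` and `parse_font_asset_path` are hypothetical and do not appear in this patch; the four-digit zero padding is inferred from the `font_0000.otf` example, and how the real `load_asset` maps the number back to a record is not shown here.

```rust
/// Builds the asset path for the font record at `index`,
/// e.g. index 0 with "otf" gives "fonts/font_0000.otf".
fn font_asset_path(index: usize, ext: &str) -> String {
    format!("fonts/font_{:04}.{}", index, ext)
}

/// Extracts the record index from a path such as "fonts/font_0007.otf".
/// Returns None for any path that does not follow the font naming scheme.
fn parse_font_asset_path(path: &str) -> Option<usize> {
    let rest = path.strip_prefix("fonts/font_")?;
    let (digits, ext) = rest.split_once('.')?;
    if digits.is_empty() || ext.is_empty() {
        return None;
    }
    if !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}
```

Since image and font records share one index space, the number recovered from the path is enough to locate the record; the `fonts/` prefix only tells the loader which naming scheme was used.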

Includes 7 unit tests covering the four flag combinations
(uncompressed / zlib-only / XOR-only / both), wrong-magic rejection,
truncated-record rejection, and out-of-range data-offset rejection.

## Format references

- MobileRead Wiki — MOBI / AZW format overview, palm record types:
  https://wiki.mobileread.com/wiki/MOBI
- Amazon Publishing Guidelines — embedded font support in AZW3
  (`@font-face` rules are honoured by Kindle e-ink and apps):
  https://kdp.amazon.com/en_US/help/topic/G201834180
- EPUB 3.3 Core Media Types §3.3 — OTF / TTF / WOFF are accepted
  font resource MIME types:
  https://www.w3.org/TR/epub-33/#sec-core-media-types

Development

Successfully merging this pull request may close these issues.

AZW3/MOBI font extraction skipped because FONT records are classified as metadata