Extract a shared @prose-reader/archive-metadata package
Motivation
Today, OPF metadata reading is implemented twice:
- @prose-reader/streamer reads dc:title, dc:creator, rendition meta, etc. from the OPF as part of building the manifest (see packages/streamer/src/generators/manifest/hooks/epub/epub.ts and the kobo/apple variants).
- oboku has its own @oboku/archive-metadata package doing essentially the same OPF parsing (plus a write side for its "metadata fixer" feature).
Both implement the same EPUB 3 / OPF spec and converge on the same key/value shape, but with different APIs, types, and parsers (xmldoc in streamer, @xmldom/xmldom in oboku). This causes:
- Duplicated parsing logic and types for the OPF metadata block.
- Drifting interpretations of edge cases (XML namespace prefixes in particular — see below).
- Bundle bloat in oboku-web, which currently ships both xmldoc (via the streamer) and @xmldom/xmldom (~40 KB gzipped, via @oboku/archive-metadata) on its main chunk.
- No shared schema for "book metadata" between the reader and host applications.
The cleanest fix is to extract the common piece — typed OPF / ComicInfo metadata — into a new package in this monorepo and have the streamer consume it. oboku then drops its own package and depends on this one.
Goals
- One canonical OpfMetadata type (and ComicInfoMetadata) used by @prose-reader/streamer and any host application.
- One implementation of OPF metadata reading, namespace-correct, sax-based, SW-safe.
- Optional, separately-imported writer for hosts that need to mutate metadata in-place (oboku fixer).
- No performance regression for the streamer's hot paths (manifest / spine / nav / ncx parsing stays as-is on xmldoc).
- No new third-party runtime dependency added to @prose-reader/streamer (the workspace dependency on the new package aside).
Non-goals
- Not unifying the structural OPF parsing (manifest, spine, guide, nav, ncx). That's reader-specific and stays in @prose-reader/streamer.
- Not replacing xmldoc with @xmldom/xmldom in the streamer. xmldom is ~5× larger and ~3–5× slower; the streamer's perf principles in AGENTS.md argue against it. The new package is also xmldoc-based on the read side.
- Not making this an "EPUB metadata library for the world". Scope is what prose-reader and its hosts already need.
Proposed package: @prose-reader/archive-metadata
Layout
packages/archive-metadata/
  src/
    index.ts              // re-exports types only
    types.ts              // OpfMetadata, ComicInfoMetadata, Identifier, Contributor, ...
    opf/
      read.ts             // readOpfMetadata(xml: string): OpfMetadata
      write.ts            // writeOpfMetadata(xml: string, patch: Partial<OpfMetadata>): string
    comicInfo/
      read.ts             // readComicInfoMetadata(xml: string): ComicInfoMetadata
      write.ts            // writeComicInfoMetadata(xml, patch): string
    utils/
      namespaces.ts       // xmlns:* → URI helper, qualified-name lookup
      normalizeIsbn.ts    // moved from oboku
  package.json
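The layout lists utils/normalizeIsbn.ts as moved from oboku but does not describe its behavior. As a rough illustration of what such a helper might do, the sketch below strips a urn:isbn: prefix and separators, then validates the ISBN-10 or ISBN-13 check digit. The signature and the "return undefined when invalid" convention are assumptions, not oboku's actual implementation.

```typescript
// Hypothetical sketch: NOT the actual oboku implementation.
// Returns the compact ISBN (digits, plus a trailing X for some ISBN-10s),
// or undefined when the input is not a valid ISBN.
export const normalizeIsbn = (raw: string): string | undefined => {
  const compact = raw
    .trim()
    .replace(/^urn:isbn:/i, "")
    .replace(/[\s-]/g, "")
    .toUpperCase()

  if (/^\d{13}$/.test(compact)) {
    // ISBN-13: weights alternate 1, 3; the total must be divisible by 10.
    let sum = 0
    for (let i = 0; i < 13; i++) {
      sum += Number(compact.charAt(i)) * (i % 2 === 0 ? 1 : 3)
    }
    return sum % 10 === 0 ? compact : undefined
  }

  if (/^\d{9}[\dX]$/.test(compact)) {
    // ISBN-10: weights run from 10 down to 1, "X" stands for 10;
    // the total must be divisible by 11.
    let sum = 0
    for (let i = 0; i < 10; i++) {
      const char = compact.charAt(i)
      sum += (char === "X" ? 10 : Number(char)) * (10 - i)
    }
    return sum % 11 === 0 ? compact : undefined
  }

  return undefined
}
```

Returning undefined rather than throwing would let the reader keep non-ISBN identifiers (UUIDs, DOIs) in the identifiers list without special-casing them.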
package.json exports
{
  "name": "@prose-reader/archive-metadata",
  "exports": {
    ".": { "import": "./dist/index.js" },
    "./read": { "import": "./dist/read.js" },
    "./write": { "import": "./dist/write.js" }
  },
  "dependencies": { "xmldoc": "^2.0.0" },
  "optionalDependencies": { "@xmldom/xmldom": "^0.9.5" }
}
- ./read pulls only xmldoc. Used by the streamer and by any read-only host (e.g. oboku's API).
- ./write pulls @xmldom/xmldom. Used only by hosts that mutate metadata (oboku web fixer). Lazy-importable so the ~40 KB stays off main chunks.
- index.ts exports types only so import type { OpfMetadata } from "@prose-reader/archive-metadata" is dependency-free.
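To make the type-only entry point more concrete, here is a minimal sketch of what src/types.ts could contain, derived from the metadata surface listed under "Read implementation notes". Every field name and optionality choice below is an assumption for illustration; the actual schema is still to be designed.

```typescript
// Hypothetical sketch of src/types.ts: all field names are assumptions.
export interface Identifier {
  value: string
  scheme?: string // e.g. "isbn", "uuid"
}

export interface Contributor {
  name: string
  kind: "creator" | "contributor"
  role?: string // MARC relator code such as "aut"
}

export interface OpfMetadata {
  title?: string
  language?: string
  identifiers: Identifier[]
  contributors: Contributor[]
  publisher?: string
  date?: string
  description?: string
  subjects: string[]
  rights?: string
  coverImageHref?: string
  rendition?: {
    layout?: "reflowable" | "pre-paginated"
    flow?: "auto" | "paginated" | "scrolled-continuous" | "scrolled-doc"
    spread?: "auto" | "none" | "landscape" | "portrait" | "both"
  }
}

// A value of this type is plain data, so it survives a JSON round trip
// (useful when passing it between a Service Worker and the page).
export const exampleMetadata: OpfMetadata = {
  title: "Example Book",
  language: "en",
  identifiers: [{ value: "9780306406157", scheme: "isbn" }],
  contributors: [{ name: "Jane Doe", kind: "creator", role: "aut" }],
  subjects: ["Fiction"],
  rendition: { layout: "reflowable" },
}
```

Because the types carry no runtime code, a host that only needs the shape pays nothing for it at bundle time.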
Read implementation notes
- Single pass over the OPF with xmldoc.
- Namespace handling is done properly by walking xmlns:* attributes on the root element, building a prefix→URI map, and looking up elements by URI + local name. This fixes the prefix-sniff approach currently used in packages/streamer/src/parsers/nav.ts:
    // current streamer code
    const rootTagName = ncxData.name
    let prefix = ``
    if (rootTagName.indexOf(`:`) !== -1) {
      prefix = `${rootTagName.split(`:`)[0]}:`
    }
  which silently breaks for documents that bind the namespace to a different prefix or use the default namespace.
- Surface includes: title, language, identifiers (ISBN-aware via normalizeIsbn), contributors (creator/contributor + role), publisher, date, description, subjects, rights, cover-image href (resolved via meta[name=cover] / properties=cover-image), rendition layout/flow/spread.
- Returns a plain serializable object. No live DOM references escape the package.
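The namespace lookup described above could be sketched roughly as follows. The ElementLike interface is an assumption: it mirrors the name / attr / children properties that xmldoc elements expose, but real xmldoc children also contain text nodes, which would need filtering. The sketch also picks up declarations on intermediate elements, since many OPF files declare xmlns:dc on the metadata element rather than on the root.

```typescript
// Minimal structural view of a parsed element (assumed to be compatible
// with xmldoc elements; text nodes are ignored in this sketch).
interface ElementLike {
  name: string
  attr: Record<string, string>
  children?: ElementLike[]
}

// prefix ("" = default namespace) -> namespace URI
type NamespaceScope = Record<string, string>

export const DC_NS = "http://purl.org/dc/elements/1.1/"

// Extend the inherited scope with xmlns / xmlns:* declarations on `element`.
export const scopeFor = (
  element: ElementLike,
  inherited: NamespaceScope = {},
): NamespaceScope => {
  const scope: NamespaceScope = { ...inherited }
  for (const key of Object.keys(element.attr)) {
    const value = element.attr[key]
    if (value === undefined) continue
    if (key === "xmlns") scope[""] = value
    else if (key.indexOf("xmlns:") === 0) scope[key.slice(6)] = value
  }
  return scope
}

// Children of `parent` whose expanded name is (uri, localName),
// regardless of which prefix the document happens to use.
export const childrenByQName = (
  parent: ElementLike,
  parentScope: NamespaceScope,
  uri: string,
  localName: string,
): ElementLike[] =>
  (parent.children ?? []).filter((child) => {
    const scope = scopeFor(child, parentScope)
    const colon = child.name.indexOf(":")
    const prefix = colon === -1 ? "" : child.name.slice(0, colon)
    const local = colon === -1 ? child.name : child.name.slice(colon + 1)
    return local === localName && scope[prefix] === uri
  })
```

With this shape, a title bound to an unusual prefix or to the default namespace resolves the same way as dc:title, and an element whose prefix is never declared simply does not match.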
Write implementation notes
- Round-trip preserving: parse with @xmldom/xmldom, mutate the requested elements only, serialize back. Don't reformat untouched parts.
- Mutation primitives match the current oboku/packages/archive-metadata: upsert child element by tag (with namespace), remove on undefined/empty, preserve siblings.
- Same surface as the read API (whatever you can read, you can write).
Migration in @prose-reader/streamer
- Add @prose-reader/archive-metadata as a workspace dep.
- In packages/streamer/src/generators/manifest/hooks/epub/epub.ts, replace the dc:title + rendition meta extraction block with a single call to readOpfMetadata(opfXmlString) and pull values off the typed result.
- Same in kobo.ts / apple.ts for any metadata fields they read (structure stays as-is on xmldoc).
- Expose the OpfMetadata type on Manifest (or a sub-field) so consumers downstream of the streamer don't need to re-parse to learn the title / authors / cover.
- Backfill: add a streamer test that asserts the metadata in Manifest matches readOpfMetadata for a representative corpus (use existing test EPUBs).
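As an illustration of "pull values off the typed result", the mapping in epub.ts might end up looking something like the sketch below. Both interfaces are simplified stand-ins invented for this example (they are not the real OpfMetadata or Manifest types), and the fallbacks are the EPUB defaults for rendition:layout and rendition:flow.

```typescript
// Illustrative stand-ins only: not the real package or Manifest types.
interface OpfMetadataLike {
  title?: string
  rendition?: {
    layout?: "reflowable" | "pre-paginated"
    flow?: string
    spread?: string
  }
}

interface ManifestMetadataFields {
  title: string
  renditionLayout: "reflowable" | "pre-paginated"
  renditionFlow: string
  renditionSpread: string | undefined
}

// Map the typed read result onto the fields the manifest hook fills in.
export const toManifestFields = (
  metadata: OpfMetadataLike,
): ManifestMetadataFields => ({
  title: metadata.title ?? "",
  renditionLayout: metadata.rendition?.layout ?? "reflowable",
  renditionFlow: metadata.rendition?.flow ?? "auto",
  renditionSpread: metadata.rendition?.spread,
})

// Intended call site (not executed here, the package does not exist yet):
// const fields = toManifestFields(readOpfMetadata(opfXmlString))
```

Keeping the mapping in one small function would also give the backfill test a single place to compare Manifest values against the readOpfMetadata output.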
The streamer continues to parse the OPF once with xmldoc for structure and runs readOpfMetadata on the same string. The OPF is therefore parsed twice, but both passes use xmldoc on the same input, so this adds no second parser, no new runtime dep, and no SW-incompatible APIs.
Migration in oboku (out of scope for this ticket, recorded for context)
- Replace @oboku/archive-metadata imports with @prose-reader/archive-metadata/read (or /write lazily).
- Delete apps/web/src/books/metadataFixer/archiveFile.ts's xmldom-only path; keep the archiveFile shape but back it with the new package.
- Net effect on oboku-web bundle: @xmldom/xmldom (~40 KB gzipped) leaves the main chunk and only loads when the user actually applies a metadata fix.
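The lazy-loading of the /write entry could be wrapped along these lines. This is a sketch under assumptions: the WriterModule shape and the createLazyWriter helper are hypothetical, and the loader is injected so the example runs without the package existing. In an app the loader would be a dynamic import(), which common bundlers split into a separate chunk.

```typescript
// Assumed shape of the "/write" entry point; the patch type is simplified.
interface WriterModule {
  writeOpfMetadata: (xml: string, patch: Record<string, unknown>) => string
}

// Wrap a loader so the module is requested on first use and shared afterwards.
export const createLazyWriter = (load: () => Promise<WriterModule>) => {
  let pending: Promise<WriterModule> | undefined

  return async (
    xml: string,
    patch: Record<string, unknown>,
  ): Promise<string> => {
    // Start loading on the first call only; later calls reuse the same promise.
    if (!pending) pending = load()
    const writer = await pending
    return writer.writeOpfMetadata(xml, patch)
  }
}

// In an app this would be wired up as (not executed here):
// const writeOpf = createLazyWriter(
//   () => import("@prose-reader/archive-metadata/write"),
// )
```

Nothing is fetched until the user actually applies a fix, and concurrent fixes share one in-flight load.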
Risks / open questions
- Cross-repo dev loop. prose-reader and oboku are separate repos. Iterating on @prose-reader/archive-metadata while consuming it from oboku requires publish + version bumps (or npm link). Same friction we already have for @prose-reader/core, but worth flagging.
- Schema is now part of prose-reader's public API. Once OpfMetadata is exported, changing it is a semver event. Keep the schema strictly OPF-spec-defined; host-specific concepts (oboku's "metadata source policy", per-app preferences) stay in the host.
- ComicInfo placement. If @prose-reader/streamer doesn't currently parse ComicInfo.xml, including the type/reader in this package is fine, but it has a single consumer (oboku) for now. Acceptable, just documenting it.
- xmldoc namespace handling. xmldoc doesn't know about XML namespaces natively; the package needs to provide a small helper. This is the same problem the streamer already works around, so the helper benefits both sides.
- No serializer in xmldoc. That's why writes are deliberately on @xmldom/xmldom and behind a separate import path. Reimplementing a serializer on top of xmldoc would be more code than just paying for xmldom on the write path.
Acceptance criteria
- @prose-reader/archive-metadata exists in this monorepo with read / write / type-only entry points.
- @prose-reader/streamer consumes readOpfMetadata for OPF metadata extraction; structural parsing unchanged.
- No new third-party runtime dependency in @prose-reader/streamer's own package.json (the workspace dependency on @prose-reader/archive-metadata aside).
- Existing streamer test fixtures pass; new tests cover namespace prefix variations (default namespace, non-dc prefix on Dublin Core, missing prefix declarations).
- Changelog / migration note for downstream consumers (oboku, demo apps).
Out of scope
- Migration of oboku to the new package (separate ticket on the oboku side).
- Replacing xmldoc with anything else in the streamer's hot paths.
- Switching on native browser DOMParser in any environment (the streamer runs in Service Workers where it's unavailable).
Design alternatives considered
During scoping we considered moving XML parsing out of the package, so the package would only own normalization (consumer parses OPF however they want and hands in a typed JSON shape). Recording the trade-offs here so the chosen direction is explicit and we don't relitigate it later.
Option A — Package owns XML parsing and normalization (the proposal above)
Consumer passes an OPF string in, gets a typed OpfMetadata back.
- Pros
  - Single, mechanical migration in consumers: streamer's epub.ts (and kobo/apple/etc.) deletes its dc:title / rendition meta / identifier extraction in favor of one function call.
  - Namespace handling, refines resolution, identifier scheme rules, cover-image fallback, OPF date parsing, and MARC role codes all live in one place, tested once.
  - Clear "this package's job" framing: you give it an OPF string, it tells you what's in the book.
- Cons
  - Package takes a runtime XML dep (xmldoc on the read side). Streamer already depends on xmldoc, so no new dep there; oboku consumers also gain xmldoc (~7 KB gzipped) but in oboku-web's case lose @xmldom/xmldom (~40 KB gzipped) from the main chunk in exchange, a net bundle improvement.
  - Schema becomes part of @prose-reader/archive-metadata's public API surface.
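To give a feel for the "refines resolution" and MARC role handling that Option A centralizes, here is a minimal sketch working on already-extracted plain data. In EPUB 3 a role is attached through a meta element whose refines attribute points at the creator's id, while EPUB 2 uses an opf:role attribute on the creator itself. The input shapes and the resolveRoles name are assumptions made for this example.

```typescript
// Hypothetical intermediate shapes, as if already pulled out of <metadata>.
interface RawCreator {
  id?: string
  name: string
  opfRole?: string // EPUB 2 style opf:role attribute, if present
}

interface RawMeta {
  refines?: string // e.g. "#creator01"
  property?: string // e.g. "role", "file-as"
  value: string
}

interface ResolvedContributor {
  name: string
  role?: string // MARC relator code such as "aut" or "ill"
}

export const resolveRoles = (
  creators: RawCreator[],
  metas: RawMeta[],
): ResolvedContributor[] => {
  // Index EPUB 3 role refinements by the id they point at (first one wins).
  const roleById = new Map<string, string>()
  for (const meta of metas) {
    if (meta.property !== "role" || !meta.refines) continue
    if (meta.refines.charAt(0) !== "#") continue
    const id = meta.refines.slice(1)
    if (!roleById.has(id)) roleById.set(id, meta.value.trim())
  }

  // Prefer the EPUB 3 refinement, fall back to the EPUB 2 attribute.
  return creators.map((creator) => {
    const refined =
      creator.id !== undefined ? roleById.get(creator.id) : undefined
    return { name: creator.name, role: refined ?? creator.opfRole }
  })
}
```

Having this rule in one tested place is the kind of consolidation the pros list refers to: each consumer otherwise has to remember both the EPUB 2 and EPUB 3 forms.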
Option B — Package owns only normalization (JSON in / JSON out)
Consumer parses XML themselves, builds a RawOpfMetadata JSON shape, hands it to normalizeOpf. Package has zero XML dep.
- Pros
  - Package has no runtime dep at all (types + pure functions).
  - Consumer keeps full control over which XML library runs in their environment (xmldoc in SW, native DOMParser in browser, xmldom in Node, etc.).
  - Trivially testable: pure JSON → JSON.
- Cons
  - The boundary lands in an awkward place. To produce RawOpfMetadata, the consumer already has to find <metadata>, iterate its children, read <dc:title> text + id + xml:lang, read <dc:identifier> + opf:scheme, walk <meta property="…" refines="#…"> blocks, etc. That's most of the OPF-reading complexity. After all that, the normalization the package provides is a small leftover.
  - Each consumer's adapter is roughly the size of "just doing it yourself," so the duplication this is supposed to eliminate doesn't actually shrink much.
  - "What does this package do?" loses a clean answer.
Option C — No package; share types + utilities via @prose-reader/shared
Export OpfMetadata, ComicInfoMetadata types and a few pure utilities (normalizeIsbn, marcRoleCodes, parseOpfDate, maybe pickPrimaryTitle, resolveCoverImageItemId) from the existing shared package. Each app keeps its own OPF reader.
- Pros
  - Smallest possible change, no new package to set up / version / publish / document.
  - Consolidates the genuinely tricky spec bits (ISBN normalization, refines resolution rules, MARC roles) without imposing a parser choice on anyone.
  - The post-migration diff in epub.ts is small but clean: replace local helper imports with shared ones.
- Cons
  - Each consumer still owns its own OPF reader code; the duplication is reduced, not eliminated.
  - Two OpfMetadata-shaped types still exist if we're not careful (one in shared, one as the actual return type of each app's reader). Discipline-required, not enforced.
  - Doesn't address oboku-web's main-chunk @xmldom/xmldom cost on its own; that requires the lazy-load fix regardless.
Decision
Going with Option A. Reasoning:
- The streamer's epub.ts (and kobo/apple/etc.) keeps re-implementing OPF metadata extraction in slightly different ways, and so does oboku in two more places. Option A's migration is the only one that produces a meaningful deletion diff in consumers; that's the test for "is this package pulling its weight?"
- Bundle math works out: streamer's xmldoc dep is already present, so the package adds zero runtime cost there. Oboku's main chunk drops @xmldom/xmldom and the package keeps xmldoc runtime cost contained to the read path. Writes (oboku web fixer only) lazy-load the /write entry, so @xmldom/xmldom lands on the fixer-write chunk only.
- Option B was rejected because the package's value proposition collapses once parsing is excluded: the consumer ends up doing most of the work and the package's reason to exist becomes hand-wavy.
- Option C remains a fine fallback if Option A migration turns out to be more work than expected. The two are not mutually exclusive: if A stalls, lifting just the type and a few utilities into @prose-reader/shared is a strictly smaller version of the same idea.
What this means for the package boundary
- In scope: OPF read (string → OpfMetadata), ComicInfo read, ISBN/role/date normalization rules, optional OPF write entry behind a separate import path.
- Out of scope: structural OPF parsing (manifest/spine/guide/nav/ncx), which stays in @prose-reader/streamer. Browser-native DOMParser selection is not relevant since the streamer runs in Service Workers.
- Public API contract: OpfMetadata, ComicInfoMetadata, and the read/write functions. Treat schema changes as semver events.
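As one example of the date normalization rules that fall in scope, a parseOpfDate helper might accept the truncated ISO 8601 forms commonly found in dc:date (year, year-month, full date, or full date with a time part) and reject everything else. The return shape and the strictness level below are assumptions; real-world files contain looser formats that the package may choose to tolerate.

```typescript
// Hypothetical result shape: only the parts actually present are set.
interface OpfDate {
  year: number
  month?: number
  day?: number
}

// Accepts "YYYY", "YYYY-MM", "YYYY-MM-DD" and "YYYY-MM-DDT..." (time ignored).
export const parseOpfDate = (raw: string): OpfDate | undefined => {
  const match = /^(\d{4})(?:-(\d{2})(?:-(\d{2})(?:T.*)?)?)?$/.exec(raw.trim())
  if (!match) return undefined

  const year = Number(match[1])
  const month = match[2] ? Number(match[2]) : undefined
  const day = match[3] ? Number(match[3]) : undefined

  // Coarse range checks only; this sketch does not validate days per month.
  if (month !== undefined && (month < 1 || month > 12)) return undefined
  if (day !== undefined && (day < 1 || day > 31)) return undefined

  const result: OpfDate = { year }
  if (month !== undefined) result.month = month
  if (day !== undefined) result.day = day
  return result
}
```

Returning a structured value instead of a Date avoids inventing a month or day that the publisher never specified.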