consolidation metadata fixer #417

# Extract a shared `@prose-reader/archive-metadata` package

## Motivation

Today, OPF metadata reading is implemented twice:

- **`@prose-reader/streamer`** reads `dc:title`, `dc:creator`, rendition `meta`, etc. from the OPF as part of building the manifest (see `packages/streamer/src/generators/manifest/hooks/epub/epub.ts` and the kobo/apple variants).
- **`oboku`** has its own `@oboku/archive-metadata` package doing essentially the same OPF parsing (plus a write side for its "metadata fixer" feature).

Both implement the same EPUB 3 / OPF spec and converge on the same key/value shape, but with different APIs, types, and parsers (`xmldoc` in streamer, `@xmldom/xmldom` in oboku). This causes:

- Duplicated parsing logic and types for the OPF metadata block.
- Drifting interpretations of edge cases (XML namespace prefixes in particular — see below).
- Bundle bloat in oboku-web, which currently ships **both** `xmldoc` (via the streamer) **and** `@xmldom/xmldom` (~40 KB gzipped, via `@oboku/archive-metadata`) on its main chunk.
- No shared schema for "book metadata" between the reader and host applications.

The cleanest fix is to extract the common piece — typed OPF / ComicInfo metadata — into a new package in this monorepo and have the streamer consume it. oboku then drops its own package and depends on this one.

## Goals

- One canonical `OpfMetadata` type (and `ComicInfoMetadata`) used by `@prose-reader/streamer` and any host application.
- One implementation of OPF metadata reading: namespace-correct, SAX-based (via `xmldoc`), and safe to run in a Service Worker (SW).
- Optional, separately-imported writer for hosts that need to mutate metadata in-place (oboku fixer).
- No performance regression for the streamer's hot paths (manifest / spine / nav / ncx parsing stays as-is on `xmldoc`).
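- No new third-party runtime dependency added to `@prose-reader/streamer`: the only addition is the workspace package itself, whose read path uses the `xmldoc` the streamer already depends on.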

## Non-goals

- **Not** unifying the structural OPF parsing (`manifest`, `spine`, `guide`, `nav`, `ncx`). That's reader-specific and stays in `@prose-reader/streamer`.
- **Not** replacing `xmldoc` with `@xmldom/xmldom` in the streamer. `xmldom` is ~5× larger and ~3–5× slower; the streamer's perf principles in `AGENTS.md` argue against it. The new package is also `xmldoc`-based on the read side.
- **Not** making this an "EPUB metadata library for the world". Scope is what prose-reader and its hosts already need.

## Proposed package: `@prose-reader/archive-metadata`

### Layout

```
packages/archive-metadata/
  src/
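    index.ts                 // re-exports types only
    read.ts                  // "./read" entry: re-exports opf/read + comicInfo/read
    write.ts                 // "./write" entry: re-exports opf/write + comicInfo/write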
    types.ts                 // OpfMetadata, ComicInfoMetadata, Identifier, Contributor, ...
    opf/
      read.ts                // readOpfMetadata(xml: string): OpfMetadata
      write.ts               // writeOpfMetadata(xml: string, patch: Partial<OpfMetadata>): string
    comicInfo/
      read.ts                // readComicInfoMetadata(xml: string): ComicInfoMetadata
      write.ts               // writeComicInfoMetadata(xml, patch): string
    utils/
      namespaces.ts          // xmlns:* → URI helper, qualified-name lookup
      normalizeIsbn.ts       // moved from oboku
  package.json
```

### `package.json` exports

```jsonc
{
  "name": "@prose-reader/archive-metadata",
  "exports": {
    ".":       { "import": "./dist/index.js" },
    "./read":  { "import": "./dist/read.js"  },
    "./write": { "import": "./dist/write.js" }
  },
  "dependencies": {
    "xmldoc": "^2.0.0"
  },
  "optionalDependencies": {
    "@xmldom/xmldom": "^0.9.5"
  }
}
```

- `./read` pulls only `xmldoc`. Used by the streamer and by any read-only host (e.g. oboku's API).
- `./write` pulls `@xmldom/xmldom`. Used only by hosts that mutate metadata (oboku web fixer). Lazy-importable so the ~40 KB stays off main chunks.
- `index.ts` exports types only so `import type { OpfMetadata } from "@prose-reader/archive-metadata"` is dependency-free.
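
To make the type-only entry point concrete, here is a rough sketch of what `types.ts` could contain. The field names (`identifiers`, `contributors`, `coverHref`, `rendition`, and so on) are assumptions derived from the surface listed under the read implementation notes. They are not the final schema.

```ts
// Hypothetical shape for types.ts; every field name here is an assumption.
type Identifier = {
  value: string
  scheme?: string // e.g. "ISBN", "UUID"
}

type Contributor = {
  name: string
  kind: "creator" | "contributor"
  role?: string // MARC relator code, e.g. "aut"
}

type OpfMetadata = {
  title?: string
  language?: string
  identifiers: Identifier[]
  contributors: Contributor[]
  publisher?: string
  date?: string
  description?: string
  subjects: string[]
  rights?: string
  coverHref?: string
  rendition: {
    layout?: "reflowable" | "pre-paginated"
    flow?: "paginated" | "scrolled-continuous" | "scrolled-doc" | "auto"
    spread?: "none" | "landscape" | "portrait" | "both" | "auto"
  }
}

// The value is plain data, so it survives a JSON round trip
// (for example when posted out of a Service Worker).
const sample: OpfMetadata = {
  title: "Moby-Dick",
  language: "en",
  identifiers: [{ value: "9780306406157", scheme: "ISBN" }],
  contributors: [{ name: "Herman Melville", kind: "creator", role: "aut" }],
  subjects: [],
  rendition: { layout: "reflowable" },
}

const roundTripped: OpfMetadata = JSON.parse(JSON.stringify(sample))
```

Keeping the object free of class instances and DOM references is what lets `index.ts` stay dependency-free.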

### Read implementation notes

- Single pass over the OPF with `xmldoc`.
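- Namespace handling is done **properly** by collecting `xmlns` / `xmlns:*` declarations from the root element down to the element being read (in OPF files, `xmlns:dc` is commonly declared on `<metadata>` rather than on `<package>`), building a prefix→URI map, and looking up elements by URI + local name. This fixes the prefix-sniff approach currently used in `packages/streamer/src/parsers/nav.ts`: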

  ```ts
  // current streamer code
  const rootTagName = ncxData.name
  let prefix = ``
  if (rootTagName.indexOf(`:`) !== -1) {
    prefix = `${rootTagName.split(`:`)[0]}:`
  }
  ```

  which silently breaks when child elements bind the namespace to a different prefix than the root element, or when a document mixes prefixed and default-namespace forms.

- Surface includes: title, language, identifiers (ISBN-aware via `normalizeIsbn`), contributors (creator/contributor + role), publisher, date, description, subjects, rights, cover-image href (resolved via `meta[name=cover]` / `properties=cover-image`), rendition layout/flow/spread.
- Returns a plain serializable object. No live DOM references escape the package.
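
The namespace lookup described above could be sketched roughly as follows. `XmlElementLike` is an assumed structural subset of an `xmldoc` element (`name`, `attr`, `val`, `children`; text nodes are ignored for brevity), and `findAllByNs` is a hypothetical helper name, not an existing API.

```ts
// Assumed minimal view of a parsed element; not the real xmldoc type.
type XmlElementLike = {
  name: string
  attr: Record<string, string>
  val: string
  children: XmlElementLike[]
}

// prefix -> namespace URI; the empty string key is the default namespace.
type NsScope = Record<string, string>

const DC_NS = "http://purl.org/dc/elements/1.1/"

// Extend the inherited scope with xmlns / xmlns:* declared on this element.
function extendScope(parent: NsScope, attr: Record<string, string>): NsScope {
  const scope: NsScope = { ...parent }
  for (const [key, value] of Object.entries(attr)) {
    if (key === "xmlns") scope[""] = value
    else if (key.startsWith("xmlns:")) scope[key.slice("xmlns:".length)] = value
  }
  return scope
}

// Split "dc:title" into prefix "dc" and local name "title".
function splitQName(qname: string): { prefix: string; local: string } {
  const i = qname.indexOf(":")
  return i === -1
    ? { prefix: "", local: qname }
    : { prefix: qname.slice(0, i), local: qname.slice(i + 1) }
}

// Depth-first search for elements whose resolved namespace URI and local
// name match, whatever prefix the document happens to use.
function findAllByNs(
  root: XmlElementLike,
  nsUri: string,
  local: string,
  inherited: NsScope = {},
): XmlElementLike[] {
  const scope = extendScope(inherited, root.attr)
  const q = splitQName(root.name)
  const matches: XmlElementLike[] =
    scope[q.prefix] === nsUri && q.local === local ? [root] : []
  for (const child of root.children) {
    matches.push(...findAllByNs(child, nsUri, local, scope))
  }
  return matches
}
```

Because the scope is rebuilt per element, a prefix declared on `<metadata>` is visible to its children but not to siblings of `<metadata>`, and an undeclared prefix never matches.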

### Write implementation notes

- Round-trip preserving: parse with `@xmldom/xmldom`, mutate the requested elements only, serialize back. Don't reformat untouched parts.
- Mutation primitives match the current `oboku/packages/archive-metadata`: upsert child element by tag (with namespace), remove on `undefined`/empty, preserve siblings.
- Same surface as the read API (whatever you can read, you can write).

## Migration in `@prose-reader/streamer`

1. Add `@prose-reader/archive-metadata` as a workspace dep.
2. In `packages/streamer/src/generators/manifest/hooks/epub/epub.ts`, replace the `dc:title` + rendition `meta` extraction block with a single call to `readOpfMetadata(opfXmlString)` and pull values off the typed result.
3. Same in `kobo.ts` / `apple.ts` for any metadata fields they read (structure stays as-is on `xmldoc`).
4. Expose the `OpfMetadata` type on `Manifest` (or a sub-field) so consumers downstream of the streamer don't need to re-parse to learn the title / authors / cover.
5. Backfill: add a streamer test that asserts the metadata in `Manifest` matches `readOpfMetadata` for a representative corpus (use existing test EPUBs).

The streamer continues to parse the OPF with `xmldoc` for structure and runs `readOpfMetadata` on the same string, so the OPF is parsed twice, both times with `xmldoc`. No second XML library is involved, no new third-party runtime dep is added, and no SW-incompatible APIs are used.
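
As an illustration of step 2, the hook could map the typed result onto manifest fields along these lines. The `Manifest` field names (`renditionLayout`, `renditionFlow`, `renditionSpread`) and the metadata shape are assumptions made for this sketch. The fallbacks are the EPUB 3 defaults, which may differ from what the streamer currently does.

```ts
// Subset of the hypothetical OpfMetadata type that the manifest hook reads.
type RenditionMetadata = {
  layout?: "reflowable" | "pre-paginated"
  flow?: "paginated" | "scrolled-continuous" | "scrolled-doc" | "auto"
  spread?: "none" | "landscape" | "portrait" | "both" | "auto"
}

type OpfMetadataForManifest = {
  title?: string
  rendition: RenditionMetadata
}

// Manifest fields fed by OPF metadata (field names are assumptions).
type ManifestMetadataFields = {
  title: string
  renditionLayout: NonNullable<RenditionMetadata["layout"]>
  renditionFlow: NonNullable<RenditionMetadata["flow"]>
  renditionSpread: NonNullable<RenditionMetadata["spread"]>
}

function toManifestFields(
  metadata: OpfMetadataForManifest,
): ManifestMetadataFields {
  return {
    title: metadata.title ?? "",
    // Fall back to the EPUB 3 defaults when the OPF omits a rendition property.
    renditionLayout: metadata.rendition.layout ?? "reflowable",
    renditionFlow: metadata.rendition.flow ?? "auto",
    renditionSpread: metadata.rendition.spread ?? "auto",
  }
}

// In the hook this would be fed by readOpfMetadata(opfXmlString);
// a literal stands in for the parsed result here.
const manifestFields = toManifestFields({
  title: "Sample",
  rendition: { layout: "pre-paginated" },
})
```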

## Migration in oboku (out of scope for this ticket, recorded for context)

- Replace `@oboku/archive-metadata` imports with `@prose-reader/archive-metadata/read` (or `/write` lazily).
- Delete `apps/web/src/books/metadataFixer/archiveFile.ts`'s xmldom-only path; keep the `archiveFile` shape but back it with the new package.
- Net effect on oboku-web bundle: `@xmldom/xmldom` (~40 KB gzipped) leaves the main chunk and only loads when the user actually applies a metadata fix.
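
One way a host could keep the write entry off its main chunk is a memoized dynamic import. The sketch below uses a stand-in loader so that it runs on its own. The `once` helper, the `WriterModule` shape, and `applyFix` are hypothetical names, and the placeholder writer does nothing.

```ts
// Memoize an async loader so the chunk is requested at most once.
// A failed load clears the cache so a later call can retry.
function once<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined
  return () =>
    (pending ??= load().catch((error: unknown) => {
      pending = undefined
      throw error
    }))
}

// Assumed shape of the "./write" entry; only what this sketch needs.
type WriterModule = {
  writeOpfMetadata: (xml: string, patch: { title?: string }) => string
}

// In the host this would wrap the real entry point, e.g.
//   once(() => import("@prose-reader/archive-metadata/write"))
let loadCount = 0
const loadWriter = once(async (): Promise<WriterModule> => {
  loadCount += 1
  return { writeOpfMetadata: (xml) => xml } // placeholder, not the real writer
})

// Only the code path that applies a fix pays for loading the writer.
async function applyFix(xml: string, title: string): Promise<string> {
  const { writeOpfMetadata } = await loadWriter()
  return writeOpfMetadata(xml, { title })
}
```

With a bundler that splits on dynamic `import()`, this pattern is what would keep `@xmldom/xmldom` confined to the chunk loaded when a fix is applied.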

## Risks / open questions

- **Cross-repo dev loop.** prose-reader and oboku are separate repos. Iterating on `@prose-reader/archive-metadata` while consuming it from oboku requires publish + version bumps (or `npm link`). Same friction we already have for `@prose-reader/core`, but worth flagging.
- **Schema is now part of prose-reader's public API.** Once `OpfMetadata` is exported, changing it is a semver event. Keep the schema strictly OPF-spec-defined; host-specific concepts (oboku's "metadata source policy", per-app preferences) stay in the host.
- **ComicInfo placement.** If `@prose-reader/streamer` doesn't currently parse `ComicInfo.xml`, including the type/reader in this package is fine, but that code has a single consumer (oboku) for now. Acceptable, just documenting it.
- **xmldoc namespace handling.** `xmldoc` doesn't know about XML namespaces natively; the package needs to provide a small helper. This is the same problem the streamer already works around, so the helper benefits both sides.
- **No round-trip-safe serializer in `xmldoc`.** Its `toString()` is documented as a debugging aid and is not guaranteed to emit valid XML. That's why writes are deliberately on `@xmldom/xmldom` and behind a separate import path. Reimplementing a serializer on top of `xmldoc` would be more code than just paying for `xmldom` on the write path.

## Acceptance criteria

- `@prose-reader/archive-metadata` exists in this monorepo with `read` / `write` / type-only entry points.
- `@prose-reader/streamer` consumes `readOpfMetadata` for OPF metadata extraction; structural parsing unchanged.
- No new third-party runtime dependency in `@prose-reader/streamer`'s own `package.json` (the workspace dep on `@prose-reader/archive-metadata` is the only addition).
- Existing streamer test fixtures pass; new tests cover namespace prefix variations (default namespace, non-`dc` prefix on Dublin Core, missing prefix declarations).
- Changelog / migration note for downstream consumers (oboku, demo apps).

## Out of scope

- Migration of oboku to the new package (separate ticket on the oboku side).
- Replacing `xmldoc` with anything else in the streamer's hot paths.
- Switching to the native browser `DOMParser` in any environment (the streamer runs in Service Workers, where it's unavailable).


## Design alternatives considered

During scoping we considered moving XML parsing out of the package, so the package would only own normalization (consumer parses OPF however they want and hands in a typed JSON shape). Recording the trade-offs here so the chosen direction is explicit and we don't relitigate it later.

### Option A — Package owns XML parsing and normalization (the proposal above)

Consumer passes an OPF string in, gets a typed `OpfMetadata` back.

- **Pros**
  - Single, mechanical migration in consumers: streamer's `epub.ts` (and kobo/apple/etc.) deletes its `dc:title` / rendition `meta` / identifier extraction in favor of one function call.
  - Namespace handling, refines resolution, identifier scheme rules, cover-image fallback, OPF date parsing, MARC role codes — all live in one place, tested once.
  - Clear "this package's job" framing: you give it an OPF string, it tells you what's in the book.
- **Cons**
  - Package takes a runtime XML dep (`xmldoc` on the read side). Streamer already depends on `xmldoc`, so no new dep there; oboku consumers also gain `xmldoc` (~7 KB gzipped) but in oboku-web's case lose `@xmldom/xmldom` (~40 KB gzipped) from the main chunk in exchange — net bundle improvement.
  - Schema becomes part of `@prose-reader/archive-metadata`'s public API surface.

### Option B — Package owns *only* normalization (JSON in / JSON out)

Consumer parses XML themselves, builds a `RawOpfMetadata` JSON shape, hands it to `normalizeOpf`. Package has zero XML dep.

- **Pros**
  - Package has no runtime dep at all (types + pure functions).
  - Consumer keeps full control over which XML library runs in their environment (xmldoc in SW, native `DOMParser` in browser, xmldom in Node, etc.).
  - Trivially testable: pure JSON → JSON.
- **Cons**
  - The boundary lands in an awkward place. To produce `RawOpfMetadata`, the consumer already has to find `<metadata>`, iterate its children, read `<dc:title>` text + `id` + `xml:lang`, read `<dc:identifier>` + `opf:scheme`, walk `<meta property="…" refines="#…">` blocks, etc. That's most of the OPF-reading complexity. After all that, the normalization the package provides is a small leftover.
  - Each consumer's adapter is roughly the size of "just doing it yourself," so the duplication this is supposed to eliminate doesn't actually shrink much.
  - "What does this package do?" loses a clean answer.
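
To see where Option B's boundary would land, here is a sketch of a possible `RawOpfMetadata` input and the normalization left for the package to do. All type and function names are hypothetical. By the time this input exists, the consumer has already located `<metadata>`, read every `dc:*` element, and collected the `refines` metas.

```ts
// Hypothetical Option B input, already flattened by the consumer.
type RawDcEntry = {
  element: string // local name, e.g. "title" or "creator"
  id?: string
  text: string
}

type RawMetaEntry = {
  property: string // e.g. "role"
  refines?: string // e.g. "#creator01"
  text: string
}

type RawOpfMetadata = {
  dc: RawDcEntry[]
  meta: RawMetaEntry[]
}

type NormalizedCreator = { name: string; role?: string }

// What remains for the package: joining each creator with the
// <meta property="role" refines="#id"> that refines it, if any.
function normalizeCreators(raw: RawOpfMetadata): NormalizedCreator[] {
  return raw.dc
    .filter((entry) => entry.element === "creator")
    .map((entry) => {
      const role = raw.meta.find(
        (m) =>
          m.property === "role" &&
          entry.id !== undefined &&
          m.refines === `#${entry.id}`,
      )
      return role
        ? { name: entry.text, role: role.text }
        : { name: entry.text }
    })
}

const rawExample: RawOpfMetadata = {
  dc: [
    { element: "title", text: "Sample" },
    { element: "creator", id: "c1", text: "Jane Doe" },
    { element: "creator", text: "Anonymous" },
  ],
  meta: [{ property: "role", refines: "#c1", text: "aut" }],
}

const creators = normalizeCreators(rawExample)
```

The function is only a few lines, while producing `rawExample` from XML is the part each consumer would still have to write.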

### Option C — No package; share types + utilities via `@prose-reader/shared`

Export `OpfMetadata`, `ComicInfoMetadata` types and a few pure utilities (`normalizeIsbn`, `marcRoleCodes`, `parseOpfDate`, maybe `pickPrimaryTitle`, `resolveCoverImageItemId`) from the existing shared package. Each app keeps its own OPF reader.

- **Pros**
  - Smallest possible change, no new package to set up / version / publish / document.
  - Consolidates the genuinely tricky spec bits (ISBN normalization, refines resolution rules, MARC roles) without imposing a parser choice on anyone.
  - The post-migration diff in `epub.ts` is small but clean: replace local helper imports with shared ones.
- **Cons**
  - Each consumer still owns its own OPF reader code; the duplication is reduced, not eliminated.
  - Two `OpfMetadata`-shaped types still exist if we're not careful (one in shared, one as the actual return type of each app's reader). Discipline-required, not enforced.
  - Doesn't address oboku-web's main-chunk `@xmldom/xmldom` cost on its own — that requires the lazy-load fix regardless.
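
For a sense of scale, a pure utility such as `normalizeIsbn` might look like the sketch below. This is not oboku's implementation, whose behavior is not described here. It assumes the helper strips `urn:isbn:` / `ISBN` prefixes and separators, validates the check digit, and returns `undefined` for anything that isn't a valid ISBN-10 or ISBN-13.

```ts
// Hypothetical normalizeIsbn: returns the compact ISBN (10 or 13 characters)
// when the check digit is valid, otherwise undefined.
function normalizeIsbn(input: string): string | undefined {
  const compact = input
    .trim()
    .replace(/^urn:isbn:/i, "")
    .replace(/^isbn[:\s]*/i, "")
    .replace(/[\s-]/g, "")
    .toUpperCase()

  if (/^\d{13}$/.test(compact)) {
    // ISBN-13: digits weighted 1,3,1,3,... must sum to a multiple of 10.
    const sum = compact
      .split("")
      .reduce((acc, ch, i) => acc + Number(ch) * (i % 2 === 0 ? 1 : 3), 0)
    return sum % 10 === 0 ? compact : undefined
  }

  if (/^\d{9}[\dX]$/.test(compact)) {
    // ISBN-10: digits weighted 10..1 (X counts as 10) must sum to a multiple of 11.
    const sum = compact
      .split("")
      .reduce((acc, ch, i) => acc + (ch === "X" ? 10 : Number(ch)) * (10 - i), 0)
    return sum % 11 === 0 ? compact : undefined
  }

  return undefined
}

const fromUrn = normalizeIsbn("urn:isbn:978-0-306-40615-7")
const fromIsbn10 = normalizeIsbn("ISBN 0-306-40615-2")
```

Utilities of this size are the part of the problem that Option C would share, leaving the surrounding OPF reader in each app.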

### Decision

**Going with Option A.** Reasoning:

- The streamer's `epub.ts` (and kobo/apple/etc.) keeps re-implementing OPF metadata extraction in slightly different ways, and so does oboku in two more places. Option A's migration is the only one that produces a meaningful deletion diff in consumers — that's the test for "is this package pulling its weight?"
- Bundle math works out: streamer's `xmldoc` dep is already present, so the package adds zero runtime cost there. Oboku's main chunk drops `@xmldom/xmldom` and the package keeps `xmldoc` runtime cost contained to the read path. Writes (oboku web fixer only) lazy-load the `/write` entry, so `@xmldom/xmldom` lands on the fixer-write chunk only.
- Option B was rejected because the package's value proposition collapses once parsing is excluded — the consumer ends up doing most of the work and the package's reason to exist becomes hand-wavy.
- Option C remains a fine fallback if Option A migration turns out to be more work than expected. The two are not mutually exclusive: if A stalls, lifting just the type and a few utilities into `@prose-reader/shared` is a strictly smaller version of the same idea.

### What this means for the package boundary

- **In scope:** OPF read (string → `OpfMetadata`), ComicInfo read, ISBN/role/date normalization rules, optional write entry (OPF and ComicInfo) behind a separate import path.
- **Out of scope:** structural OPF parsing (`manifest`/`spine`/`guide`/`nav`/`ncx`) — stays in `@prose-reader/streamer`. Browser-native `DOMParser` selection — not relevant since the streamer runs in Service Workers.
- **Public API contract:** `OpfMetadata`, `ComicInfoMetadata`, and the read/write functions. Treat schema changes as semver events.
