Font Subsetting

ABCrimson edited this page Mar 1, 2026 · 5 revisions

modern-pdf-lib v0.15.1 — added in v0.12.0

Font subsetting strips from the embedded font program every glyph outline that the document does not use. For a large CJK font in a document that draws only five characters, this can be the difference between a 10 MB PDF and a 50 KB one. Subsetting is on by default and requires no configuration.


How it works

The subsetter lives in src/assets/font/ttfSubset.ts and operates entirely on raw Uint8Array data — no Node.js Buffer, no filesystem access.

Step 1: Parse the table directory

TrueType fonts begin with an offset table followed by a sequence of 16-byte table records. The subsetter reads the number of tables from byte offset 4, then iterates the records to build a Map<tag, { offset, length }>:

Offset table header (12 bytes)
  sfVersion  numTables  searchRange  entrySelector  rangeShift

Table directory (numTables × 16 bytes each)
  tag[4]  checkSum  offset  length

The four tables required for subsetting are head, loca, maxp, and glyf. If any of these are missing the font is not a TrueType outline font and the bytes are returned unchanged.
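
A minimal sketch of this step, assuming the font bytes are a plain (non-collection) sfnt file. The names parseTableDirectory, hasTrueTypeOutlines, and TableRecord are illustrative and are not taken from ttfSubset.ts:

```typescript
// Hypothetical helper names; only the byte layout comes from the TrueType format.
interface TableRecord {
  offset: number;
  length: number;
}

function parseTableDirectory(font: Uint8Array): Map<string, TableRecord> {
  const view = new DataView(font.buffer, font.byteOffset, font.byteLength);
  const numTables = view.getUint16(4); // DataView reads big-endian by default
  const tables = new Map<string, TableRecord>();
  for (let i = 0; i < numTables; i++) {
    const rec = 12 + i * 16; // records start right after the 12-byte header
    const tag = String.fromCharCode(
      view.getUint8(rec),
      view.getUint8(rec + 1),
      view.getUint8(rec + 2),
      view.getUint8(rec + 3),
    );
    // rec + 4 holds the table checksum, which is not needed here.
    tables.set(tag, {
      offset: view.getUint32(rec + 8),
      length: view.getUint32(rec + 12),
    });
  }
  return tables;
}

// True only when all four tables needed for glyf-based subsetting are present.
function hasTrueTypeOutlines(tables: Map<string, TableRecord>): boolean {
  return ['head', 'loca', 'maxp', 'glyf'].every((tag) => tables.has(tag));
}
```

A caller would return the input bytes unchanged whenever hasTrueTypeOutlines() is false.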

Step 2: Parse loca and glyf

The loca table maps each glyph ID to a byte range within the glyf table. It exists in two formats:

head.indexToLocFormat  Entry size                 Max glyf size
0 (short)              2 bytes (stored value × 2)  131,070 bytes
1 (long)               4 bytes                    4 GB

The subsetter reads numGlyphs + 1 offsets, producing a half-open interval [locaOffsets[gid], locaOffsets[gid+1]) for each glyph. A glyph whose two offsets are equal has no outline data.
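
The following sketch shows one way to decode both loca formats. It assumes numGlyphs has already been read from maxp (uint16 at offset 4) and indexToLocFormat from head (int16 at offset 50). The function name readLocaOffsets is illustrative:

```typescript
// Hypothetical helper: decode loca into absolute byte offsets within glyf.
function readLocaOffsets(
  loca: Uint8Array,
  numGlyphs: number,
  indexToLocFormat: number,
): number[] {
  const view = new DataView(loca.buffer, loca.byteOffset, loca.byteLength);
  const offsets: number[] = [];
  // numGlyphs + 1 entries: the extra one marks the end of the last glyph.
  for (let i = 0; i <= numGlyphs; i++) {
    offsets.push(
      indexToLocFormat === 0
        ? view.getUint16(i * 2) * 2 // short format stores offset / 2
        : view.getUint32(i * 4), // long format stores the offset itself
    );
  }
  return offsets;
}
```

With the offsets decoded, glyph gid occupies glyf bytes offsets[gid] up to, but not including, offsets[gid + 1].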

Step 3: Resolve composite glyph dependencies

Simple glyphs contain their own contours. Composite glyphs (e.g. accented letters built from a base character plus a diacritic) reference other glyphs via component records. For each retained composite glyph (one whose numberOfContours is negative), the subsetter walks the component records, continuing while the MORE_COMPONENTS flag is set, and recursively adds every referenced component ID to the retained set:

// composite glyph flag bits (TrueType spec)
const ARG_1_AND_2_ARE_WORDS  = 0x0001; // argument size
const WE_HAVE_A_SCALE        = 0x0008; // 1 × F2Dot14 transform
const MORE_COMPONENTS        = 0x0020; // another component follows
const WE_HAVE_AN_X_AND_Y_SCALE = 0x0040; // 2 × F2Dot14
const WE_HAVE_A_TWO_BY_TWO   = 0x0080; // 4 × F2Dot14 (matrix)

This resolution is transitive: a composite that references another composite will pull in all of its components as well.
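
The transitive walk could look roughly like the sketch below. It uses a worklist rather than recursion, which is a substitution made here for brevity, and the name resolveComposites is hypothetical. It assumes loca has already been decoded into absolute offsets:

```typescript
const ARG_1_AND_2_ARE_WORDS = 0x0001;
const WE_HAVE_A_SCALE = 0x0008;
const MORE_COMPONENTS = 0x0020;
const WE_HAVE_AN_X_AND_Y_SCALE = 0x0040;
const WE_HAVE_A_TWO_BY_TWO = 0x0080;

// Hypothetical helper: returns the used glyphs plus every component they need.
function resolveComposites(
  glyf: Uint8Array,
  loca: number[],
  used: Iterable<number>,
): Set<number> {
  const view = new DataView(glyf.buffer, glyf.byteOffset, glyf.byteLength);
  const retained = new Set<number>();
  const pending = Array.from(used);
  while (pending.length > 0) {
    const gid = pending.pop() as number;
    if (gid < 0 || gid + 1 >= loca.length || retained.has(gid)) continue;
    retained.add(gid);
    const start = loca[gid];
    const end = loca[gid + 1];
    if (end - start < 10) continue; // empty slot: no glyph header
    if (view.getInt16(start) >= 0) continue; // numberOfContours >= 0: simple glyph
    let pos = start + 10; // skip numberOfContours and the bounding box
    let flags: number;
    do {
      flags = view.getUint16(pos);
      pending.push(view.getUint16(pos + 2)); // component glyph index
      pos += 4 + (flags & ARG_1_AND_2_ARE_WORDS ? 4 : 2);
      if (flags & WE_HAVE_A_SCALE) pos += 2;
      else if (flags & WE_HAVE_AN_X_AND_Y_SCALE) pos += 4;
      else if (flags & WE_HAVE_A_TWO_BY_TWO) pos += 8;
    } while (flags & MORE_COMPONENTS);
  }
  return retained;
}
```

Because each glyph is added to retained before its components are queued, a malformed font with a component cycle cannot make the loop run forever.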

Step 4: Build the new glyf table

Two passes over the glyph range:

  1. Calculate size — sum the byte lengths of retained glyphs, each 2-byte aligned.
  2. Copy data — copy retained glyph bytes into a new Uint8Array; unused glyph slots get a zero-length entry (their loca offset equals the next glyph's offset).

Crucially, glyph IDs are not renumbered. Every slot 0 … numGlyphs-1 is present in the output; unused slots simply have no data. This means the existing CIDToGIDMap /Identity encoding (CID = GID) continues to work without any changes to the PDF embedding pipeline.
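
The two passes can be sketched as follows. The function name buildSubsetGlyf and its return shape are assumptions made for illustration; loca is taken to be the decoded offset array from Step 2:

```typescript
// Hypothetical helper: rebuild glyf while keeping every glyph slot in place.
function buildSubsetGlyf(
  glyf: Uint8Array,
  loca: number[],
  retained: Set<number>,
): { glyf: Uint8Array; loca: number[] } {
  const numGlyphs = loca.length - 1;
  const align2 = (n: number): number => (n + 1) & ~1;

  // Pass 1: total size of the retained glyphs, each padded to an even length.
  let total = 0;
  for (let gid = 0; gid < numGlyphs; gid++) {
    if (retained.has(gid)) total += align2(loca[gid + 1] - loca[gid]);
  }

  // Pass 2: copy retained glyphs; dropped slots get equal start and end offsets.
  const out = new Uint8Array(total);
  const newLoca: number[] = [];
  let pos = 0;
  for (let gid = 0; gid < numGlyphs; gid++) {
    newLoca.push(pos);
    if (retained.has(gid)) {
      out.set(glyf.subarray(loca[gid], loca[gid + 1]), pos);
      pos += align2(loca[gid + 1] - loca[gid]);
    }
  }
  newLoca.push(pos); // final entry marks the end of the last glyph
  return { glyf: out, loca: newLoca };
}
```

The returned loca array still has numGlyphs + 1 entries, so a glyph ID means the same thing before and after subsetting.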

Step 5: Rebuild loca

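New offsets are written for all numGlyphs + 1 entries. The subsetter chooses short format when the total glyf size fits within 131,070 bytes and updates head.indexToLocFormat accordingly. Short entries store offset / 2 in 16 bits, so every offset must be even, which the 2-byte alignment in Step 4 guarantees.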
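
As a rough illustration, the format choice and encoding might be written like this. encodeLoca is a hypothetical name; its format field is the value that would be stored in head.indexToLocFormat:

```typescript
// Hypothetical helper: serialize absolute glyf offsets into a loca table.
function encodeLoca(offsets: number[]): { format: 0 | 1; bytes: Uint8Array } {
  const glyfSize = offsets[offsets.length - 1];
  // Short format needs even offsets no larger than 0xFFFF * 2 = 131,070.
  const useShort = glyfSize <= 131070 && offsets.every((o) => o % 2 === 0);
  const bytes = new Uint8Array(offsets.length * (useShort ? 2 : 4));
  const view = new DataView(bytes.buffer);
  offsets.forEach((o, i) => {
    if (useShort) view.setUint16(i * 2, o / 2);
    else view.setUint32(i * 4, o);
  });
  return { format: useShort ? 0 : 1, bytes };
}
```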

Step 6: Reassemble and checksum

The tables that the subsetter keeps (see "Tables preserved vs rebuilt") are written to the output font in a canonical order, with head, loca, and glyf replaced by their new versions. Each table's checksum is recalculated. The whole-file checkSumAdjustment field in head is first set to 0, the checksum of the entire font file is computed, and the field is then set to 0xB1B0AFBA minus that checksum, as required by the TrueType specification.
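
A sketch of the checksum arithmetic, with function names chosen here for illustration. The sfnt checksum is the sum of the data read as big-endian 32-bit words, modulo 2^32, with the final partial word zero-padded:

```typescript
// Sum of big-endian uint32 words, wrapping at 2^32; short tails are zero-padded.
function tableChecksum(data: Uint8Array): number {
  let sum = 0;
  for (let i = 0; i < data.length; i += 4) {
    const word =
      ((data[i] << 24) |
        ((data[i + 1] || 0) << 16) |
        ((data[i + 2] || 0) << 8) |
        (data[i + 3] || 0)) >>>
      0;
    sum = (sum + word) >>> 0;
  }
  return sum;
}

// Expects the assembled font with head.checkSumAdjustment already set to 0.
function computeCheckSumAdjustment(font: Uint8Array): number {
  return (0xb1b0afba - tableChecksum(font)) >>> 0;
}
```

The result is written as a uint32 at offset 8 of the head table. Once it is in place, the checksum of the whole file equals 0xB1B0AFBA.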


Tables preserved vs rebuilt

Action        Tables
Copied as-is  hhea, maxp, OS/2, name, cmap, post, cvt, fpgm, prep, hmtx, gasp, GDEF, GPOS, GSUB
Modified      head: checkSumAdjustment zeroed then recomputed; indexToLocFormat updated
Rebuilt       glyf: only retained glyph data; loca: new offsets

Tables not listed (e.g. kern, vhea, proprietary vendor tables) are dropped. PDF viewers do not need them to render an embedded font, so omitting them keeps the output compact.


The /ToUnicode CMap

After subsetting, buildSubsetCmap() constructs a PDF /ToUnicode CMap stream that maps each used CID back to its Unicode codepoint (CIDs are unchanged, because glyph IDs are not renumbered). This enables text extraction and copy-paste in PDF viewers, and the PDF/A conformance levels that mandate Unicode mapping (for example PDF/A-1a) depend on it. The CMap is emitted in beginbfchar / endbfchar sections of at most 100 entries each, the maximum a single section may hold.
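
The chunking can be sketched as below. This is not the signature of buildSubsetCmap(): the helper name buildBfcharSections and its Map<CID, codepoint> input are assumptions, and the surrounding CMap header and trailer are omitted:

```typescript
// Hypothetical helper: emit only the bfchar sections of a /ToUnicode CMap.
function buildBfcharSections(cidToCodePoint: Map<number, number>): string {
  const hex4 = (n: number): string =>
    ('0000' + n.toString(16).toUpperCase()).slice(-4);
  // Destination strings are UTF-16BE, so astral codepoints become surrogate pairs.
  const toUtf16Hex = (cp: number): string => {
    if (cp <= 0xffff) return hex4(cp);
    const v = cp - 0x10000;
    return hex4(0xd800 + (v >> 10)) + hex4(0xdc00 + (v & 0x3ff));
  };
  const entries = Array.from(cidToCodePoint.entries()).sort((a, b) => a[0] - b[0]);
  const lines: string[] = [];
  for (let i = 0; i < entries.length; i += 100) {
    const chunk = entries.slice(i, i + 100); // at most 100 entries per section
    lines.push(chunk.length + ' beginbfchar');
    for (const [cid, cp] of chunk) {
      lines.push('<' + hex4(cid) + '> <' + toUtf16Hex(cp) + '>');
    }
    lines.push('endbfchar');
  }
  return lines.join('\n');
}
```

Sorting by CID is not required for correctness; it is done here only so that the output is reproducible.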


Subset font name tagging

The PDF specification requires the name of a subsetted font to begin with a tag of exactly six uppercase letters followed by +, for example BCDEFG+NotoSansCJK. The tag is computed deterministically from the retained glyph IDs, hashed in the Set's iteration order:

// From src/assets/font/fontSubset.ts
export function computeSubsetTag(usedGlyphIds: Set<number>): string {
  let hash = 0;
  for (const gid of usedGlyphIds) {
    hash = ((hash << 5) - hash + gid) | 0;
  }
  const tag: string[] = [];
  let h = Math.abs(hash);
  for (let i = 0; i < 6; i++) {
    tag.push(String.fromCharCode(65 + (h % 26)));
    h = Math.floor(h / 26);
  }
  return tag.join('');
}
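
Because a Set iterates in insertion order and the hash is order-sensitive, the same glyphs inserted in a different order yield a different tag. The sketch below copies computeSubsetTag from the listing above so that it runs on its own, then adds a hypothetical helper, subsetFontName, that sorts the IDs before hashing. The library may or may not normalize the order this way:

```typescript
// Copied from the listing above so this sketch is self-contained.
function computeSubsetTag(usedGlyphIds: Set<number>): string {
  let hash = 0;
  for (const gid of usedGlyphIds) {
    hash = ((hash << 5) - hash + gid) | 0;
  }
  const tag: string[] = [];
  let h = Math.abs(hash);
  for (let i = 0; i < 6; i++) {
    tag.push(String.fromCharCode(65 + (h % 26)));
    h = Math.floor(h / 26);
  }
  return tag.join('');
}

// Hypothetical helper: sort first so the tag depends only on set membership.
function subsetFontName(baseName: string, usedGlyphIds: Iterable<number>): string {
  const sorted = Array.from(usedGlyphIds).sort((a, b) => a - b);
  return computeSubsetTag(new Set(sorted)) + '+' + baseName;
}
```

For example, subsetFontName('NotoSansCJK', [3, 1, 2]) and subsetFontName('NotoSansCJK', [1, 2, 3]) both return 'MNBAAA+NotoSansCJK'.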

OTF (CFF-based OpenType) fonts

OpenType fonts whose outlines are stored in a CFF table (instead of a glyf table) are identified by isOpenTypeCFF() from src/assets/font/otfDetect.ts. The embedding pipeline extracts the raw CFF data via findTable(data, 'CFF ') and embeds it as a /FontFile3 stream with /Subtype /CIDFontType0C.

CFF fonts are embedded without subsetting in the current release. The full CFF table is always included. Subsetting CFF data requires parsing the CFF binary format (INDEX structures, charstrings, subroutines), which is planned for a future release.
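
For orientation, a CFF-flavored OpenType font can be recognized by its 'OTTO' version tag and the presence of a 'CFF ' table. The sketch below is a stand-in with hypothetical names (looksLikeOpenTypeCFF, findTableSketch). The real isOpenTypeCFF() and findTable() may perform different or additional checks:

```typescript
// Hypothetical stand-in for findTable(): returns a view of the table, if present.
function findTableSketch(font: Uint8Array, tag: string): Uint8Array | undefined {
  if (font.byteLength < 12) return undefined;
  const view = new DataView(font.buffer, font.byteOffset, font.byteLength);
  const numTables = view.getUint16(4);
  if (12 + numTables * 16 > font.byteLength) return undefined;
  for (let i = 0; i < numTables; i++) {
    const rec = 12 + i * 16;
    const recTag = String.fromCharCode(
      view.getUint8(rec),
      view.getUint8(rec + 1),
      view.getUint8(rec + 2),
      view.getUint8(rec + 3),
    );
    if (recTag !== tag) continue;
    const offset = view.getUint32(rec + 8);
    const length = view.getUint32(rec + 12);
    if (offset + length > font.byteLength) return undefined;
    return font.subarray(offset, offset + length); // no copy
  }
  return undefined;
}

// Hypothetical stand-in for isOpenTypeCFF().
function looksLikeOpenTypeCFF(font: Uint8Array): boolean {
  if (font.byteLength < 12) return false;
  const view = new DataView(font.buffer, font.byteOffset, font.byteLength);
  if (view.getUint32(0) !== 0x4f54544f) return false; // sfVersion 'OTTO'
  return findTableSketch(font, 'CFF ') !== undefined;
}
```

The slice returned for 'CFF ' corresponds to the data that would go into the /FontFile3 stream.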


Size impact

Font                        Full size  5 characters  Reduction
Noto Sans CJK SC (Regular)  ~10 MB     ~50 KB        99.5%
Noto Serif (Latin)          ~550 KB    ~18 KB        96.7%
Roboto Regular              ~135 KB    ~12 KB        91.1%

API reference

import { createPdf } from 'modern-pdf-lib';

const doc = createPdf();

// Subsetting on by default
const font = await doc.embedFont(fontBytes);

// Opt out of subsetting (embeds the full font program)
const fullFont = await doc.embedFont(fontBytes, { subset: false });

// Embed an OTF/CFF font — CFF data extracted, no subsetting
const otfFont = await doc.embedFont(otfBytes);

The EmbedFontOptions interface:

interface EmbedFontOptions {
  /** Subset the font to used glyphs only. Default: true. */
  subset?: boolean;
  /** Custom name for the font resource. */
  customName?: string;
  /** Enable OpenType GSUB features (e.g. ligatures). Requires shaping WASM. */
  features?: string[];
}

Subsetting happens lazily at doc.save() time, after all text has been drawn, so the subsetter sees the complete set of used glyphs.
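
The reason for deferring can be illustrated with a small model. It is not the library's internal design: createGlyphTracker, recordText, and finalize are hypothetical names, the codepoint-to-glyph map stands in for a parsed cmap table, and keeping glyph 0 (.notdef) is an assumption:

```typescript
// Hypothetical model: collect glyph usage while drawing, resolve it at save time.
function createGlyphTracker(codePointToGid: Map<number, number>) {
  const used = new Set<number>([0]); // assumption: always keep .notdef
  return {
    // Called whenever text is drawn with this font.
    recordText(text: string): void {
      for (const ch of Array.from(text)) {
        const gid = codePointToGid.get(ch.codePointAt(0) as number);
        if (gid !== undefined) used.add(gid);
      }
    },
    // Called once, at save time, to get the final glyph set for the subsetter.
    finalize(): number[] {
      return Array.from(used).sort((a, b) => a - b);
    },
  };
}
```

If the font were subsetted at embed time instead, any glyph first drawn afterwards would be missing from the embedded font program.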
