Font Subsetting
modern-pdf-lib v0.15.1 — added in v0.12.0
Font subsetting is the process of stripping all glyph outlines that are not actually used in a document from the embedded font program. For a large CJK font this can be the difference between a 10 MB PDF and a 50 KB one for a document that draws only five characters. Subsetting is on by default and requires no configuration.
The subsetter lives in src/assets/font/ttfSubset.ts and operates entirely on
raw Uint8Array data — no Node.js Buffer, no filesystem access.
TrueType fonts begin with an offset table followed by a sequence of 16-byte
table records. The subsetter reads the number of tables from byte offset 4,
then iterates the records to build a Map<tag, { offset, length }>:
```
Offset table header (12 bytes)
  sfVersion  numTables  searchRange  entrySelector  rangeShift
Table directory (numTables × 16 bytes each)
  tag[4]  checkSum  offset  length
```
The four tables required for subsetting are head, loca, maxp, and glyf.
If any of these are missing the font is not a TrueType outline font and the
bytes are returned unchanged.
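The directory walk described above can be sketched as follows. This is a minimal illustration, not the code in ttfSubset.ts: the names `TableRecord`, `parseTableDirectory`, and `hasTrueTypeOutlines` are assumptions introduced here, and the sketch does no bounds checking beyond what `DataView` enforces.

```typescript
// Hypothetical record shape; the library's own type may differ.
interface TableRecord {
  offset: number;
  length: number;
}

// Walk the 16-byte table records that follow the 12-byte offset table.
function parseTableDirectory(data: Uint8Array): Map<string, TableRecord> {
  const view = new DataView(data.buffer, data.byteOffset, data.byteLength);
  const numTables = view.getUint16(4); // DataView reads big-endian by default
  const tables = new Map<string, TableRecord>();
  for (let i = 0; i < numTables; i++) {
    const rec = 12 + i * 16;
    const tag = String.fromCharCode(
      view.getUint8(rec),
      view.getUint8(rec + 1),
      view.getUint8(rec + 2),
      view.getUint8(rec + 3),
    );
    // Bytes rec+4..rec+7 hold the checkSum, which this sketch ignores.
    tables.set(tag, {
      offset: view.getUint32(rec + 8),
      length: view.getUint32(rec + 12),
    });
  }
  return tables;
}

// The four tables the text lists as required for subsetting.
const REQUIRED_TABLES = ['head', 'loca', 'maxp', 'glyf'];

function hasTrueTypeOutlines(tables: Map<string, TableRecord>): boolean {
  return REQUIRED_TABLES.every((tag) => tables.has(tag));
}
```

A caller would return the input bytes untouched whenever `hasTrueTypeOutlines` is false, matching the fallback described above.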
The loca table maps each glyph ID to a byte range within the glyf table.
It exists in two formats:
| head.indexToLocFormat | Entry size | Max glyf size |
|---|---|---|
| 0 (short) | 2 bytes (stored value × 2 = byte offset) | 131,070 bytes |
| 1 (long) | 4 bytes | 4 GB |
The subsetter reads numGlyphs + 1 offsets, producing a half-open interval
[locaOffsets[gid], locaOffsets[gid+1]) for each glyph.
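As an illustration of the two formats, here is a small sketch of decoding loca into byte offsets. The function name `readLocaOffsets` and its parameters are assumptions made for this example; only the format rules come from the TrueType specification.

```typescript
// Decode numGlyphs + 1 loca entries into byte offsets within glyf.
// indexToLocFormat: 0 = short (uint16, stored as offset / 2), 1 = long (uint32).
function readLocaOffsets(
  loca: Uint8Array,
  numGlyphs: number,
  indexToLocFormat: number,
): number[] {
  const view = new DataView(loca.buffer, loca.byteOffset, loca.byteLength);
  const offsets: number[] = [];
  for (let i = 0; i <= numGlyphs; i++) {
    offsets.push(
      indexToLocFormat === 0
        ? view.getUint16(i * 2) * 2 // short entries store half the offset
        : view.getUint32(i * 4),
    );
  }
  return offsets;
}

// Byte length of one glyph; 0 means the glyph has no outline data.
function glyphLength(offsets: number[], gid: number): number {
  return offsets[gid + 1] - offsets[gid];
}
```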
Simple glyphs contain their own contours. Composite glyphs (e.g. accented
letters built from a base character plus a diacritic) reference other glyphs via
component records. The subsetter walks the component records of every retained
composite glyph, following the MORE_COMPONENTS flag from one record to the next,
and recursively adds the referenced component IDs to the retained set:
```typescript
// composite glyph flag bits (TrueType spec)
const ARG_1_AND_2_ARE_WORDS = 0x0001;    // argument size
const WE_HAVE_A_SCALE = 0x0008;          // 1 × F2Dot14 transform
const MORE_COMPONENTS = 0x0020;          // another component follows
const WE_HAVE_AN_X_AND_Y_SCALE = 0x0040; // 2 × F2Dot14
const WE_HAVE_A_TWO_BY_TWO = 0x0080;     // 4 × F2Dot14 (matrix)
```

This resolution is transitive: a composite that references another composite will pull in all of its components as well.
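The transitive walk can be sketched as below. This is an assumption-laden illustration rather than the library's code: `collectGlyphClosure` is a hypothetical name, the sketch uses an explicit worklist instead of recursion, and it relies on the TrueType rule that a negative numberOfContours marks a composite glyph whose component records start after the 10-byte glyph header.

```typescript
// Flag bits from the TrueType composite glyph description.
const F_ARG_1_AND_2_ARE_WORDS = 0x0001;
const F_WE_HAVE_A_SCALE = 0x0008;
const F_MORE_COMPONENTS = 0x0020;
const F_WE_HAVE_AN_X_AND_Y_SCALE = 0x0040;
const F_WE_HAVE_A_TWO_BY_TWO = 0x0080;

// Hypothetical helper: expand `used` with every glyph reachable through
// composite component records. `loca` holds numGlyphs + 1 byte offsets.
function collectGlyphClosure(
  glyf: Uint8Array,
  loca: number[],
  used: Iterable<number>,
): Set<number> {
  const view = new DataView(glyf.buffer, glyf.byteOffset, glyf.byteLength);
  const retained = new Set<number>(used);
  const queue = Array.from(retained);
  while (queue.length > 0) {
    const gid = queue.pop() as number;
    const start = loca[gid];
    const end = loca[gid + 1];
    // Skip out-of-range IDs and glyphs too short to hold a header.
    if (start === undefined || end === undefined || end - start < 10) continue;
    if (view.getInt16(start) >= 0) continue; // simple glyph, no components
    let pos = start + 10; // skip numberOfContours and the bounding box
    let more = true;
    while (more && pos + 4 <= end) {
      const flags = view.getUint16(pos);
      const component = view.getUint16(pos + 2);
      if (!retained.has(component)) {
        retained.add(component);
        queue.push(component); // visit later so nested composites resolve
      }
      pos += 4 + (flags & F_ARG_1_AND_2_ARE_WORDS ? 4 : 2);
      if (flags & F_WE_HAVE_A_SCALE) pos += 2;
      else if (flags & F_WE_HAVE_AN_X_AND_Y_SCALE) pos += 4;
      else if (flags & F_WE_HAVE_A_TWO_BY_TWO) pos += 8;
      more = (flags & F_MORE_COMPONENTS) !== 0;
    }
  }
  return retained;
}
```

Because each glyph ID enters the queue at most once, the loop terminates even if a malformed font contains a component cycle.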
Two passes over the glyph range:
- Calculate size: sum the byte lengths of retained glyphs, each 2-byte aligned.
- Copy data: copy retained glyph bytes into a new Uint8Array; unused glyph slots get a zero-length entry (their loca offset equals the next glyph's offset).
Crucially, glyph IDs are not renumbered. Every slot 0 … numGlyphs-1
is present in the output; unused slots simply have no data. This means the
existing CIDToGIDMap /Identity encoding (CID = GID) continues to work
without any changes to the PDF embedding pipeline.
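The two passes might look like the sketch below. The function name `rebuildGlyf` and its return shape are assumptions for illustration; the point is that every glyph slot keeps its index while unused slots collapse to zero length.

```typescript
// Hypothetical sketch of the size pass and the copy pass.
// `loca` holds numGlyphs + 1 byte offsets into the original glyf table.
function rebuildGlyf(
  glyf: Uint8Array,
  loca: number[],
  retained: Set<number>,
): { glyf: Uint8Array; offsets: number[] } {
  const numGlyphs = loca.length - 1;
  const align2 = (n: number): number => (n + 1) & ~1;

  // Pass 1: total size of the retained glyphs, each padded to 2 bytes.
  let total = 0;
  for (let gid = 0; gid < numGlyphs; gid++) {
    if (retained.has(gid)) total += align2(loca[gid + 1] - loca[gid]);
  }

  // Pass 2: copy retained glyph data and record a new offset for every slot.
  const out = new Uint8Array(total); // zero-filled, so padding bytes are 0
  const offsets: number[] = [];
  let pos = 0;
  for (let gid = 0; gid < numGlyphs; gid++) {
    offsets.push(pos);
    if (retained.has(gid)) {
      out.set(glyf.subarray(loca[gid], loca[gid + 1]), pos);
      pos += align2(loca[gid + 1] - loca[gid]);
    }
    // Unused glyph: pos does not advance, so the slot has zero length.
  }
  offsets.push(pos); // final entry marks the end of the last glyph
  return { glyf: out, offsets };
}
```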
New offsets are written for all numGlyphs + 1 entries. The subsetter
chooses short format when the total glyf size fits within 131,070 bytes
and updates head.indexToLocFormat accordingly.
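A sketch of the format choice and serialization, under the assumption that the new offsets are already 2-byte aligned (as they are when every retained glyph is padded to an even length). `encodeLoca` is a hypothetical name, not a function from the library.

```typescript
// Serialize numGlyphs + 1 offsets, preferring the short format when it fits.
function encodeLoca(offsets: number[]): { loca: Uint8Array; indexToLocFormat: number } {
  const last = offsets.length > 0 ? offsets[offsets.length - 1] : 0;
  // Short entries store offset / 2 in a uint16, so the limit is 0xFFFF * 2 bytes.
  const useShort = last <= 0xffff * 2 && offsets.every((o) => o % 2 === 0);
  const loca = new Uint8Array(offsets.length * (useShort ? 2 : 4));
  const view = new DataView(loca.buffer);
  for (let i = 0; i < offsets.length; i++) {
    if (useShort) view.setUint16(i * 2, offsets[i] / 2);
    else view.setUint32(i * 4, offsets[i]);
  }
  return { loca, indexToLocFormat: useShort ? 0 : 1 };
}
```

The returned `indexToLocFormat` is the value that would then be written back into head.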
The tables listed below are copied into the output font in a canonical order;
head, loca, and glyf use their new versions. Each table's checksum is
recalculated, and the whole-file checkSumAdjustment field in head is set to
0xB1B0AFBA minus the checksum of the entire font (computed with the field
zeroed), as required by the TrueType specification.
| Action | Tables |
|---|---|
| Copied as-is | hhea, maxp, OS/2, name, cmap, post, cvt, fpgm, prep, hmtx, gasp, GDEF, GPOS, GSUB |
| Modified | head: checkSumAdjustment zeroed then recomputed; indexToLocFormat updated |
| Rebuilt | glyf: only retained glyph data; loca: new offsets |
Tables not listed (e.g. kern, vhea, proprietary vendor tables) are
dropped. This keeps the output font compact and well-formed for all
PDF viewers.
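The checksum arithmetic used during assembly can be sketched as follows. The helper names are assumptions; the rule itself (sum big-endian uint32 words modulo 2^32, with the data zero-padded to a multiple of 4 bytes, and the adjustment derived from the checksum of the whole file with the field zeroed) is how the TrueType and OpenType specifications state it.

```typescript
// Sum the data as big-endian uint32 words, modulo 2^32.
function tableChecksum(bytes: Uint8Array): number {
  // Copy into a buffer padded to a 4-byte boundary; padding bytes are zero.
  const padded = new Uint8Array((bytes.length + 3) & ~3);
  padded.set(bytes);
  const view = new DataView(padded.buffer);
  let sum = 0;
  for (let i = 0; i < padded.length; i += 4) {
    sum = (sum + view.getUint32(i)) >>> 0; // keep the running sum unsigned
  }
  return sum;
}

// `font` is the assembled file with head.checkSumAdjustment set to 0.
function computeCheckSumAdjustment(font: Uint8Array): number {
  return (0xb1b0afba - tableChecksum(font)) >>> 0;
}
```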
After subsetting, buildSubsetCmap() constructs a PDF /ToUnicode CMap
stream that maps each CID (equal to the glyph ID, since glyphs are not
renumbered) back to its Unicode codepoint. This enables text extraction and
copy-paste in PDF viewers and is required for PDF/A compliance. The CMap is
emitted in beginbfchar / endbfchar sections of at most 100 entries each
(the PDF specification limit).
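To make the sectioning concrete, here is a hedged sketch that emits only the bfchar sections of such a CMap. The function names and the `Map<number, number>` input are assumptions made for this example; the signature of buildSubsetCmap() is not shown in this document, and the sketch omits the CMap header and trailer.

```typescript
function toHex(value: number, width: number): string {
  return value.toString(16).toUpperCase().padStart(width, '0');
}

// Encode a code point as UTF-16BE hex, using a surrogate pair above U+FFFF.
function codePointToUtf16Hex(cp: number): string {
  if (cp <= 0xffff) return toHex(cp, 4);
  const v = cp - 0x10000;
  return toHex(0xd800 + (v >> 10), 4) + toHex(0xdc00 + (v & 0x3ff), 4);
}

// Hypothetical helper: emit beginbfchar/endbfchar sections, 100 entries max each.
function sketchBfcharSections(cidToUnicode: Map<number, number>): string {
  const entries = Array.from(cidToUnicode.entries()).sort((a, b) => a[0] - b[0]);
  const lines: string[] = [];
  for (let i = 0; i < entries.length; i += 100) {
    const chunk = entries.slice(i, i + 100);
    lines.push(`${chunk.length} beginbfchar`);
    for (const [cid, cp] of chunk) {
      lines.push(`<${toHex(cid, 4)}> <${codePointToUtf16Hex(cp)}>`);
    }
    lines.push('endbfchar');
  }
  return lines.join('\n');
}
```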
The PDF specification recommends prefixing the embedded font name with a
six-character uppercase tag followed by + to indicate that the font is
subsetted, for example BCDEFG+NotoSansCJK. The tag is computed
deterministically from the set of retained glyph IDs:
```typescript
// From src/assets/font/fontSubset.ts
export function computeSubsetTag(usedGlyphIds: Set<number>): string {
  let hash = 0;
  for (const gid of usedGlyphIds) {
    hash = ((hash << 5) - hash + gid) | 0;
  }
  const tag: string[] = [];
  let h = Math.abs(hash);
  for (let i = 0; i < 6; i++) {
    tag.push(String.fromCharCode(65 + (h % 26)));
    h = Math.floor(h / 26);
  }
  return tag.join('');
}
```

OpenType fonts whose outlines are stored in a CFF table (instead of a
glyf table) are identified by isOpenTypeCFF() from src/assets/font/otfDetect.ts.
The embedding pipeline extracts the raw CFF data via findTable(data, 'CFF ')
and embeds it as a /FontFile3 stream with /Subtype /CIDFontType0C.
CFF fonts are embedded without subsetting in the current release. The full CFF table is always included. Subsetting CFF data requires parsing the CFF binary format (INDEX structures, charstrings, subroutines), which is planned for a future release.
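For readers curious what such detection involves, the following is a rough sketch, not the contents of otfDetect.ts. The names `sketchFindTable` and `sketchIsOpenTypeCFF` are hypothetical stand-ins, and the check assumes the common case of an 'OTTO' signature plus a 'CFF ' directory entry.

```typescript
// Hypothetical stand-in for a table lookup: return the table's bytes or undefined.
function sketchFindTable(data: Uint8Array, tag: string): Uint8Array | undefined {
  if (data.length < 12) return undefined;
  const view = new DataView(data.buffer, data.byteOffset, data.byteLength);
  const numTables = view.getUint16(4);
  for (let i = 0; i < numTables; i++) {
    const rec = 12 + i * 16;
    if (rec + 16 > data.length) break; // truncated directory
    const recTag = String.fromCharCode(
      view.getUint8(rec),
      view.getUint8(rec + 1),
      view.getUint8(rec + 2),
      view.getUint8(rec + 3),
    );
    if (recTag === tag) {
      const offset = view.getUint32(rec + 8);
      const length = view.getUint32(rec + 12);
      return data.subarray(offset, offset + length); // a view, not a copy
    }
  }
  return undefined;
}

// Hypothetical stand-in for CFF detection: 'OTTO' signature and a 'CFF ' table.
function sketchIsOpenTypeCFF(data: Uint8Array): boolean {
  if (data.length < 12) return false;
  const view = new DataView(data.buffer, data.byteOffset, data.byteLength);
  const isOtto = view.getUint32(0) === 0x4f54544f; // ASCII 'OTTO'
  return isOtto && sketchFindTable(data, 'CFF ') !== undefined;
}
```

Note the trailing space in `'CFF '`: table tags are always four bytes long.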
| Font | Full size | 5 characters | Reduction |
|---|---|---|---|
| Noto Sans CJK SC (Regular) | ~10 MB | ~50 KB | 99.5% |
| Noto Serif (Latin) | ~550 KB | ~18 KB | 96.7% |
| Roboto Regular | ~135 KB | ~12 KB | 91.1% |
```typescript
import { createPdf } from 'modern-pdf-lib';

const doc = createPdf();

// Subsetting on by default
const font = await doc.embedFont(fontBytes);

// Opt out of subsetting (embeds the full font program)
const fullFont = await doc.embedFont(fontBytes, { subset: false });

// Embed an OTF/CFF font: CFF data extracted, no subsetting
const otfFont = await doc.embedFont(otfBytes);
```

The EmbedFontOptions interface:
```typescript
interface EmbedFontOptions {
  /** Subset the font to used glyphs only. Default: true. */
  subset?: boolean;
  /** Custom name for the font resource. */
  customName?: string;
  /** Enable OpenType GSUB features (e.g. ligatures). Requires shaping WASM. */
  features?: string[];
}
```

Subsetting happens lazily at doc.save() time, after all text has been
drawn, so the subsetter sees the complete set of used glyphs.