pdf-go API overview

This library implements a subset of PDF reading and writing in Go. Module path: github.com/lightningrag/pdf-go/pdf.

The following describes this repository’s behavior and ISO 32000–related conventions. It is not a cross-toolkit compatibility or migration guide for other PDF implementations.

Requirements

Go 1.20+
Standard library only at runtime (no third-party runtime dependencies)

Sample programs live under examples/ (inspect, readtext, readtextadvanced, diagtext, pages, text, merge, outlines, encryptcheck, links, docinfo, fileranges; see examples/README.md). Manifest-driven integration testing is described in docs/TESTING.md.

Opening and reading

import (
	"log"

	"github.com/lightningrag/pdf-go/pdf"
)

r, err := pdf.OpenFile("document.pdf", false)
if err != nil {
	log.Fatal(err)
}
// The current implementation buffers the whole file in memory,
// so there is no explicit Close to defer.

n, err := r.NumPages()
if err != nil {
	log.Fatal(err)
}
_ = n // number of pages in the flattened page tree
pages, err := r.Pages() // returns all flattened pages in one call
if err != nil {
	log.Fatal(err)
}
p, err := r.Page(0) // zero-based index
if err != nil {
	log.Fatal(err)
}
_ = pages[0] == p // Pages() and Page(i) agree on order
_ = p.Dict        // *generic.Dict page dictionary (merged inherited /MediaBox, etc.)

OpenFile(path, strict bool): open from a path; strict requests stricter validation (only partially enforced).
OpenFileWithPolicy(path, strict, policy): like OpenFile, but with pdf.AllowEncryptedOpen it may return a reader when the trailer has /Encrypt; the default pdf.RejectEncrypted still returns pdf.ErrEncrypted.
NewPdfReader(rs io.ReadSeeker, strict bool): open from a seekable stream (the whole stream is read into memory).
NewPdfReaderWithPolicy(rs, strict, policy): same as NewPdfReader, with optional AllowEncryptedOpen.
r.IsEncrypted(): true when the trailer had /Encrypt at open time (meaningful with AllowEncryptedOpen). No decryption; encrypted streams may still fail to read.
r.EncryptPermissions(): when /Encrypt is parseable, returns the raw /P value (indirect refs resolved); no decryption. ok is false when the document is not encrypted, /P is missing, or /P is non-numeric.
r.UserAccessPermissions(): /P as pdf.UserAccessPermissions (int32 alias), with DecodeMap() and Has(mask) for bit tests.
r.ArePermissionsValid(): permission validity signal; because there is no decryption or /Perms check, encrypted documents may report ok false.
r.Pages(): flattened page slice in order consistent with Page(0…n-1).
r.GetNumPages() / r.GetPage(i): aliases for NumPages() / Page(i) (zero-based i).
pdf.DecodePermissions(p int32): expands /P into a boolean map; constants pdf.UserPermPrint, etc.
Encrypted PDF (default): OpenFile / NewPdfReader return pdf.ErrEncrypted when /Encrypt is present and policy rejects.
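
To make the /P bit tests above concrete, here is a minimal, standard-library-only sketch of how a raw /P value can be expanded into named flags. The constant and function names (permPrint, decodeP, ...) are local to this example and are not the library's identifiers; the bit positions follow the user access permissions table in ISO 32000-1, where bits are numbered from 1.

```go
package main

import "fmt"

// Permission bits of the encryption dictionary's /P entry. The standard
// numbers bits from 1, so "bit 3" is 1<<2. Names are local to this sketch.
const (
	permPrint         int32 = 1 << 2  // bit 3: print
	permModify        int32 = 1 << 3  // bit 4: modify contents
	permExtract       int32 = 1 << 4  // bit 5: copy or extract text and graphics
	permAnnotate      int32 = 1 << 5  // bit 6: add or modify annotations
	permFillForms     int32 = 1 << 8  // bit 9: fill in form fields
	permAccessibility int32 = 1 << 9  // bit 10: extract for accessibility
	permAssemble      int32 = 1 << 10 // bit 11: assemble the document
	permPrintHighRes  int32 = 1 << 11 // bit 12: high-quality print
)

// decodeP expands a raw /P value into a map of named permissions.
func decodeP(p int32) map[string]bool {
	return map[string]bool{
		"print":         p&permPrint != 0,
		"modify":        p&permModify != 0,
		"extract":       p&permExtract != 0,
		"annotate":      p&permAnnotate != 0,
		"fillForms":     p&permFillForms != 0,
		"accessibility": p&permAccessibility != 0,
		"assemble":      p&permAssemble != 0,
		"printHighRes":  p&permPrintHighRes != 0,
	}
}

func main() {
	perms := decodeP(-44) // a /P value that allows printing and copying only partially
	fmt.Println(perms["print"], perms["modify"], perms["extract"]) // true false true
}
```

Because /P is a signed 32-bit integer whose reserved high bits are set, typical values are negative; the bitwise tests work unchanged on the two's-complement representation.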

Low-level objects

r.Trailer(): merged trailer dict (*generic.Dict).
r.RootObject() / r.Root(): catalog (/Catalog) dictionary.
r.GetObject(idNum, generation int): resolve indirect objects (generic.PDF).
r.Raw(): original file bytes.
r.TrailerSize(): merged trailer /Size (xref upper bound).
r.TrailerDocumentID(): trailer /ID (two PDF strings); ok false if missing or ill-typed.
r.CatalogVersion(): optional catalog /Version name (leading / stripped).
r.OpenActionPageIndex(): zero-based page from catalog /OpenAction, or -1 when not applicable.
r.PageLayout(), r.PageMode(), r.ViewerPreferences(), r.ViewerPreferencesInfo(): catalog keys per ISO 32000; typed accessors return (value, ok).
r.CatalogLang(), r.MarkInfo(), r.AcroForm(): catalog /Lang, /MarkInfo, /AcroForm.
r.XFA(): decoded /AcroForm / XFA streams as map[string][]byte.
r.CatalogThreads(), r.Threads(): catalog /Threads.
r.OpenAction(), r.OpenActionNamedDestination(): raw /OpenAction objects / named-dest text.
r.PageLabels() / r.PageLabel(i): ISO 32000 §12.4.2 page labels; default decimal "1"…"n" without /PageLabels.
r.HasXRefEntry, r.InUseObjectIDsGen0(), r.PagesRoot(): xref and unresolved page-tree helpers.
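
As an illustration of the page-label conventions mentioned above, the following sketch formats a single label from a numbering style (/S), a prefix (/P), and a numeric value, as described in ISO 32000 §12.4.2. The helper names (formatLabel, toRoman, toLetters) are hypothetical and do not describe how r.PageLabel is implemented.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// toRoman returns the upper-case roman numeral for n (n >= 1).
func toRoman(n int) string {
	vals := []int{1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1}
	syms := []string{"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"}
	var sb strings.Builder
	for i, v := range vals {
		for n >= v {
			sb.WriteString(syms[i])
			n -= v
		}
	}
	return sb.String()
}

// toLetters returns A..Z for 1..26, then AA..ZZ, then AAA..ZZZ, and so on.
func toLetters(n int) string {
	letter := string(rune('A' + (n-1)%26))
	return strings.Repeat(letter, (n-1)/26+1)
}

// formatLabel renders one label from a style, a prefix, and a value.
func formatLabel(style byte, prefix string, value int) string {
	switch style {
	case 'D':
		return prefix + strconv.Itoa(value)
	case 'R':
		return prefix + toRoman(value)
	case 'r':
		return prefix + strings.ToLower(toRoman(value))
	case 'A':
		return prefix + toLetters(value)
	case 'a':
		return prefix + strings.ToLower(toLetters(value))
	default:
		return prefix // no style: the label is the prefix alone
	}
}

func main() {
	// Without /PageLabels, page index i is labelled with decimal i+1.
	for i := 0; i < 3; i++ {
		fmt.Println(formatLabel('D', "", i+1))
	}
	fmt.Println(formatLabel('r', "", 4), formatLabel('A', "App-", 27))
}
```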

Document info and page geometry

r.DocumentInformation() / r.Metadata(): trailer /Info; use TextWithReader / TitleWithReader when values are indirect refs. Raw accessors return unresolved generic.Object.
pdf.ParsePDFDate, CreationDateTimeWithReader, ModDateTimeWithReader: PDF date strings → time.Time where applicable.
p.MediaBox() … p.BleedBox(): box arrays per ISO 32000 (inheritance from /Pages when flattening).
p.MediaBoxSize(), p.CropBoxSize(), … p.ArtBoxSize(), Effective*Box(), MediaBoxRect() …: derived sizes and RectangleObject helpers.
p.Resources(), p.Annots(), p.Annotations(), p.AnnotDicts(): page resources and annotations.
r.DestinationPageIndex, r.NamedDestinations(), r.NamedDestRoot(), r.EmbeddedFiles() / attachments APIs: destinations and embedded files (ISO 32000 §7.11).
pdf.LinkAnnotationTarget: URI / GoTo / GoToR / Launch extraction from /Subtype /Link.
p.Rotate(), p.RotationDegrees(), p.UserUnit(), p.UserUnitOrDefault(), p.PageNumber(): rotation and user unit.
p.ContentsObject() / p.GetContentsObject(), p.ContentsBytes() / p.GetContentsBytes(): raw vs decoded /Contents.
r.DecodeStream: apply the reader’s filter chain to a stream.
r.PDFHeaderPrefix(), r.PDFHeader(), r.PDFVersion(), pdf.Version: file header and module version.
r.XMPMetadata(), r.Outlines(), r.Outline(): XMP package bytes and outline tree.
r.GetPageNumber, r.GetDestinationPageNumber, r.FormFields(), r.FormTextFields(), r.PagesShowingField(), p.GetFonts(): forms and font listing helpers.
p.RotatePage, p.SetRotation, p.AddTransformation, p.ScalePage*, p.CompressContentStreams(), p.TransferRotationToContent(): page transforms and content compression.
p.ExtractText(): heuristic text scan (no full font/CMap pipeline).
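
For readers unfamiliar with PDF date strings, the sketch below shows one way to turn a value such as D:20240131120530+05'30' into a time.Time using only the standard library. It is an illustration under stated assumptions, not the code behind pdf.ParsePDFDate: the function name parsePDFDate is local to this example, and dates without a zone are treated as UTC here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parsePDFDate accepts D:YYYYMMDDHHmmSS followed by Z, +HH'mm' or -HH'mm'.
// Every field after the year is optional. Note that time.Date normalizes
// out-of-range fields instead of rejecting them.
func parsePDFDate(s string) (time.Time, error) {
	bad := fmt.Errorf("bad PDF date %q", s)
	body := strings.TrimPrefix(s, "D:")
	digits, zone := body, ""
	if i := strings.IndexAny(body, "+-Z"); i >= 0 {
		digits, zone = body[:i], body[i:]
	}
	if len(digits) < 4 || len(digits) > 14 || len(digits)%2 != 0 {
		return time.Time{}, bad
	}
	// Defaults for omitted fields: month 1, day 1, time 00:00:00.
	f := [6]int{0, 1, 1, 0, 0, 0}
	for i, pos := 0, 0; pos < len(digits); i++ {
		w := 2
		if i == 0 {
			w = 4 // the year is the only four-digit field
		}
		v, err := strconv.Atoi(digits[pos : pos+w])
		if err != nil {
			return time.Time{}, bad
		}
		f[i], pos = v, pos+w
	}
	loc := time.UTC
	if zone != "" && zone[0] != 'Z' {
		z := strings.ReplaceAll(zone[1:], "'", "")
		if len(z) != 2 && len(z) != 4 {
			return time.Time{}, bad
		}
		hh, err := strconv.Atoi(z[:2])
		if err != nil || hh < 0 {
			return time.Time{}, bad
		}
		mm := 0
		if len(z) == 4 {
			if mm, err = strconv.Atoi(z[2:]); err != nil || mm < 0 {
				return time.Time{}, bad
			}
		}
		secs := hh*3600 + mm*60
		if zone[0] == '-' {
			secs = -secs
		}
		loc = time.FixedZone(zone, secs)
	}
	return time.Date(f[0], time.Month(f[1]), f[2], f[3], f[4], f[5], 0, loc), nil
}

func main() {
	ts, err := parsePDFDate("D:20240131120530+05'30'")
	fmt.Println(ts.UTC().Format(time.RFC3339), err)
}
```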

`ExtractText` vs `ExtractTextAdvanced`

| | `Page.ExtractText()` | `Page.ExtractTextAdvanced(opts)` |
|---|---|---|
| Role | lightweight heuristic | full-stream text with ToUnicode / widths |
| Unicode | no ToUnicode / CMap | uses page resources and font maps |
| Content | literal scan | structured ops; optional Form XObject text |
| Layout | fixed heuristics | `Orientations`, `SpaceWidth`, visitor hooks |

Use ExtractTextAdvanced when you need better Unicode coverage; ExtractText for quick English or debugging. See examples/readtextadvanced, examples/readtext, examples/diagtext.

p.ExtractTextAdvanced(opts): see ExtractTextOptions in source for all fields.
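
To give a feel for what a "literal scan" means in practice, here is a rough sketch that pulls every literal string "(...)" out of a decoded content stream. It is only an assumption about the general shape of such a heuristic, not the implementation of ExtractText: the function name scanLiteralStrings is hypothetical, and the sketch ignores hex strings, octal escapes, fonts, encodings, and text positioning, which is exactly why this kind of scan falls short on non-Latin text.

```go
package main

import (
	"fmt"
	"strings"
)

// scanLiteralStrings collects every literal string "(...)" in a content
// stream. It honors backslash escapes and balanced nested parentheses.
// Octal escapes (\ddd) and line continuations are not handled here.
func scanLiteralStrings(content []byte) []string {
	var out []string
	var cur strings.Builder
	depth := 0
	for i := 0; i < len(content); i++ {
		c := content[i]
		if depth == 0 {
			if c == '(' {
				depth = 1
				cur.Reset()
			}
			continue
		}
		switch c {
		case '\\':
			if i+1 < len(content) {
				i++
				switch content[i] {
				case 'n':
					cur.WriteByte('\n')
				case 'r':
					cur.WriteByte('\r')
				case 't':
					cur.WriteByte('\t')
				default:
					cur.WriteByte(content[i]) // covers \( \) and \\
				}
			}
		case '(':
			depth++
			cur.WriteByte(c)
		case ')':
			depth--
			if depth == 0 {
				out = append(out, cur.String())
			} else {
				cur.WriteByte(c)
			}
		default:
			cur.WriteByte(c)
		}
	}
	return out
}

func main() {
	stream := []byte("BT /F1 12 Tf 72 720 Td (Hello \\(PDF\\)) Tj [(Wor) -20 (ld)] TJ ET")
	fmt.Println(strings.Join(scanLiteralStrings(stream), "|")) // Hello (PDF)|Wor|ld
}
```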

`pdf/generic`

Core syntax types: Null, Bool, Number, Name, *StringObj, Array, *Dict, *Stream, *Indirect.

generic.ReadObject, generic.WriteObject: parse / emit PDF syntax (writer subset).
d.Del(key): delete a dict entry (writer helpers).

`PdfWriter`

Hand-built object graph: call AddObject in dependency order; the last object must be the /Catalog, and the trailer /Root points to it. A trailer /ID is written by default; set OmitTrailerDocumentID for deterministic tests.
Append from reader: AppendPagesFromReader / AppendPagesFromReaderPageRange deep-clone pages; appending to a non-empty writer requires the catalog-at-end convention (see the Bytes() validation in the source).

Writer APIs include blank/add/insert/remove pages, catalog metadata, attachments, JavaScript name tree, /OpenAction, named destinations, URI links, page labels, threads, outlines, merge helpers, Remove*Writer utilities, Bytes() / Write() with ErrWriterNoObjects, ErrWriterLastNotCatalog, etc.
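
The catalog-at-end convention is easier to picture with the file layout in front of you. The sketch below serializes a list of object bodies as indirect objects 1..n, writes a classic cross-reference table, and points the trailer /Root at the last object. It uses only the standard library and a hypothetical buildPDF helper; it does not call PdfWriter and omits the trailer /ID.

```go
package main

import (
	"bytes"
	"fmt"
)

// buildPDF writes each body as an indirect object, then an xref table and a
// trailer whose /Root references the last object (the catalog).
func buildPDF(bodies []string) []byte {
	var buf bytes.Buffer
	buf.WriteString("%PDF-1.7\n")
	offsets := make([]int, len(bodies))
	for i, body := range bodies {
		offsets[i] = buf.Len() // byte offset of "N 0 obj"
		fmt.Fprintf(&buf, "%d 0 obj\n%s\nendobj\n", i+1, body)
	}
	xrefPos := buf.Len()
	fmt.Fprintf(&buf, "xref\n0 %d\n", len(bodies)+1)
	buf.WriteString("0000000000 65535 f \n") // object 0 heads the free list
	for _, off := range offsets {
		fmt.Fprintf(&buf, "%010d 00000 n \n", off) // each entry is 20 bytes
	}
	fmt.Fprintf(&buf, "trailer\n<< /Size %d /Root %d 0 R >>\nstartxref\n%d\n%%%%EOF\n",
		len(bodies)+1, len(bodies), xrefPos)
	return buf.Bytes()
}

func main() {
	// Dependency order: page tree (1), page (2), catalog (3) last.
	out := buildPDF([]string{
		"<< /Type /Pages /Kids [2 0 R] /Count 1 >>",
		"<< /Type /Page /Parent 1 0 R /MediaBox [0 0 595 842] /Resources << >> >>",
		"<< /Type /Catalog /Pages 1 0 R >>",
	})
	fmt.Printf("%d bytes\n", len(out))
}
```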

Page merge and content removal

MergePage, MergeTransformedPage, MergeScaledPage, MergeRotatedPage, MergeTranslatedPage, ReplaceContents: combine or replace page content streams.
RemoveAnnotations, RemoveLinks, RemoveImages, RemoveText: strip operators or annotations from a page.

`pdf/filters`

FlateDecode / FlateEncode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode, LZWDecode, DCTDecode, JPXDecode, JBIG2Decode, CCITTFaxDecode (image filters return compressed payload; raster decode not implemented), ApplyPostFlatePredictor.
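
Of the filters listed, RunLengthDecode is the simplest to show end to end. The following is a standalone sketch of the algorithm as ISO 32000 defines it (a length byte below 128 copies the next length+1 bytes, a length byte above 128 repeats the next byte 257-length times, and 128 marks end of data). The function name runLengthDecode and its error messages are local to this example and may differ from the package's own.

```go
package main

import (
	"errors"
	"fmt"
)

// runLengthDecode expands RunLengthDecode-encoded data.
func runLengthDecode(src []byte) ([]byte, error) {
	var out []byte
	for i := 0; i < len(src); {
		n := int(src[i])
		i++
		switch {
		case n == 128: // end-of-data marker
			return out, nil
		case n < 128: // copy the next n+1 bytes literally
			if i+n+1 > len(src) {
				return nil, errors.New("runlength: truncated literal run")
			}
			out = append(out, src[i:i+n+1]...)
			i += n + 1
		default: // repeat the next byte 257-n times
			if i >= len(src) {
				return nil, errors.New("runlength: truncated repeat run")
			}
			for k := 0; k < 257-n; k++ {
				out = append(out, src[i])
			}
			i++
		}
	}
	return out, nil // tolerate a missing end-of-data marker
}

func main() {
	out, err := runLengthDecode([]byte{2, 'a', 'b', 'c', 254, 'x', 128})
	fmt.Printf("%q %v\n", out, err) // "abcxxx" <nil>
}
```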

The reader’s decodeStreamData recognizes long and short filter names per ISO 32000.
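
The short names in question are the abbreviations that ISO 32000 permits for filters in inline images. A small lookup table is enough to normalize them, as in this sketch; the names filterAliases and canonicalFilter are hypothetical and say nothing about how decodeStreamData is structured internally.

```go
package main

import (
	"fmt"
	"strings"
)

// filterAliases maps the abbreviated inline-image filter names to the long
// names used in stream dictionaries.
var filterAliases = map[string]string{
	"AHx": "ASCIIHexDecode",
	"A85": "ASCII85Decode",
	"LZW": "LZWDecode",
	"Fl":  "FlateDecode",
	"RL":  "RunLengthDecode",
	"CCF": "CCITTFaxDecode",
	"DCT": "DCTDecode",
}

// canonicalFilter strips a leading slash and expands a short name if known.
// Unknown names are returned unchanged so the caller can report them.
func canonicalFilter(name string) string {
	name = strings.TrimPrefix(name, "/")
	if long, ok := filterAliases[name]; ok {
		return long
	}
	return name
}

func main() {
	for _, n := range []string{"/Fl", "AHx", "/FlateDecode", "/JPXDecode"} {
		fmt.Println(n, "->", canonicalFilter(n))
	}
}
```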

Errors

Exported error values include ErrEmptyFile, ErrParse, ErrNotPDF, ErrEncrypted, ErrNoContents, ErrPageSizeNotDefined, page-range merge errors, ErrObjectNotFound, etc.; ReadError / PdfReadError wrap richer failures.
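
Sentinel errors and wrapper types of this kind are usually consumed with errors.Is and errors.As. The sketch below shows the pattern with local stand-ins: errEncrypted, errParse, and readError are hypothetical substitutes for pdf.ErrEncrypted, pdf.ErrParse, and the library's wrapper types, and it assumes the wrappers expose the underlying error through Unwrap.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for exported sentinel errors.
var (
	errEncrypted = errors.New("pdf: document is encrypted")
	errParse     = errors.New("pdf: parse error")
)

// readError is a hypothetical wrapper that adds context to a cause.
type readError struct {
	Op  string
	Err error
}

func (e *readError) Error() string { return e.Op + ": " + e.Err.Error() }
func (e *readError) Unwrap() error { return e.Err }

// fail wraps a cause with the operation that hit it.
func fail(op string, cause error) error {
	return &readError{Op: op, Err: cause}
}

// classify shows the order of checks a caller might use: specific
// sentinels first, then the richer wrapper type.
func classify(err error) string {
	var re *readError
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, errEncrypted):
		return "encrypted"
	case errors.As(err, &re):
		return "read error during " + re.Op
	default:
		return "other"
	}
}

func main() {
	fmt.Println(classify(fail("open", errEncrypted)))
	fmt.Println(classify(fail("xref", errParse)))
	fmt.Println(classify(nil))
}
```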

Scope

No full encryption pipeline, JavaScript execution, or raster image decode; the ExtractText* methods are heuristics, not a layout engine. Appending with AppendPagesFromReader to a non-empty writer targets PDFs produced with this writer’s catalog-at-end layout.

`ImageType`

Bit flags for future image enumeration APIs (ImageTypeNone, ImageTypeXObjectImages, …).

Paper sizes

PaperA0…PaperA8, PaperC4: ISO 216 (and related) sizes in PDF points (1 point = 1/72 inch).
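
The conversion from ISO 216 millimetre dimensions to points is a single multiplication, sketched below. The helper mmToPoints is local to this example; whether the Paper* constants store rounded or unrounded values is not stated here, so check the source before comparing against them.

```go
package main

import "fmt"

// mmToPoints converts millimetres to PDF points
// (1 point = 1/72 inch, 1 inch = 25.4 mm).
func mmToPoints(mm float64) float64 {
	return mm * 72 / 25.4
}

func main() {
	w, h := mmToPoints(210), mmToPoints(297) // ISO 216 A4 is 210 x 297 mm
	fmt.Printf("A4: %.2f x %.2f pt\n", w, h)  // A4: 595.28 x 841.89 pt
}
```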

Text matrices

pdf.Mult, pdf.MatrixMultiply: multiply PDF text-space matrices.
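
PDF matrices are written as six numbers [a b c d e f] standing for a 3x3 matrix whose last column is fixed at (0, 0, 1), and points are row vectors multiplied on the left. The sketch below spells out the product under that convention. The type matrix and function mul are local to this example; the argument order of pdf.Mult is not assumed.

```go
package main

import "fmt"

// matrix holds [a b c d e f] for the 3x3 matrix
//
//	| a b 0 |
//	| c d 0 |
//	| e f 1 |
type matrix [6]float64

// mul returns m x n: the transform that applies m first, then n.
func mul(m, n matrix) matrix {
	return matrix{
		m[0]*n[0] + m[1]*n[2],
		m[0]*n[1] + m[1]*n[3],
		m[2]*n[0] + m[3]*n[2],
		m[2]*n[1] + m[3]*n[3],
		m[4]*n[0] + m[5]*n[2] + n[4],
		m[4]*n[1] + m[5]*n[3] + n[5],
	}
}

func main() {
	translate := matrix{1, 0, 0, 1, 10, 20}
	scale := matrix{2, 0, 0, 3, 0, 0}
	// Translate by (10, 20), then scale by (2, 3): the origin lands at (20, 60).
	fmt.Println(mul(translate, scale)) // [2 0 0 3 20 60]
}
```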

`Transformation`

NewTransformation, Translate, Scale, Rotate, Transform, ApplyOn, Matrix, CompressMatrix, ToCM(): 2D transforms and cm operator strings.
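
As a rough picture of what turning a transform into a cm operator string involves, the sketch below builds a rotation matrix and formats it as "a b c d e f cm". The names rotation and toCM are hypothetical, and rounding to four decimals is an arbitrary choice for this example; the library's ToCM may format numbers differently.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// matrix holds the six cm operands [a b c d e f].
type matrix [6]float64

// rotation returns a counterclockwise rotation by deg degrees.
func rotation(deg float64) matrix {
	s, c := math.Sincos(deg * math.Pi / 180)
	return matrix{c, s, -s, c, 0, 0}
}

// toCM formats the matrix as content-stream operands followed by "cm".
func toCM(m matrix) string {
	parts := make([]string, 0, 7)
	for _, v := range m {
		v = math.Round(v*1e4) / 1e4 // hide floating-point noise such as 6e-17
		if v == 0 {
			v = 0 // turn negative zero into plain zero so it prints as "0"
		}
		parts = append(parts, strconv.FormatFloat(v, 'f', -1, 64))
	}
	return strings.Join(append(parts, "cm"), " ")
}

func main() {
	fmt.Println(toCM(rotation(90))) // 0 1 -1 0 0 0 cm
	fmt.Println(toCM(matrix{1, 0, 0, 1, 10.5, -20}))
}
```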

`ConvertToInt`

pdf.ConvertToInt: big-endian signed int from the last size bytes (xref /W entries); used by readWInt.
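
Cross-reference stream fields have the widths given by /W, so each field is a short big-endian byte run. One common way to decode such a run, sketched here, is to left-pad the last size bytes to eight bytes and read them as a 64-bit integer. The helper beInt and its error handling are assumptions for illustration and may not match pdf.ConvertToInt in edge cases.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// beInt interprets the last size bytes of d as a big-endian integer.
// Shorter input is zero-extended on the left; with size 8 the top bit
// acts as a sign bit because the result is an int64.
func beInt(d []byte, size int) (int64, error) {
	if size < 0 || size > 8 {
		return 0, errors.New("beInt: size must be between 0 and 8")
	}
	tail := d
	if len(tail) > size {
		tail = tail[len(tail)-size:]
	}
	var buf [8]byte
	copy(buf[8-len(tail):], tail) // right-align inside the 8-byte buffer
	return int64(binary.BigEndian.Uint64(buf[:])), nil
}

func main() {
	// With /W [1 2 1], the middle field of an entry is two bytes wide.
	entry := []byte{0x01, 0x01, 0x00, 0x00}
	offset, err := beInt(entry[1:3], 2)
	fmt.Println(offset, err) // 256 <nil>
}
```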

Page ranges (Python-like slices)

pdf.ParsePageRange: strings such as :, 0:3, 5:, :-1, ::2, -1, -2.
PageRange.Indices, String, Equal, MergePageRanges, ValidPageRange, ParseFilenamePageRanges, PageRangeAll.
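
For readers who do not have Python's slice rules memorized, the sketch below parses strings of this form and expands them to zero-based page indices for a document of n pages, following the behaviour of Python's slice.indices. The names pageRange, parseRange, and describe are local to this example; the treatment of a bare integer (one page, with -1 meaning the last page) is an assumption that should be checked against the package tests.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pageRange mirrors a Python slice; nil bounds mean "omitted".
type pageRange struct {
	start, stop *int
	step        int
}

// parseRange accepts "start:stop:step" with any part omitted, or a bare index.
func parseRange(s string) (pageRange, error) {
	bad := fmt.Errorf("bad page range %q", s)
	parts := strings.Split(s, ":")
	if len(parts) > 3 {
		return pageRange{}, bad
	}
	nums := make([]*int, 3)
	for i, p := range parts {
		if p == "" {
			continue
		}
		v, err := strconv.Atoi(p)
		if err != nil {
			return pageRange{}, bad
		}
		nums[i] = &v
	}
	r := pageRange{start: nums[0], stop: nums[1], step: 1}
	if len(parts) == 1 { // bare index: exactly one page
		if nums[0] == nil {
			return pageRange{}, bad
		}
		if *nums[0] != -1 { // -1 keeps an open stop, meaning "last page"
			stop := *nums[0] + 1
			r.stop = &stop
		}
	}
	if nums[2] != nil {
		if *nums[2] == 0 {
			return pageRange{}, bad
		}
		r.step = *nums[2]
	}
	return r, nil
}

// indices expands the range for a document with n pages.
func (r pageRange) indices(n int) []int {
	lower, upper := 0, n
	if r.step < 0 {
		lower, upper = -1, n-1
	}
	clamp := func(p *int, def int) int {
		if p == nil {
			return def
		}
		v := *p
		if v < 0 {
			if v += n; v < lower {
				v = lower
			}
		} else if v > upper {
			v = upper
		}
		return v
	}
	start, stop := clamp(r.start, lower), clamp(r.stop, upper)
	if r.step < 0 {
		start, stop = clamp(r.start, upper), clamp(r.stop, lower)
	}
	var out []int
	for i := start; (r.step > 0 && i < stop) || (r.step < 0 && i > stop); i += r.step {
		out = append(out, i)
	}
	return out
}

// describe formats the expansion of spec for n pages, or the parse error.
func describe(spec string, n int) string {
	r, err := parseRange(spec)
	if err != nil {
		return "error: " + err.Error()
	}
	return fmt.Sprint(r.indices(n))
}

func main() {
	for _, spec := range []string{":", "0:3", "5:", ":-1", "::2", "-1", "-2"} {
		fmt.Printf("%-4s -> %s\n", spec, describe(spec, 5))
	}
}
```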

For exact semantics and edge cases, prefer go doc and the pdf package tests as the source of truth.
