This library implements a subset of PDF reading and writing in Go. Module path: github.com/lightningrag/pdf-go/pdf.
The following describes this repository’s behavior and ISO 32000–related conventions. It is not a cross-toolkit compatibility or migration guide for other PDF implementations.
- Go 1.20+
- Standard library only at runtime (no third-party runtime dependencies)
Sample programs live under examples/ (inspect, readtext, readtextadvanced, diagtext, pages, text, merge, outlines, encryptcheck, links, docinfo, fileranges; see examples/README.md). Manifest-driven integration testing is described in docs/TESTING.md.
```go
package main

import (
	"log"

	"github.com/lightningrag/pdf-go/pdf"
)

func main() {
	r, err := pdf.OpenFile("document.pdf", false)
	if err != nil {
		log.Fatal(err)
	}
	// The current implementation buffers the file in memory; there is no explicit Close.

	n, err := r.NumPages()
	if err != nil {
		log.Fatal(err)
	}
	_ = n // number of flattened pages

	pages, err := r.Pages() // returns all flattened pages in one call
	if err != nil {
		log.Fatal(err)
	}
	p, err := r.Page(0)
	if err != nil {
		log.Fatal(err)
	}
	_ = pages[0] == p
	_ = p.Dict // *generic.Dict page dictionary (merged inherited /MediaBox, etc.)
}
```

- OpenFile(path, strict bool): open from path; strict reserves stricter validation (partially enforced).
- OpenFileWithPolicy(path, strict, policy): like OpenFile, but pdf.AllowEncryptedOpen may return a reader when the trailer has /Encrypt; the default pdf.RejectEncrypted still returns pdf.ErrEncrypted.
- NewPdfReader(rs io.ReadSeeker, strict bool): open from a seekable stream (full buffer read).
- NewPdfReaderWithPolicy(rs, strict, policy): same with optional AllowEncryptedOpen.
- r.IsEncrypted(): true when the trailer had /Encrypt at open time (meaningful with AllowEncryptedOpen). No decryption; encrypted streams may still fail to read.
- r.EncryptPermissions(): when /Encrypt is parseable, returns raw /P (indirect refs resolved); no decryption. ok is false when not encrypted, missing /P, or non-numeric.
- r.UserAccessPermissions(): /P as pdf.UserAccessPermissions (int32 alias). DecodeMap(), Has(mask) for bit tests.
- r.ArePermissionsValid(): permission validity signal; without decryption and /Perms checks, encrypted docs may report ok as false.
- r.Pages(): flattened page slice in order consistent with Page(0…n-1).
- r.GetNumPages() / r.GetPage(i): aliases for NumPages() / Page(i) (zero-based i).
- pdf.DecodePermissions(p int32): expands /P into a boolean map; constants pdf.UserPermPrint, etc.
- Encrypted PDF (default): OpenFile / NewPdfReader return pdf.ErrEncrypted when /Encrypt is present and the policy rejects.
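The permission accessors above expose the raw /P integer and bit tests on it. As a rough, standalone illustration of how such a bit test works, the sketch below decodes /P using the bit positions from ISO 32000-1 Table 22. It does not call this library; the constant and function names (permPrint, has, decode) are hypothetical stand-ins, and the /P values are made up for the example.

```go
package main

import "fmt"

// Permission bits per ISO 32000-1, Table 22 (bit 1 is the lowest-order bit).
// These names are hypothetical; they are not the library's constants.
const (
	permPrint             int32 = 1 << 2  // bit 3
	permModify            int32 = 1 << 3  // bit 4
	permCopy              int32 = 1 << 4  // bit 5
	permAnnotate          int32 = 1 << 5  // bit 6
	permFillForms         int32 = 1 << 8  // bit 9
	permExtractAccessible int32 = 1 << 9  // bit 10
	permAssemble          int32 = 1 << 10 // bit 11
	permPrintHighQuality  int32 = 1 << 11 // bit 12
)

// has reports whether every bit in mask is set in the /P value p.
func has(p, mask int32) bool { return p&mask == mask }

// decode expands a /P value into a map of named permissions.
func decode(p int32) map[string]bool {
	return map[string]bool{
		"print":             has(p, permPrint),
		"modify":            has(p, permModify),
		"copy":              has(p, permCopy),
		"annotate":          has(p, permAnnotate),
		"fillForms":         has(p, permFillForms),
		"extractAccessible": has(p, permExtractAccessible),
		"assemble":          has(p, permAssemble),
		"printHighQuality":  has(p, permPrintHighQuality),
	}
}

func main() {
	const p int32 = -3900 // made-up /P value in which only the print bit is granted
	fmt.Println(has(p, permPrint), has(p, permCopy))
	fmt.Println(decode(p))
}
```

/P is stored as a signed 32-bit integer, so restrictive documents typically carry negative values; the test is a plain bitwise AND either way.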
- r.Trailer(): merged trailer dict (*generic.Dict).
- r.RootObject() / r.Root(): catalog (/Catalog) dictionary.
- r.GetObject(idNum, generation int): resolve indirect objects (generic.PDF).
- r.Raw(): original file bytes.
- r.TrailerSize(): merged trailer /Size (xref upper bound).
- r.TrailerDocumentID(): trailer /ID (two PDF strings); ok is false if missing or ill-typed.
- r.CatalogVersion(): optional catalog /Version name (leading / stripped).
- r.OpenActionPageIndex(): zero-based page from catalog /OpenAction, or -1 when not applicable.
- r.PageLayout(), r.PageMode(), r.ViewerPreferences(), r.ViewerPreferencesInfo(): catalog keys per ISO 32000; typed accessors return (value, ok).
- r.CatalogLang(), r.MarkInfo(), r.AcroForm(): catalog /Lang, /MarkInfo, /AcroForm.
- r.XFA(): decoded /AcroForm /XFA streams as map[string][]byte.
- r.CatalogThreads(), r.Threads(): catalog /Threads.
- r.OpenAction(), r.OpenActionNamedDestination(): raw /OpenAction objects / named-dest text.
- r.PageLabels() / r.PageLabel(i): ISO 32000 §12.4.2 page labels; default decimal "1"…"n" without /PageLabels.
- r.HasXRefEntry, r.InUseObjectIDsGen0(), r.PagesRoot(): xref and unresolved page-tree helpers.
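To make the page-label entry above more concrete, here is a minimal standalone sketch of the numbering styles that ISO 32000 §12.4.2 defines for the /S key (D, R, r, A, a). It is not the library's implementation: formatLabel, roman, and letters are hypothetical names, the /P prefix and /St start value are left out, and n is assumed to be at least 1.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// roman renders n (>= 1) as an uppercase Roman numeral.
func roman(n int) string {
	vals := []int{1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1}
	syms := []string{"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"}
	var sb strings.Builder
	for i, v := range vals {
		for n >= v {
			sb.WriteString(syms[i])
			n -= v
		}
	}
	return sb.String()
}

// letters renders n (>= 1) as A..Z, then AA..ZZ, then AAA..ZZZ, and so on.
func letters(n int) string {
	return strings.Repeat(string(rune('A'+(n-1)%26)), (n-1)/26+1)
}

// formatLabel renders the numeric portion of a page label for one /S style.
func formatLabel(style byte, n int) string {
	switch style {
	case 'D':
		return strconv.Itoa(n)
	case 'R':
		return roman(n)
	case 'r':
		return strings.ToLower(roman(n))
	case 'A':
		return letters(n)
	case 'a':
		return strings.ToLower(letters(n))
	}
	return "" // no recognized style: the label has no numeric portion
}

func main() {
	for _, s := range []byte("DRrAa") {
		fmt.Printf("%c: %s\n", s, formatLabel(s, 27))
	}
}
```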
- r.DocumentInformation() / r.Metadata(): trailer /Info; use TextWithReader / TitleWithReader when values are indirect refs. Raw accessors return unresolved generic.Object.
- pdf.ParsePDFDate, CreationDateTimeWithReader, ModDateTimeWithReader: PDF date strings → time.Time where applicable.
- p.MediaBox() … p.BleedBox(): box arrays per ISO 32000 (inheritance from /Pages when flattening).
- p.MediaBoxSize(), p.CropBoxSize(), … p.ArtBoxSize(), Effective*Box(), MediaBoxRect() …: derived sizes and RectangleObject helpers.
- p.Resources(), p.Annots(), p.Annotations(), p.AnnotDicts(): page resources and annotations.
- r.DestinationPageIndex, r.NamedDestinations(), r.NamedDestRoot(), r.EmbeddedFiles() / attachments APIs: destinations and embedded files (ISO 32000 §7.11).
- pdf.LinkAnnotationTarget: URI / GoTo / GoToR / Launch extraction from /Subtype /Link.
- p.Rotate(), p.RotationDegrees(), p.UserUnit(), p.UserUnitOrDefault(), p.PageNumber(): rotation and user unit.
- p.ContentsObject() / p.GetContentsObject(), p.ContentsBytes() / p.GetContentsBytes(): raw vs decoded /Contents.
- r.DecodeStream: apply the reader’s filter chain to a stream.
- r.PDFHeaderPrefix(), r.PDFHeader(), r.PDFVersion(), pdf.Version: file header and module version.
- r.XMPMetadata(), r.Outlines(), r.Outline(): XMP package bytes and outline tree.
- r.GetPageNumber, r.GetDestinationPageNumber, r.FormFields(), r.FormTextFields(), r.PagesShowingField(), p.GetFonts(): forms and font listing helpers.
- p.RotatePage, p.SetRotation, p.AddTransformation, p.ScalePage*, p.CompressContentStreams(), p.TransferRotationToContent(): page transforms and content compression.
- p.ExtractText(): heuristic text scan (no full font/CMap pipeline).
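The date helpers listed above turn PDF date strings into time.Time. For readers unfamiliar with the format (D:YYYYMMDDHHmmSSOHH'mm', with everything after the year optional), the following standalone sketch shows one way such a conversion could be written with the standard library. It is not the library's pdf.ParsePDFDate; the function name parsePDFDate and its leniency rules are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// parsePDFDate converts a PDF date string such as "D:20240131120000+05'30'"
// into a time.Time. Missing fields default to January 1st, 00:00:00 UTC.
func parsePDFDate(s string) (time.Time, error) {
	s = strings.TrimPrefix(s, "D:")

	// Split off the timezone designator (Z, +HH'mm', or -HH'mm'), if present.
	zone := ""
	if i := strings.IndexAny(s, "Z+-"); i >= 0 {
		zone = strings.ReplaceAll(s[i:], "'", "")
		s = s[:i]
	}

	// Pad omitted trailing fields with their default values.
	const defaults = "00000101000000" // YYYY MM DD HH mm SS
	if len(s) < 4 || len(s) > len(defaults) || len(s)%2 != 0 {
		return time.Time{}, errors.New("unsupported PDF date: " + s)
	}
	s += defaults[len(s):]

	if zone == "" || strings.HasPrefix(zone, "Z") {
		return time.Parse("20060102150405", s) // interpreted as UTC
	}
	if len(zone) == 3 { // offset given as hours only, e.g. "+05"
		zone += "00"
	}
	return time.Parse("20060102150405-0700", s+zone)
}

func main() {
	for _, in := range []string{"D:20240131120000+05'30'", "D:2024", "D:20240131120000Z"} {
		d, err := parsePDFDate(in)
		fmt.Println(in, "->", d.UTC().Format(time.RFC3339), err)
	}
}
```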
| | Page.ExtractText() | Page.ExtractTextAdvanced(opts) |
|---|---|---|
| Role | lightweight heuristic | full-stream text with ToUnicode / widths |
| Unicode | no ToUnicode / CMap | uses page resources and font maps |
| Content | literal scan | structured ops; optional Form XObject text |
| Layout | fixed heuristics | Orientations, SpaceWidth, visitor hooks |
Use ExtractTextAdvanced when you need better Unicode coverage; ExtractText for quick English or debugging. See examples/readtextadvanced, examples/readtext, examples/diagtext.
p.ExtractTextAdvanced(opts): see ExtractTextOptions in source for all fields.
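To give a feel for what a "literal scan" means in the comparison above, the sketch below pulls literal strings, the (…) operands of Tj and TJ, out of a decoded content stream. It is an illustrative approximation and not the code behind Page.ExtractText(); the function name literalStrings is hypothetical, and hex strings, octal escapes, comments, and font encodings are deliberately ignored.

```go
package main

import "fmt"

// literalStrings returns the contents of every PDF literal string "(...)"
// found in a decoded content stream, handling nested parentheses and the
// simple backslash escapes. It knows nothing about fonts or ToUnicode maps.
func literalStrings(content []byte) []string {
	var out []string
	var cur []byte
	depth := 0
	for i := 0; i < len(content); i++ {
		c := content[i]
		switch {
		case depth == 0 && c == '(':
			depth = 1
			cur = cur[:0]
		case depth == 0:
			// Outside any string: operators and numbers are skipped.
		case c == '\\' && i+1 < len(content):
			i++
			switch e := content[i]; e {
			case 'n':
				cur = append(cur, '\n')
			case 'r':
				cur = append(cur, '\r')
			case 't':
				cur = append(cur, '\t')
			default: // covers \( \) and \\; octal escapes are not handled
				cur = append(cur, e)
			}
		case c == '(':
			depth++
			cur = append(cur, c)
		case c == ')':
			depth--
			if depth == 0 {
				out = append(out, string(cur))
			} else {
				cur = append(cur, c)
			}
		default:
			cur = append(cur, c)
		}
	}
	return out
}

func main() {
	stream := []byte("BT /F1 12 Tf (Hello) Tj [(Wor) -20 (ld \\(x\\))] TJ ET")
	fmt.Printf("%q\n", literalStrings(stream))
}
```

A scan like this returns raw string bytes, which is why it tends to work for simple Latin text and falls short for fonts that need a ToUnicode map.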
Core syntax types: Null, Bool, Number, Name, *StringObj, Array, *Dict, *Stream, *Indirect.
- generic.ReadObject, generic.WriteObject: parse / emit PDF syntax (writer subset).
- d.Del(key): delete a dict entry (writer helpers).
- Hand-built object graph: AddObject in dependency order; last object must be /Catalog; trailer /Root points to it. Trailer /ID by default; set OmitTrailerDocumentID for deterministic tests.
- Append from reader: AppendPagesFromReader / AppendPagesFromReaderPageRange deep-clone pages; non-empty writers require catalog-at-end conventions (see source Bytes() validation).
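The catalog-at-end convention described above can be pictured with a small standalone serializer. The sketch below writes a list of object bodies as indirect objects 1..n, emits a classic xref table, and points the trailer /Root at the last object. It uses only the standard library and does not reflect this writer's internals; buildPDF and its return values are hypothetical, and no trailer /ID is written.

```go
package main

import (
	"bytes"
	"fmt"
)

// buildPDF serializes each body as indirect object i+1 (generation 0) and
// makes the trailer /Root reference the last object, which is assumed to be
// the catalog. It returns the file bytes, each object's byte offset, and the
// offset of the xref table.
func buildPDF(objects []string) (out []byte, offsets []int, xrefAt int) {
	var buf bytes.Buffer
	buf.WriteString("%PDF-1.7\n")
	for i, body := range objects {
		offsets = append(offsets, buf.Len())
		fmt.Fprintf(&buf, "%d 0 obj\n%s\nendobj\n", i+1, body)
	}

	// Classic xref table: one 20-byte entry per object, plus the free head.
	xrefAt = buf.Len()
	fmt.Fprintf(&buf, "xref\n0 %d\n", len(objects)+1)
	buf.WriteString("0000000000 65535 f \n")
	for _, off := range offsets {
		fmt.Fprintf(&buf, "%010d 00000 n \n", off)
	}

	fmt.Fprintf(&buf, "trailer\n<< /Size %d /Root %d 0 R >>\nstartxref\n%d\n%%%%EOF\n",
		len(objects)+1, len(objects), xrefAt)
	return buf.Bytes(), offsets, xrefAt
}

func main() {
	objects := []string{
		"<< /Type /Pages /Kids [2 0 R] /Count 1 >>",
		"<< /Type /Page /Parent 1 0 R /MediaBox [0 0 595 842] >>",
		"<< /Type /Catalog /Pages 1 0 R >>", // the catalog goes last
	}
	out, _, _ := buildPDF(objects)
	fmt.Printf("%s", out)
}
```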
Writer APIs include blank/add/insert/remove pages, catalog metadata, attachments, JavaScript name tree, /OpenAction, named destinations, URI links, page labels, threads, outlines, merge helpers, Remove*Writer utilities, Bytes() / Write() with ErrWriterNoObjects, ErrWriterLastNotCatalog, etc.
- MergePage, MergeTransformedPage, MergeScaledPage, MergeRotatedPage, MergeTranslatedPage, ReplaceContents: combine or replace page content streams.
- RemoveAnnotations, RemoveLinks, RemoveImages, RemoveText: strip operators or annotations from a page.
FlateDecode / FlateEncode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode, LZWDecode, DCTDecode, JPXDecode, JBIG2Decode, CCITTFaxDecode (image filters return compressed payload; raster decode not implemented), ApplyPostFlatePredictor.
The reader’s decodeStreamData recognizes long and short filter names per ISO 32000.
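As an illustration of the two ideas above, short versus long filter names and a simple decode filter, here is a standalone sketch. The abbreviation table follows the inline-image abbreviations in ISO 32000, and the decoder implements the RunLengthDecode rules from the same standard. The names canonicalFilterName and runLengthDecode are hypothetical and are not claimed to match the library's internals.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// shortNames maps abbreviated filter names to their full forms.
var shortNames = map[string]string{
	"AHx": "ASCIIHexDecode",
	"A85": "ASCII85Decode",
	"LZW": "LZWDecode",
	"Fl":  "FlateDecode",
	"RL":  "RunLengthDecode",
	"CCF": "CCITTFaxDecode",
	"DCT": "DCTDecode",
}

// canonicalFilterName expands an abbreviation; other names pass through.
func canonicalFilterName(name string) string {
	if long, ok := shortNames[name]; ok {
		return long
	}
	return name
}

// runLengthDecode decodes RunLengthDecode data: a length byte 0..127 copies
// the next length+1 bytes, 129..255 repeats the next byte 257-length times,
// and 128 marks end of data.
func runLengthDecode(src []byte) ([]byte, error) {
	var out []byte
	for i := 0; i < len(src); {
		n := int(src[i])
		i++
		switch {
		case n == 128:
			return out, nil
		case n < 128:
			end := i + n + 1
			if end > len(src) {
				return nil, errors.New("run-length: truncated literal run")
			}
			out = append(out, src[i:end]...)
			i = end
		default:
			if i >= len(src) {
				return nil, errors.New("run-length: truncated repeat run")
			}
			out = append(out, bytes.Repeat(src[i:i+1], 257-n)...)
			i++
		}
	}
	return out, nil // a missing end-of-data marker is tolerated
}

func main() {
	fmt.Println(canonicalFilterName("Fl"))
	out, err := runLengthDecode([]byte{2, 'a', 'b', 'c', 254, 'x', 128})
	fmt.Println(string(out), err)
}
```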
Exported error values include ErrEmptyFile, ErrParse, ErrNotPDF, ErrEncrypted, ErrNoContents, ErrPageSizeNotDefined, page-range merge errors, ErrObjectNotFound, etc.; ReadError / PdfReadError wrap richer failures.
No full encryption pipeline, JavaScript execution, or raster image decode; ExtractText* are heuristics, not a layout engine. Appending into a non-empty writer with AppendPagesFromReader is intended for PDFs produced with this writer’s catalog-at-end layout.
Bit flags for future image enumeration APIs (ImageTypeNone, ImageTypeXObjectImages, …).
PaperA0…PaperA8, PaperC4: ISO 216 (and related) sizes in PDF points (1 point = 1/72 inch).
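For reference, the relationship between millimetre paper dimensions and PDF points is simple arithmetic: one inch is 25.4 mm and one point is 1/72 inch. The sketch below works this out for A4 (210 mm × 297 mm). The helper names are hypothetical and the library's Paper* constants are not used here.

```go
package main

import (
	"fmt"
	"math"
)

// mmToPoints converts millimetres to PDF points (1 pt = 1/72 in, 1 in = 25.4 mm).
func mmToPoints(mm float64) float64 { return mm / 25.4 * 72 }

// roundedPoints returns the nearest whole number of points.
func roundedPoints(mm float64) int { return int(math.Round(mmToPoints(mm))) }

func main() {
	// A4 is 210 mm x 297 mm.
	fmt.Printf("A4: %.2f x %.2f pt (about %d x %d)\n",
		mmToPoints(210), mmToPoints(297), roundedPoints(210), roundedPoints(297))
}
```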
pdf.Mult, pdf.MatrixMultiply: multiply PDF text-space matrices.
NewTransformation, Translate, Scale, Rotate, Transform, ApplyOn, Matrix, CompressMatrix, ToCM(): 2D transforms and cm operator strings.
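The matrix helpers above operate on the six-number form [a b c d e f] that PDF uses for affine transforms. As a standalone sketch of the underlying arithmetic (not the library's Transformation type), the code below composes a translation with a scale, applies the result to a point, and formats it as a cm operator. The type matrix and the functions translate, scale, mul, apply, and cm are hypothetical names.

```go
package main

import "fmt"

// matrix holds [a b c d e f], standing for the 3x3 matrix
// [a b 0; c d 0; e f 1] applied to row vectors [x y 1].
type matrix [6]float64

func translate(tx, ty float64) matrix { return matrix{1, 0, 0, 1, tx, ty} }
func scale(sx, sy float64) matrix     { return matrix{sx, 0, 0, sy, 0, 0} }

// mul returns m x n: the transform that applies m first, then n.
func mul(m, n matrix) matrix {
	return matrix{
		m[0]*n[0] + m[1]*n[2],
		m[0]*n[1] + m[1]*n[3],
		m[2]*n[0] + m[3]*n[2],
		m[2]*n[1] + m[3]*n[3],
		m[4]*n[0] + m[5]*n[2] + n[4],
		m[4]*n[1] + m[5]*n[3] + n[5],
	}
}

// apply transforms the point (x, y).
func (m matrix) apply(x, y float64) (float64, float64) {
	return m[0]*x + m[2]*y + m[4], m[1]*x + m[3]*y + m[5]
}

// cm formats the matrix as a content-stream cm operator.
func (m matrix) cm() string {
	return fmt.Sprintf("%g %g %g %g %g %g cm", m[0], m[1], m[2], m[3], m[4], m[5])
}

func main() {
	m := mul(translate(10, 20), scale(2, 3)) // translate first, then scale
	x, y := m.apply(1, 1)
	fmt.Println(m.cm())
	fmt.Println(x, y)
}
```

Order matters: translating by (10, 20) and then scaling by (2, 3) moves (1, 1) to (22, 63), whereas scaling first would give (12, 23).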
pdf.ConvertToInt: big-endian signed int from the last size bytes (xref /W entries); used by readWInt.
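One plausible reading of the description above is: take the last size bytes of a field, left-pad them with zeros to eight bytes, and interpret the result as a big-endian 64-bit integer. The sketch below implements that reading with encoding/binary. It is an assumption about the behavior, not a copy of pdf.ConvertToInt, and the name bigEndianInt is hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// bigEndianInt interprets the last size bytes of b as a big-endian integer.
// The bytes are zero-padded on the left to 8 bytes, so the result can only
// be negative when a full 8-byte field has its top bit set.
func bigEndianInt(b []byte, size int) int64 {
	if size < 0 {
		size = 0
	}
	if size > len(b) {
		size = len(b)
	}
	if size > 8 {
		size = 8
	}
	var buf [8]byte
	copy(buf[8-size:], b[len(b)-size:])
	return int64(binary.BigEndian.Uint64(buf[:]))
}

func main() {
	// A hypothetical xref stream row with /W [1 2 1]: type, offset, generation.
	row := []byte{0x01, 0x02, 0x03, 0x00}
	fmt.Println(bigEndianInt(row[0:1], 1)) // entry type
	fmt.Println(bigEndianInt(row[1:3], 2)) // byte offset
	fmt.Println(bigEndianInt(row[3:4], 1)) // generation
}
```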
- pdf.ParsePageRange: strings such as :, 0:3, 5:, :-1, ::2, -1, -2.
- PageRange.Indices, String, Equal, MergePageRanges, ValidPageRange, ParseFilenamePageRanges, PageRangeAll.
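The example strings above look like Python-style slices (start:stop:step with negative indices counting from the end). Assuming those semantics, the standalone sketch below expands such an expression into zero-based page indices for a document of n pages. It is not the library's pdf.ParsePageRange; rangeIndices and bound are hypothetical names, and the library's exact rules may differ.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// bound is an optional integer: set is false when the field was left empty.
type bound struct {
	v   int
	set bool
}

func parseBound(s string) (bound, error) {
	if s == "" {
		return bound{}, nil
	}
	v, err := strconv.Atoi(s)
	return bound{v, true}, err
}

// rangeIndices expands "start:stop:step" (or a single index) into page
// indices for a document with n pages, following Python slice rules.
func rangeIndices(expr string, n int) ([]int, error) {
	parts := strings.Split(expr, ":")
	if len(parts) > 3 {
		return nil, errors.New("too many ':' in page range")
	}
	var b [3]bound
	for i, p := range parts {
		var err error
		if b[i], err = parseBound(strings.TrimSpace(p)); err != nil {
			return nil, err
		}
	}
	start, stop, step := b[0], b[1], 1
	if len(parts) == 1 { // single index k means k:k+1, and -1 means the last page
		if !start.set {
			return nil, errors.New("empty page range")
		}
		if start.v != -1 {
			stop = bound{start.v + 1, true}
		}
	}
	if b[2].set {
		step = b[2].v
	}
	if step == 0 {
		return nil, errors.New("step must not be zero")
	}
	lower, upper := 0, n
	if step < 0 {
		lower, upper = -1, n-1
	}
	clamp := func(x bound, def int) int {
		if !x.set {
			return def
		}
		v := x.v
		if v < 0 {
			if v += n; v < lower {
				v = lower
			}
		} else if v > upper {
			v = upper
		}
		return v
	}
	defStart, defStop := lower, upper
	if step < 0 {
		defStart, defStop = upper, lower
	}
	lo, hi := clamp(start, defStart), clamp(stop, defStop)
	var out []int
	for i := lo; (step > 0 && i < hi) || (step < 0 && i > hi); i += step {
		out = append(out, i)
	}
	return out, nil
}

func main() {
	for _, e := range []string{":", "0:3", "5:", ":-1", "::2", "-1", "-2"} {
		idx, err := rangeIndices(e, 5)
		fmt.Printf("%-4s -> %v %v\n", e, idx, err)
	}
}
```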
For exact semantics and edge cases, prefer go doc and the pdf package tests as the source of truth.