pdf-go API overview

This library implements a subset of PDF reading and writing in Go. Module path: github.com/lightningrag/pdf-go/pdf.

The following describes this repository’s behavior and ISO 32000–related conventions. It is not a cross-toolkit compatibility or migration guide for other PDF implementations.

Requirements

  • Go 1.20+
  • Standard library only at runtime (no third-party runtime dependencies)

Sample programs live under examples/ (inspect, readtext, readtextadvanced, diagtext, pages, text, merge, outlines, encryptcheck, links, docinfo, fileranges; see examples/README.md). Manifest-driven integration testing is described in docs/TESTING.md.

Opening and reading

```go
package main

import (
	"log"

	"github.com/lightningrag/pdf-go/pdf"
)

func main() {
	r, err := pdf.OpenFile("document.pdf", false)
	if err != nil {
		log.Fatal(err)
	}
	// No Close call: the current implementation buffers the file in memory.

	n, err := r.NumPages()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("pages: %d", n)

	pages, err := r.Pages() // returns all flattened pages in one call
	if err != nil {
		log.Fatal(err)
	}
	p, err := r.Page(0) // zero-based index
	if err != nil {
		log.Fatal(err)
	}
	_ = pages[0] == p
	_ = p.Dict // *generic.Dict page dictionary (merged inherited /MediaBox, etc.)
}
```

  • OpenFile(path, strict bool): open from path; strict reserves stricter validation (partially enforced).
  • OpenFileWithPolicy(path, strict, policy): like OpenFile, but pdf.AllowEncryptedOpen may return a reader when the trailer has /Encrypt; default pdf.RejectEncrypted still returns pdf.ErrEncrypted.
  • NewPdfReader(rs io.ReadSeeker, strict bool): open from a seekable stream (full buffer read).
  • NewPdfReaderWithPolicy(rs, strict, policy): same with optional AllowEncryptedOpen.
  • r.IsEncrypted(): true when the trailer had /Encrypt at open time (meaningful with AllowEncryptedOpen). No decryption; encrypted streams may still fail to read.
  • r.EncryptPermissions(): when /Encrypt is parseable, returns raw /P (indirect refs resolved); no decryption. ok false when not encrypted, missing /P, or non-numeric.
  • r.UserAccessPermissions(): /P as pdf.UserAccessPermissions (int32 alias). DecodeMap(), Has(mask) for bit tests.
  • r.ArePermissionsValid(): permission validity signal; without decryption and /Perms checks, encrypted docs may report ok false.
  • r.Pages(): flattened page slice in order consistent with Page(0…n-1).
  • r.GetNumPages() / r.GetPage(i): aliases for NumPages() / Page(i) (zero-based i).
  • pdf.DecodePermissions(p int32): expands /P into a boolean map; constants pdf.UserPermPrint, etc.
  • Encrypted PDF (default): OpenFile / NewPdfReader return pdf.ErrEncrypted when /Encrypt is present and policy rejects.
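
To make the /P handling above concrete, here is a minimal standalone sketch of expanding a permission word into a boolean map. It uses only the standard library and the bit positions from ISO 32000-1 Table 22; the constant names and map keys are hypothetical and are not claimed to match pdf.UserPermPrint or the output of pdf.DecodePermissions.

```go
package main

import "fmt"

// Permission bits from ISO 32000-1 Table 22. The spec numbers bits from 1,
// so "bit 3" is mask 1<<2. These names are illustrative only.
const (
	permPrint         int32 = 1 << 2  // bit 3
	permModify        int32 = 1 << 3  // bit 4
	permExtract       int32 = 1 << 4  // bit 5
	permAnnotate      int32 = 1 << 5  // bit 6
	permFillForms     int32 = 1 << 8  // bit 9
	permAccessibility int32 = 1 << 9  // bit 10
	permAssemble      int32 = 1 << 10 // bit 11
	permPrintHighRes  int32 = 1 << 11 // bit 12
)

// decodePermissions expands a raw /P value into named flags.
func decodePermissions(p int32) map[string]bool {
	return map[string]bool{
		"print":         p&permPrint != 0,
		"modify":        p&permModify != 0,
		"extract":       p&permExtract != 0,
		"annotate":      p&permAnnotate != 0,
		"fillForms":     p&permFillForms != 0,
		"accessibility": p&permAccessibility != 0,
		"assemble":      p&permAssemble != 0,
		"printHighRes":  p&permPrintHighRes != 0,
	}
}

func main() {
	perms := decodePermissions(-1852) // printing allowed, editing not
	fmt.Println(perms["print"], perms["modify"], perms["printHighRes"])
}
```

Because no decryption happens, a decoded map like this only reports what the document declares; it does not by itself tell you whether a stream can be read.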

Low-level objects

  • r.Trailer(): merged trailer dict (*generic.Dict).
  • r.RootObject() / r.Root(): catalog (/Catalog) dictionary.
  • r.GetObject(idNum, generation int): resolve indirect objects (generic.PDF).
  • r.Raw(): original file bytes.
  • r.TrailerSize(): merged trailer /Size (xref upper bound).
  • r.TrailerDocumentID(): trailer /ID (two PDF strings); ok false if missing or ill-typed.
  • r.CatalogVersion(): optional catalog /Version name (leading / stripped).
  • r.OpenActionPageIndex(): zero-based page from catalog /OpenAction, or -1 when not applicable.
  • r.PageLayout(), r.PageMode(), r.ViewerPreferences(), r.ViewerPreferencesInfo(): catalog keys per ISO 32000; typed accessors return (value, ok).
  • r.CatalogLang(), r.MarkInfo(), r.AcroForm(): catalog /Lang, /MarkInfo, /AcroForm.
  • r.XFA(): decoded /AcroForm / XFA streams as map[string][]byte.
  • r.CatalogThreads(), r.Threads(): catalog /Threads.
  • r.OpenAction(), r.OpenActionNamedDestination(): raw /OpenAction objects / named-dest text.
  • r.PageLabels() / r.PageLabel(i): ISO 32000 §12.4.2 page labels; defaults to decimal labels "1" … "n" when /PageLabels is absent.
  • r.HasXRefEntry, r.InUseObjectIDsGen0(), r.PagesRoot(): xref and unresolved page-tree helpers.
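
As an illustration of the page-label styles in ISO 32000 §12.4.2, the following standalone sketch formats one label from a style name, a prefix, and a numeric value. The function names are hypothetical and the code does not read a /PageLabels number tree; it only shows the numbering styles a label range can select.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// roman converts n (>= 1) to uppercase Roman numerals.
func roman(n int) string {
	vals := []int{1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1}
	syms := []string{"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"}
	var b strings.Builder
	for i, v := range vals {
		for n >= v {
			b.WriteString(syms[i])
			n -= v
		}
	}
	return b.String()
}

// letters yields A..Z for 1..26, AA..ZZ for 27..52, and so on (n >= 1).
func letters(n int) string {
	return strings.Repeat(string(rune('A'+(n-1)%26)), (n-1)/26+1)
}

// formatLabel renders one label. style is the /S name without the slash
// ("D", "R", "r", "A", "a"); any other value yields the prefix alone.
// For page i in a range that begins at page first with start value st,
// the caller would pass n = st + (i - first).
func formatLabel(style, prefix string, n int) string {
	switch style {
	case "D":
		return prefix + strconv.Itoa(n)
	case "R":
		return prefix + roman(n)
	case "r":
		return prefix + strings.ToLower(roman(n))
	case "A":
		return prefix + letters(n)
	case "a":
		return prefix + strings.ToLower(letters(n))
	}
	return prefix
}

func main() {
	fmt.Println(formatLabel("r", "", 4), formatLabel("D", "A-", 7), formatLabel("A", "", 28))
}
```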

Document info and page geometry

  • r.DocumentInformation() / r.Metadata(): trailer /Info; use TextWithReader / TitleWithReader when values are indirect refs. Raw accessors return unresolved generic.Object.
  • pdf.ParsePDFDate, CreationDateTimeWithReader, ModDateTimeWithReader: PDF date strings → time.Time where applicable.
  • p.MediaBox() … p.BleedBox(): box arrays per ISO 32000 (inheritance from /Pages when flattening).
  • p.MediaBoxSize(), p.CropBoxSize(), … p.ArtBoxSize(), Effective*Box(), MediaBoxRect() …: derived sizes and RectangleObject helpers.
  • p.Resources(), p.Annots(), p.Annotations(), p.AnnotDicts(): page resources and annotations.
  • r.DestinationPageIndex, r.NamedDestinations(), r.NamedDestRoot(), r.EmbeddedFiles() / attachments APIs: destinations and embedded files (ISO 32000 §7.11).
  • pdf.LinkAnnotationTarget: URI / GoTo / GoToR / Launch extraction from /Subtype /Link.
  • p.Rotate(), p.RotationDegrees(), p.UserUnit(), p.UserUnitOrDefault(), p.PageNumber(): rotation and user unit.
  • p.ContentsObject() / p.GetContentsObject(), p.ContentsBytes() / p.GetContentsBytes(): raw vs decoded /Contents.
  • r.DecodeStream: apply the reader’s filter chain to a stream.
  • r.PDFHeaderPrefix(), r.PDFHeader(), r.PDFVersion(), pdf.Version: file header and module version.
  • r.XMPMetadata(), r.Outlines(), r.Outline(): XMP package bytes and outline tree.
  • r.GetPageNumber, r.GetDestinationPageNumber, r.FormFields(), r.FormTextFields(), r.PagesShowingField(), p.GetFonts(): forms and font listing helpers.
  • p.RotatePage, p.SetRotation, p.AddTransformation, p.ScalePage*, p.CompressContentStreams(), p.TransferRotationToContent(): page transforms and content compression.
  • p.ExtractText(): heuristic text scan (no full font/CMap pipeline).
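
For readers unfamiliar with PDF date strings (D:YYYYMMDDHHmmSS followed by an optional zone such as +02'00' or Z), the sketch below shows one way such a string could be turned into a time.Time with the standard library. It is an independent illustration with hypothetical function names, not the body of pdf.ParsePDFDate, and it assumes trailing fields may be omitted in pairs.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseZone handles the suffix after the digit fields: "", "Z...",
// or "+HH'mm'" / "-HH'mm'" (apostrophes are optional in this sketch).
func parseZone(z string) (*time.Location, error) {
	if z == "" || z[0] == 'Z' {
		return time.UTC, nil
	}
	sign := 1
	switch z[0] {
	case '+':
	case '-':
		sign = -1
	default:
		return nil, fmt.Errorf("bad zone %q", z)
	}
	d := strings.ReplaceAll(z[1:], "'", "")
	if len(d) == 2 {
		d += "00" // minutes omitted
	}
	if len(d) != 4 {
		return nil, fmt.Errorf("bad zone %q", z)
	}
	hh, err1 := strconv.Atoi(d[:2])
	mm, err2 := strconv.Atoi(d[2:])
	if err1 != nil || err2 != nil {
		return nil, fmt.Errorf("bad zone %q", z)
	}
	return time.FixedZone("", sign*(hh*3600+mm*60)), nil
}

// parsePDFDate parses a PDF date string, with or without the "D:" prefix.
func parsePDFDate(s string) (time.Time, error) {
	s = strings.TrimPrefix(strings.TrimSpace(s), "D:")
	// Split the leading digit run (at most 14 digits) from the zone suffix.
	i := 0
	for i < len(s) && i < 14 && s[i] >= '0' && s[i] <= '9' {
		i++
	}
	digits, zone := s[:i], s[i:]
	if len(digits) < 4 || len(digits)%2 != 0 {
		return time.Time{}, fmt.Errorf("pdf date %q: bad field length", s)
	}
	// Omitted trailing fields default to month 01, day 01, 00:00:00.
	const defaults = "00000101000000"
	digits += defaults[len(digits):]
	loc, err := parseZone(zone)
	if err != nil {
		return time.Time{}, err
	}
	return time.ParseInLocation("20060102150405", digits, loc)
}

func main() {
	for _, s := range []string{"D:20240131120000+02'00'", "D:2024"} {
		got, err := parsePDFDate(s)
		fmt.Println(got.UTC(), err)
	}
}
```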

ExtractText vs ExtractTextAdvanced

|         | Page.ExtractText()    | Page.ExtractTextAdvanced(opts)             |
|---------|-----------------------|--------------------------------------------|
| Role    | lightweight heuristic | full-stream text with ToUnicode / widths   |
| Unicode | no ToUnicode / CMap   | uses page resources and font maps          |
| Content | literal scan          | structured ops; optional Form XObject text |
| Layout  | fixed heuristics      | Orientations, SpaceWidth, visitor hooks    |

Use ExtractTextAdvanced when you need better Unicode coverage, and ExtractText for quick extraction of simple English text or for debugging. See examples/readtextadvanced, examples/readtext, examples/diagtext.

  • p.ExtractTextAdvanced(opts): see ExtractTextOptions in source for all fields.
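
To give a feel for what a "literal scan" of a content stream can mean, here is a rough standalone sketch that collects parenthesized string literals and ignores fonts, encodings, and positioning entirely. It is an assumption about the general technique, not the implementation of ExtractText; the function name is hypothetical, and octal escapes and hex strings are deliberately left unhandled.

```go
package main

import "fmt"

// literalStrings returns the contents of every (...) string literal in a
// decoded content stream. Balanced nested parentheses and backslash escapes
// are honoured; octal escapes (\ddd) and <hex> strings are not handled.
func literalStrings(content []byte) []string {
	var out []string
	var cur []byte
	depth := 0
	for i := 0; i < len(content); i++ {
		c := content[i]
		switch {
		case depth == 0:
			if c == '(' { // start of a new literal
				depth = 1
				cur = cur[:0]
			}
		case c == '\\' && i+1 < len(content):
			i++
			switch e := content[i]; e {
			case 'n':
				cur = append(cur, '\n')
			case 'r':
				cur = append(cur, '\r')
			case 't':
				cur = append(cur, '\t')
			default:
				cur = append(cur, e) // covers \(, \) and \\
			}
		case c == '(':
			depth++
			cur = append(cur, c)
		case c == ')':
			depth--
			if depth == 0 {
				out = append(out, string(cur))
			} else {
				cur = append(cur, c)
			}
		default:
			cur = append(cur, c)
		}
	}
	return out
}

func main() {
	stream := []byte(`BT /F1 12 Tf (Hello) Tj (a \(nested\) (paren)) Tj ET`)
	fmt.Printf("%q\n", literalStrings(stream))
}
```

A scan like this returns raw string bytes, which is why fonts with custom encodings or CID fonts need the ToUnicode-aware path instead.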

pdf/generic

Core syntax types: Null, Bool, Number, Name, *StringObj, Array, *Dict, *Stream, *Indirect.

  • generic.ReadObject, generic.WriteObject: parse / emit PDF syntax (writer subset).
  • d.Del(key): delete a dict entry (writer helpers).

PdfWriter

  1. Hand-built object graph: AddObject in dependency order; last object must be /Catalog; trailer /Root points to it. Trailer /ID by default; set OmitTrailerDocumentID for deterministic tests.
  2. Append from reader: AppendPagesFromReader / AppendPagesFromReaderPageRange deep-clone pages; non-empty writers require catalog-at-end conventions (see source Bytes() validation).

Writer APIs include blank/add/insert/remove pages, catalog metadata, attachments, JavaScript name tree, /OpenAction, named destinations, URI links, page labels, threads, outlines, merge helpers, Remove*Writer utilities, Bytes() / Write() with ErrWriterNoObjects, ErrWriterLastNotCatalog, etc.

Page merge and content removal

  • MergePage, MergeTransformedPage, MergeScaledPage, MergeRotatedPage, MergeTranslatedPage, ReplaceContents: combine or replace page content streams.
  • RemoveAnnotations, RemoveLinks, RemoveImages, RemoveText: strip operators or annotations from a page.

pdf/filters

FlateDecode / FlateEncode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode, LZWDecode, DCTDecode, JPXDecode, JBIG2Decode, CCITTFaxDecode (image filters return compressed payload; raster decode not implemented), ApplyPostFlatePredictor.
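
FlateDecode data is a zlib stream (RFC 1950), so the round trip can be illustrated with the standard library alone. The sketch below is an assumption about how such a pair could look; the function names are hypothetical, it does not call into pdf/filters, and predictor post-processing (ApplyPostFlatePredictor) is out of scope.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// flateEncode compresses data into a zlib stream suitable for /FlateDecode.
func flateEncode(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil { // Close flushes the final block
		return nil, err
	}
	return buf.Bytes(), nil
}

// flateDecode inflates a zlib stream back to the original bytes.
func flateDecode(data []byte) ([]byte, error) {
	r, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	enc, err := flateEncode([]byte("BT (Hello) Tj ET"))
	if err != nil {
		panic(err)
	}
	dec, err := flateDecode(enc)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d compressed bytes -> %q\n", len(enc), dec)
}
```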

The reader’s decodeStreamData recognizes long and short filter names per ISO 32000.
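
The short names are the abbreviations ISO 32000-1 defines for inline images (Table 94). A minimal sketch of normalizing them to the long form might look like the following; the map and function names are hypothetical and this is not the code of decodeStreamData.

```go
package main

import (
	"fmt"
	"strings"
)

// shortFilterNames maps the abbreviated filter names to their long forms.
var shortFilterNames = map[string]string{
	"AHx": "ASCIIHexDecode",
	"A85": "ASCII85Decode",
	"LZW": "LZWDecode",
	"Fl":  "FlateDecode",
	"RL":  "RunLengthDecode",
	"CCF": "CCITTFaxDecode",
	"DCT": "DCTDecode",
}

// canonicalFilter strips a leading slash and expands a short name.
// Names that are not abbreviations are returned unchanged.
func canonicalFilter(name string) string {
	name = strings.TrimPrefix(name, "/")
	if long, ok := shortFilterNames[name]; ok {
		return long
	}
	return name
}

func main() {
	for _, n := range []string{"/Fl", "/FlateDecode", "/A85"} {
		fmt.Println(n, "->", canonicalFilter(n))
	}
}
```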

Errors

Exported error values include ErrEmptyFile, ErrParse, ErrNotPDF, ErrEncrypted, ErrNoContents, ErrPageSizeNotDefined, page-range merge errors, ErrObjectNotFound, etc.; ReadError / PdfReadError wrap richer failures.
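
Sentinel errors of this kind are usually tested with errors.Is, and richer wrapper types with errors.As. The sketch below demonstrates the pattern with stand-in definitions so that it runs without the module; errEncrypted, readError, its fields, and openStub are all hypothetical, and whether the library's ReadError implements Unwrap is an assumption to verify with go doc.

```go
package main

import (
	"errors"
	"fmt"
)

// errEncrypted stands in for a sentinel such as pdf.ErrEncrypted.
var errEncrypted = errors.New("pdf: encrypted document")

// readError stands in for a wrapper type; its fields are made up.
type readError struct {
	Op  string
	Err error
}

func (e *readError) Error() string { return e.Op + ": " + e.Err.Error() }
func (e *readError) Unwrap() error { return e.Err }

// openStub simulates an open call that fails on an encrypted file.
func openStub(path string) error {
	return &readError{Op: "open " + path, Err: errEncrypted}
}

// classify shows how a caller might branch on the failure kind.
func classify(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, errEncrypted):
		return "encrypted"
	default:
		return "other"
	}
}

func main() {
	err := openStub("document.pdf")
	fmt.Println(classify(err))

	var re *readError
	if errors.As(err, &re) {
		fmt.Println("failed operation:", re.Op)
	}
}
```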

Scope

No full encryption pipeline, JavaScript execution, or raster image decode; ExtractText* are heuristics, not a layout engine. Appending into a non-empty writer with AppendPagesFromReader targets PDFs produced with this writer’s catalog-at-end layout.

ImageType

Bit flags for future image enumeration APIs (ImageTypeNone, ImageTypeXObjectImages, …).

Paper sizes

PaperA0 … PaperA8, PaperC4: ISO 216 (and related) sizes in PDF points (1 point = 1/72 inch).
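
The relationship between ISO 216 millimetre sizes and PDF points can be worked out directly. The sketch below derives the A-series from A0 by repeated halving and converts to points; the function names are hypothetical, and the library's constants may be rounded differently from these unrounded values.

```go
package main

import "fmt"

// pointsPerMM converts millimetres to PDF points (72 points per inch,
// 25.4 mm per inch).
const pointsPerMM = 72.0 / 25.4

// aSeriesMM returns the ISO 216 size of An in millimetres (portrait).
func aSeriesMM(n int) (w, h int) {
	w, h = 841, 1189 // A0
	for i := 0; i < n; i++ {
		w, h = h/2, w // halve the long side, rounding down
	}
	return w, h
}

// aSeriesPoints returns the same size in PDF points.
func aSeriesPoints(n int) (w, h float64) {
	mmW, mmH := aSeriesMM(n)
	return float64(mmW) * pointsPerMM, float64(mmH) * pointsPerMM
}

func main() {
	for n := 0; n <= 8; n++ {
		mmW, mmH := aSeriesMM(n)
		w, h := aSeriesPoints(n)
		fmt.Printf("A%d: %d x %d mm = %.2f x %.2f pt\n", n, mmW, mmH, w, h)
	}
}
```

For A4 this gives 210 x 297 mm, or about 595.28 x 841.89 points.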

Text matrices

pdf.Mult, pdf.MatrixMultiply: multiply PDF text-space matrices.
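
PDF matrices are written as six numbers [a b c d e f] and act on row vectors, so the product of two of them has a fixed closed form. The sketch below spells it out with hypothetical names; it illustrates the arithmetic only and does not assert the signatures of pdf.Mult or pdf.MatrixMultiply.

```go
package main

import "fmt"

// matrix holds [a b c d e f], i.e. the 3x3 matrix
//   a b 0
//   c d 0
//   e f 1
type matrix [6]float64

// mult returns m x n: apply m first, then n (row-vector convention).
func mult(m, n matrix) matrix {
	return matrix{
		m[0]*n[0] + m[1]*n[2],
		m[0]*n[1] + m[1]*n[3],
		m[2]*n[0] + m[3]*n[2],
		m[2]*n[1] + m[3]*n[3],
		m[4]*n[0] + m[5]*n[2] + n[4],
		m[4]*n[1] + m[5]*n[3] + n[5],
	}
}

// apply transforms the point (x, y) by m.
func apply(m matrix, x, y float64) (float64, float64) {
	return x*m[0] + y*m[2] + m[4], x*m[1] + y*m[3] + m[5]
}

func main() {
	tm := matrix{1, 0, 0, 1, 100, 700} // text matrix: move to (100, 700)
	ctm := matrix{2, 0, 0, 2, 0, 0}    // current transformation: scale by 2
	combined := mult(tm, ctm)
	x, y := apply(combined, 0, 0)
	fmt.Println(combined, x, y)
}
```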

Transformation

NewTransformation, Translate, Scale, Rotate, Transform, ApplyOn, Matrix, CompressMatrix, ToCM(): 2D transforms and cm operator strings.
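
A chained transformation builder of this kind can be sketched as a value type that accumulates a matrix and formats it as a cm operator. The following is an illustrative standalone version with hypothetical lowercase names; rotation is omitted for brevity, and the library's own method semantics (for example the order in which chained calls compose) should be checked against go doc.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// transformation accumulates a PDF matrix [a b c d e f].
type transformation struct{ m [6]float64 }

func newTransformation() transformation {
	return transformation{[6]float64{1, 0, 0, 1, 0, 0}} // identity
}

// then appends n so that the existing transform is applied first.
func (t transformation) then(n [6]float64) transformation {
	m := t.m
	return transformation{[6]float64{
		m[0]*n[0] + m[1]*n[2],
		m[0]*n[1] + m[1]*n[3],
		m[2]*n[0] + m[3]*n[2],
		m[2]*n[1] + m[3]*n[3],
		m[4]*n[0] + m[5]*n[2] + n[4],
		m[4]*n[1] + m[5]*n[3] + n[5],
	}}
}

func (t transformation) translate(tx, ty float64) transformation {
	return t.then([6]float64{1, 0, 0, 1, tx, ty})
}

func (t transformation) scale(sx, sy float64) transformation {
	return t.then([6]float64{sx, 0, 0, sy, 0, 0})
}

// toCM formats the matrix as a content-stream cm operator.
func (t transformation) toCM() string {
	parts := make([]string, 0, 7)
	for _, v := range t.m {
		parts = append(parts, strconv.FormatFloat(v, 'f', -1, 64))
	}
	return strings.Join(append(parts, "cm"), " ")
}

func main() {
	fmt.Println(newTransformation().translate(10, 20).scale(2, 2).toCM())
	fmt.Println(newTransformation().scale(2, 2).translate(10, 20).toCM())
}
```

The two lines printed by main differ, which shows that the order of chained calls matters: translating and then scaling also scales the offset.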

ConvertToInt

pdf.ConvertToInt: big-endian signed int from the last size bytes (xref /W entries); used by readWInt.
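
One plausible reading of this description is sketched below: take the last size bytes, treat them as big-endian, and return the result as a signed 64-bit integer (equivalent to left-padding with zeros to 8 bytes). The function name, signature, and error behaviour are assumptions for illustration, not the library's definition.

```go
package main

import "fmt"

// convertToInt interprets the last size bytes of d as a big-endian integer.
// With fewer than 8 bytes the result is never negative, since the value is
// effectively zero-padded on the left.
func convertToInt(d []byte, size int) (int64, error) {
	if size < 0 || size > 8 {
		return 0, fmt.Errorf("convertToInt: invalid size %d", size)
	}
	if size < len(d) {
		d = d[len(d)-size:]
	}
	var v uint64
	for _, b := range d {
		v = v<<8 | uint64(b)
	}
	return int64(v), nil
}

func main() {
	// One xref-stream entry with /W [1 2 1]: type, offset, generation.
	entry := []byte{0x01, 0x00, 0x10, 0x00}
	typ, _ := convertToInt(entry[0:1], 1)
	off, _ := convertToInt(entry[1:3], 2)
	gen, _ := convertToInt(entry[3:4], 1)
	fmt.Println(typ, off, gen)
}
```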

Page ranges (Python-like slices)

  • pdf.ParsePageRange: strings such as :, 0:3, 5:, :-1, ::2, -1, -2.
  • PageRange.Indices, String, Equal, MergePageRanges, ValidPageRange, ParseFilenamePageRanges, PageRangeAll.
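
To show what Python-like slice semantics imply for page selection, here is a standalone sketch that expands an expression such as ::2 or :-1 into zero-based indices for a document with n pages. It follows the clamping rules of Python's slice.indices and assumes a bare integer selects a single page; the function name is hypothetical, and it is not the implementation of pdf.ParsePageRange or PageRange.Indices.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sliceIndices expands a Python-style slice expression over n pages.
func sliceIndices(expr string, n int) ([]int, error) {
	parts := strings.Split(strings.TrimSpace(expr), ":")
	if len(parts) > 3 {
		return nil, fmt.Errorf("page range %q: too many fields", expr)
	}
	var val [3]int   // start, stop, step
	var set [3]bool  // whether each field was given
	for i, p := range parts {
		if p == "" {
			continue
		}
		v, err := strconv.Atoi(p)
		if err != nil {
			return nil, fmt.Errorf("page range %q: %w", expr, err)
		}
		val[i], set[i] = v, true
	}
	if len(parts) == 1 {
		if !set[0] {
			return nil, fmt.Errorf("page range %q: empty", expr)
		}
		// A bare integer k selects the single page k. For -1 the stop stays
		// open, because an explicit stop of 0 would select nothing.
		if val[0] != -1 {
			val[1], set[1] = val[0]+1, true
		}
	}
	step := 1
	if set[2] {
		step = val[2]
	}
	if step == 0 {
		return nil, fmt.Errorf("page range %q: zero step", expr)
	}
	// Clamp bounds the way Python's slice.indices does.
	lower, upper := 0, n
	if step < 0 {
		lower, upper = -1, n-1
	}
	clamp := func(i, def int) int {
		if !set[i] {
			return def
		}
		v := val[i]
		if v < 0 {
			if v += n; v < lower {
				v = lower
			}
		} else if v > upper {
			v = upper
		}
		return v
	}
	start, stop := clamp(0, lower), clamp(1, upper)
	if step < 0 {
		start, stop = clamp(0, upper), clamp(1, lower)
	}
	var out []int
	for i := start; (step > 0 && i < stop) || (step < 0 && i > stop); i += step {
		out = append(out, i)
	}
	return out, nil
}

func main() {
	for _, e := range []string{":", "0:3", "5:", ":-1", "::2", "-1", "-2"} {
		idx, err := sliceIndices(e, 8)
		fmt.Printf("%-4s -> %v %v\n", e, idx, err)
	}
}
```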

For exact semantics and edge cases, prefer go doc and the pdf package tests as the source of truth.