Releases · scientist-labs/parsekit

20 Jun 02:53

0.2.0

d928ee2

v0.2.0 Latest


ParseKit provides native Ruby bindings, built with Magnus, for the parser-core Rust crate. It extracts text from PDFs, Office documents (DOCX, XLSX), and images (via Tesseract OCR), and is part of the ruby-nlp ecosystem.

Headline: 0.2.0 is a packaging and tooling release with no functional or API changes. The story here is the build pipeline: ParseKit now ships precompiled native gems, so on supported platforms you no longer need a Rust toolchain or a system MuPDF/Tesseract install just to run bundle install.

This release adopts the shared scientist-labs/rust-gem-release workflow, refreshes several Rust dependencies, and bumps CI actions. Functionally, parsing behavior is identical to 0.1.3.

What ships

RubyGems automatically selects the matching binary for your platform and Ruby ABI, and falls back to the source ruby gem when no precompiled match exists.

Platform	Install requirement
`x86_64-linux`	none (precompiled)
`arm64-darwin`	none (precompiled)
`ruby` (source)	Rust toolchain

Precompiled binaries cover the Ruby 3.1, 3.2, 3.3, and 3.4 ABIs (the gem supports Ruby >= 3.0; a 3.0 install falls back to the source gem).
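
As a rough illustration of the selection rule above, the sketch below predicts which gem variant an environment would receive. The constant and method names are hypothetical, the platform and ABI lists are copied from this release note, and the comparison only approximates what RubyGems does internally (it ignores libc variants such as musl, for example).

```ruby
require "rubygems"

# Lists copied from the release note; names are hypothetical.
PRECOMPILED_PLATFORMS = %w[x86_64-linux arm64-darwin].freeze
PRECOMPILED_RUBY_ABIS = %w[3.1 3.2 3.3 3.4].freeze

# Returns :precompiled when both the platform and the Ruby ABI appear in
# the lists above, and :source otherwise.
def expected_gem_variant(platform: Gem::Platform.local, ruby_version: RUBY_VERSION)
  platform = Gem::Platform.new(platform) unless platform.is_a?(Gem::Platform)
  abi = ruby_version.split(".").first(2).join(".")

  # Compare CPU and OS only, so "arm64-darwin23" still matches "arm64-darwin".
  platform_match = PRECOMPILED_PLATFORMS.any? do |name|
    target = Gem::Platform.new(name)
    target.cpu == platform.cpu && target.os == platform.os
  end

  platform_match && PRECOMPILED_RUBY_ABIS.include?(abi) ? :precompiled : :source
end
```

Calling expected_gem_variant with no arguments checks the running interpreter; passing platform: "aarch64-linux" shows the source fallback described in the next section.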

A note on aarch64-linux

aarch64-linux is not precompiled in this release. Its statically-linked MuPDF/Tesseract cross-build is heavy and is intentionally disabled for now, so ARM Linux installs fall back to the source ruby gem and will need a Rust toolchain plus the usual system build dependencies.
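
For hosts that end up on the source gem, a small preflight check can surface a missing Rust toolchain before the native build fails. This is a minimal sketch assuming a Unix-like host where the toolchain exposes a cargo executable on PATH; rust_toolchain_on_path? is a hypothetical helper, not part of ParseKit.

```ruby
# Hypothetical preflight helper; not part of ParseKit.
# Returns true when an executable named "cargo" exists in any PATH entry.
def rust_toolchain_on_path?(path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, "cargo")
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Example use in a setup script:
#   abort "Install a Rust toolchain before bundling" unless rust_toolchain_on_path?
```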

Changelog

Adopt scientist-labs/rust-gem-release@v0 for releases (#35, @xrl)
Add "type a version in a box" release dispatch via rust-gem-release@0.11.0 (#37)
Fix stale linux-cross-image-repo comments (#36, @xrl)
Bump tesseract-rs 0.1 → 0.2 (#31), calamine 0.34 → 0.35 (#32), quick-xml 0.39 → 0.40 (#33), mupdf 0.6 → 0.7 (#34)
Bump actions/checkout 4 → 6 (#30)

Full Changelog: 0.1.3...0.2.0

Contributors

xrl


24 Mar 13:01

github-actions

0.1.3

0a7cfda

v0.1.3

What's Changed

Features & Improvements

Centralized format detection: New format_detector.rs module in Rust and ParseKit.detect_format method in Ruby for consistent file format identification across the library
Simplified format dispatch: Refactored parser routing to reduce complexity in the core parser module
Refactored validation helpers: Cleaner, more maintainable validation logic
Improved Rust error handling: Refactored error handling code for better consistency and clarity
Automated release workflow: Added GitHub Actions workflow (release.yml) with workflow_dispatch for automated gem publishing
SUPPORTED_FORMATS constant: New module-level constant mapping format symbols to file extensions
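
These notes do not show the shape of SUPPORTED_FORMATS or the signature of ParseKit.detect_format, so the following is only a sketch of the idea: a map from format symbols to file extensions, inverted once for lookups. EXAMPLE_FORMATS, EXTENSION_TO_FORMAT, and format_for_extension are hypothetical names, and the entries are taken from the 0.1.0 format table further down, not from the gem's source.

```ruby
# Hypothetical stand-in for a symbol-to-extensions constant.
EXAMPLE_FORMATS = {
  pdf: %w[.pdf],
  docx: %w[.docx],
  pptx: %w[.pptx],
  xlsx: %w[.xlsx .xls],
  image: %w[.png .jpg .jpeg .tiff .bmp],
  json: %w[.json],
  xml: %w[.xml .html],
  text: %w[.txt .csv .md]
}.freeze

# Invert the map once so each lookup is a single hash access.
EXTENSION_TO_FORMAT = EXAMPLE_FORMATS
  .flat_map { |format, extensions| extensions.map { |ext| [ext, format] } }
  .to_h
  .freeze

# Returns the format symbol for a path, or :unknown when the
# extension is missing or not listed.
def format_for_extension(path)
  EXTENSION_TO_FORMAT.fetch(File.extname(path).downcase, :unknown)
end
```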

Dependency Updates

calamine (Excel parsing): 0.30 -> 0.34
mupdf (PDF parsing): 0.5 -> 0.6
quick-xml (XML parsing): 0.38 -> 0.39
zip (PPTX handling): 2.1 -> 8.0
actions/cache: 3 -> 5
actions/checkout: 5 -> 6

Testing

Added comprehensive test suites for format detection, dispatch logic, error consistency, and validation helpers (+1,441 lines of specs)
Fixed a failing spec related to the refactors

Housekeeping

Moved repository to scientist-labs organization
Cleaned up dead code in parser module
Streamlined Ruby-side documentation and class structure

Full Changelog: 0.1.0...0.1.3


06 Sep 02:34

cpetersen

0.1.0

e6b578f

ParseKit 0.1.0 Release 🚀

We're excited to announce the initial release of ParseKit, a Ruby document parsing toolkit that brings native performance to document text extraction with zero runtime dependencies!

🎯 What is ParseKit?

ParseKit is a native Ruby gem that extracts text from various document formats using high-performance Rust implementations. Unlike other Ruby document parsing solutions, ParseKit bundles all necessary libraries statically, making installation simple with no system dependencies required.

Key Features

📄 Multiple Format Support: PDF, DOCX, XLSX, XLS, PPTX, images (PNG, JPG, TIFF, BMP)
🔍 Built-in OCR: Bundled Tesseract for image text extraction
⚡ Native Performance: Rust-powered parsing with Ruby convenience
📦 Zero Dependencies: Everything bundled - just gem install and go
🛡️ Cross-Platform: Works on Linux, macOS, and Windows

📚 Supported Formats

Format	Extensions	Method	Features
PDF	.pdf	`parse_pdf`	Text extraction via MuPDF
Word	.docx	`parse_docx`	Office Open XML format
PowerPoint	.pptx	`parse_pptx`	Text from slides and notes
Excel	.xlsx, .xls	`parse_xlsx`	Both modern and legacy formats
Images	.png, .jpg, .jpeg, .tiff, .bmp	`ocr_image`	OCR via bundled Tesseract
JSON	.json	`parse_json`	Pretty-printed output
XML/HTML	.xml, .html	`parse_xml`	Text content extraction
Text	.txt, .csv, .md	`parse_text`	With encoding detection

🚀 Quick Start

Installation

gem install parsekit

Or add to your Gemfile:

gem 'parsekit', '~> 0.1.0'

Basic Usage

require 'parsekit'

# Simple file parsing - format auto-detected
text = ParseKit.parse_file("document.pdf")
puts text

# Parse binary data directly  
file_data = File.binread("document.docx")
text = ParseKit.parse_bytes(file_data.bytes)
puts text

# Use parser instance for multiple files
parser = ParseKit::Parser.new
text = parser.parse_file("report.xlsx")
puts text

Advanced Usage

# Direct format-specific parsing
parser = ParseKit::Parser.new

# PDF text extraction
pdf_data = File.read('document.pdf', mode: 'rb').bytes
pdf_text = parser.parse_pdf(pdf_data)

# OCR on images
image_data = File.read('scan.png', mode: 'rb').bytes
ocr_text = parser.ocr_image(image_data)

# PowerPoint presentations  
pptx_data = File.read('slides.pptx', mode: 'rb').bytes
slide_text = parser.parse_pptx(pptx_data)

# Excel spreadsheets
xlsx_data = File.read('data.xlsx', mode: 'rb').bytes
sheet_text = parser.parse_xlsx(xlsx_data)

Configuration Options

# Create parser with options
parser = ParseKit::Parser.new(
  strict_mode: true,
  max_size: 50 * 1024 * 1024,  # 50MB limit
  encoding: 'UTF-8'
)

# Or use the strict convenience method
parser = ParseKit::Parser.strict

🔧 Technical Architecture

ParseKit uses a hybrid Ruby/Rust architecture:

Ruby Layer: Provides convenient API and format detection
Rust Layer: High-performance parsing using:
- MuPDF for PDF text extraction (statically linked)
- tesseract-rs for OCR (bundled Tesseract by default)
- docx-rs for Word document parsing
- calamine for Excel parsing
- zip + quick-xml for PowerPoint parsing
- Magnus for Ruby-Rust FFI bindings

🎨 Zero-Dependency Philosophy

Traditional Ruby document parsing requires complex system dependencies:

Tesseract OCR installation
Poppler for PDF handling
ImageMagick for image processing
Platform-specific libraries

ParseKit eliminates all of this by bundling everything needed:

# Traditional approach
brew install tesseract poppler imagemagick  # macOS
sudo apt-get install tesseract-ocr poppler-utils imagemagick  # Ubuntu
gem install some-parsing-gem

# ParseKit approach  
gem install parsekit  # Done!

⚡ Performance Features

Native Rust Speed: Core parsing implemented in Rust for maximum performance
Statically Linked Libraries: MuPDF and Tesseract compiled with optimizations
Efficient Memory Usage: Streaming where possible, configurable size limits
Smart Format Detection: Magic number detection with filename fallback
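
To make "magic number detection with filename fallback" concrete, here is a minimal Ruby sketch of the technique. ParseKit's real detector is implemented in Rust and is not shown in these notes, so every name below (MAGIC_SIGNATURES, FALLBACK_EXTENSIONS, sniff_format) is hypothetical and the signature list is deliberately short.

```ruby
# Hypothetical signature table: leading bytes that identify a format.
MAGIC_SIGNATURES = {
  pdf: "%PDF".b,
  png: "\x89PNG\r\n\x1A\n".b,
  jpeg: "\xFF\xD8\xFF".b,
  zip: "PK\x03\x04".b # shared by DOCX, XLSX, and PPTX containers
}.freeze

# Hypothetical extension table used when the bytes are not decisive.
FALLBACK_EXTENSIONS = {
  ".pdf" => :pdf, ".png" => :png, ".jpg" => :jpeg, ".jpeg" => :jpeg,
  ".docx" => :docx, ".xlsx" => :xlsx, ".pptx" => :pptx, ".txt" => :text
}.freeze

ZIP_BASED = %i[docx xlsx pptx].freeze

# Prefer the magic number; use the filename to tell ZIP-based Office
# formats apart, or when no signature matches at all.
def sniff_format(data, filename = nil)
  header = data.b # binary copy, so byte comparisons are encoding-safe
  by_magic = MAGIC_SIGNATURES.find { |_format, magic| header.start_with?(magic) }&.first
  by_name = FALLBACK_EXTENSIONS[File.extname(filename.to_s).downcase]

  return by_magic if by_magic && by_magic != :zip
  return by_name if by_magic == :zip && ZIP_BASED.include?(by_name)

  by_magic || by_name || :unknown
end
```

The ZIP case shows why the fallback matters: DOCX, XLSX, and PPTX files all start with the same four bytes, so the extension is what distinguishes them in this sketch.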

🛠️ Advanced OCR Configuration

ParseKit includes two OCR modes for maximum flexibility:

Bundled Mode (Default)

# Zero setup - works out of the box
parser = ParseKit::Parser.new
text = parser.ocr_image(image_data)

System Mode (Advanced Users)

For developers who want faster gem installation and already have Tesseract:

# Install without bundled features
gem install parsekit -- --no-default-features

# For development
rake compile CARGO_FEATURES=""  # Disables bundled-tesseract

🧪 Real-World Examples

Batch Document Processing

require 'parsekit'

parser = ParseKit::Parser.new
documents_dir = "path/to/documents"

Dir.glob("#{documents_dir}/*.{pdf,docx,xlsx,pptx,png,jpg}").each do |file|
  begin
    text = parser.parse_file(file)
    
    # Process extracted text
    puts "#{file}: #{text.length} characters extracted"
    
    # Save to text file
    output_file = file.gsub(/\.[^.]+$/, '.txt')
    File.write(output_file, text)
  rescue => e
    puts "Error processing #{file}: #{e.message}"
  end
end

OCR Pipeline

require 'parsekit'

def extract_text_from_images(image_dir)
  parser = ParseKit::Parser.new
  results = {}
  
  Dir.glob("#{image_dir}/*.{png,jpg,jpeg,tiff,bmp}").each do |image_file|
    puts "Processing #{image_file}..."
    
    image_data = File.read(image_file, mode: 'rb').bytes
    text = parser.ocr_image(image_data)
    
    results[image_file] = {
      text: text,
      length: text.length,
      processed_at: Time.now
    }
  end
  
  results
end

# Process all images
results = extract_text_from_images("scanned_documents/")
results.each do |file, data|
  puts "#{file}: #{data[:length]} chars - #{data[:text][0..100]}..."
end

Document Classification

require 'fileutils'  # needed for FileUtils.mkdir_p and FileUtils.mv below
require 'parsekit'

class DocumentClassifier
  def initialize
    @parser = ParseKit::Parser.new
  end
  
  def classify(file_path)
    text = @parser.parse_file(file_path)
    
    case text
    when /\b(invoice|bill|payment)\b/i
      :invoice
    when /\b(resume|curriculum vitae|cv)\b/i  
      :resume
    when /\b(contract|agreement|terms)\b/i
      :contract
    when /\b(report|analysis|summary)\b/i
      :report
    else
      :unknown
    end
  end
end

classifier = DocumentClassifier.new

Dir.glob("uploads/*.{pdf,docx}").each do |file|
  category = classifier.classify(file)
  puts "#{file} -> #{category}"
  
  # Move to appropriate directory
  FileUtils.mkdir_p("sorted/#{category}")
  FileUtils.mv(file, "sorted/#{category}/")
end

🔄 Migration Guide

Coming from other Ruby document parsing gems? Here's how ParseKit compares:

From pdf-reader

# Before (pdf-reader)
require 'pdf-reader'
text = PDF::Reader.new('document.pdf').pages.map(&:text).join

# After (ParseKit)  
require 'parsekit'
text = ParseKit.parse_file('document.pdf')

From docx gem

# Before (docx)
require 'docx'
doc = Docx::Document.open('document.docx')
text = doc.paragraphs.map(&:text).join

# After (ParseKit)
require 'parsekit' 
text = ParseKit.parse_file('document.docx')

From RTesseract

# Before (RTesseract - requires system tesseract)
require 'rtesseract'  
text = RTesseract.new('image.png').to_s

# After (ParseKit - zero dependencies)
require 'parsekit'
text = ParseKit.parse_file('image.png')

📦 Installation Requirements

Ruby: >= 3.0.0
Rust: Automatically handled during gem installation
System Dependencies: None! Everything is bundled

🙏 Acknowledgments

ParseKit builds on excellent Rust crates:

mupdf for PDF parsing
tesseract-rs for OCR
docx-rs for Word documents
calamine for Excel files
quick-xml and zip for PowerPoint
magnus for Ruby-Rust integration

🚀 Ready to Parse?

Install ParseKit 0.1.0 today and start extracting text from any document format with zero hassle:

gem install parsekit

No system dependencies. No complex setup. Just install and parse! 🎯✨

For documentation, examples, and source code, visit: github.com/cpetersen/parsekit
