Releases: scientist-labs/parsekit

v0.2.0

20 Jun 02:53

ParseKit provides native Ruby bindings for the parser-core Rust crate, extracting text from PDFs, Office documents (DOCX, XLSX), and images (via Tesseract OCR) through Magnus. It is part of the ruby-nlp ecosystem.

Headline: 0.2.0 is a packaging and tooling release with no functional or API changes. The main change is the build pipeline: ParseKit now ships precompiled native gems, so on supported platforms you no longer need a Rust toolchain or a system MuPDF/Tesseract install just to run bundle install.

This release adopts the shared scientist-labs/rust-gem-release workflow, refreshes several Rust dependencies, and bumps CI actions. Functionally, parsing behavior is identical to 0.1.3.

What ships

RubyGems automatically selects the matching binary for your platform and Ruby ABI, and falls back to the source ruby gem when no precompiled match exists.

| Platform | Install requirement |
|---|---|
| x86_64-linux | none (precompiled) |
| arm64-darwin | none (precompiled) |
| ruby (source) | Rust toolchain |

Precompiled binaries cover the Ruby 3.1, 3.2, 3.3, and 3.4 ABIs (the gem supports Ruby >= 3.0; a 3.0 install falls back to the source gem).
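
The selection rules above can be approximated in a few lines of Ruby. This is a minimal sketch and not part of ParseKit: the helper name `precompiled_expected?` and both constants are hypothetical, filled in from the platform table and the ABI list above. It only compares the CPU/OS prefix of the platform string, so it does not model details such as musl-based Linux. RubyGems remains the authority on which gem actually gets installed.

```ruby
require "rubygems"

# Hypothetical constants, copied from the platform table and ABI list above.
PRECOMPILED_PLATFORMS = %w[x86_64-linux arm64-darwin].freeze
PRECOMPILED_RUBY_ABIS = %w[3.1 3.2 3.3 3.4].freeze

# Returns true when a precompiled gem is expected for the given platform
# string and Ruby version. False means the source gem (and therefore a
# Rust toolchain) would be needed.
def precompiled_expected?(platform: Gem::Platform.local.to_s, ruby_version: RUBY_VERSION)
  cpu_os = platform.split("-").first(2).join("-")     # "arm64-darwin-23" -> "arm64-darwin"
  abi    = ruby_version.split(".").first(2).join(".") # "3.3.1" -> "3.3"
  PRECOMPILED_PLATFORMS.include?(cpu_os) && PRECOMPILED_RUBY_ABIS.include?(abi)
end

if precompiled_expected?
  puts "precompiled gem expected"
else
  puts "source build expected (Rust toolchain required)"
end
```

Both keyword arguments default to the running interpreter, so they can be overridden to check a deployment target from a different machine.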

A note on aarch64-linux

aarch64-linux is not precompiled in this release. Its statically-linked MuPDF/Tesseract cross-build is heavy and is intentionally disabled for now, so ARM Linux installs fall back to the source ruby gem and will need a Rust toolchain plus the usual system build dependencies.

Changelog

  • Adopt scientist-labs/rust-gem-release@v0 for releases (#35, @xrl)
  • Add "type a version in a box" release dispatch via rust-gem-release@0.11.0 (#37)
  • Fix stale linux-cross-image-repo comments (#36, @xrl)
  • Bump tesseract-rs 0.1 → 0.2 (#31), calamine 0.34 → 0.35 (#32), quick-xml 0.39 → 0.40 (#33), mupdf 0.6 → 0.7 (#34)
  • Bump actions/checkout 4 → 6 (#30)

Full Changelog: 0.1.3...0.2.0

v0.1.3

24 Mar 13:01

What's Changed

Features & Improvements

  • Centralized format detection: New format_detector.rs module in Rust and ParseKit.detect_format method in Ruby for consistent file format identification across the library
  • Simplified format dispatch: Refactored parser routing to reduce complexity in the core parser module
  • Refactored validation helpers: Cleaner, more maintainable validation logic
  • Improved Rust error handling: Refactored error handling code for better consistency and clarity
  • Automated release workflow: Added GitHub Actions workflow (release.yml) with workflow_dispatch for automated gem publishing
  • SUPPORTED_FORMATS constant: New module-level constant mapping format symbols to file extensions

Dependency Updates

  • calamine (Excel parsing): 0.30 -> 0.34
  • mupdf (PDF parsing): 0.5 -> 0.6
  • quick-xml (XML parsing): 0.38 -> 0.39
  • zip (PPTX handling): 2.1 -> 8.0
  • actions/cache: 3 -> 5
  • actions/checkout: 5 -> 6

Testing

  • Added comprehensive test suites for format detection, dispatch logic, error consistency, and validation helpers (+1,441 lines of specs)
  • Fixed a failing spec related to the refactors

Housekeeping

  • Moved repository to scientist-labs organization
  • Cleaned up dead code in parser module
  • Streamlined Ruby-side documentation and class structure

Full Changelog: 0.1.0...0.1.3

ParseKit 0.1.0 Release 🚀

06 Sep 02:34
e6b578f

We're excited to announce the initial release of ParseKit, a Ruby document parsing toolkit that brings native performance to document text extraction with zero runtime dependencies!

🎯 What is ParseKit?

ParseKit is a native Ruby gem that extracts text from various document formats using high-performance Rust implementations. Unlike other Ruby document parsing solutions, ParseKit bundles all necessary libraries statically, making installation simple with no system dependencies required.

Key Features

  • 📄 Multiple Format Support: PDF, DOCX, XLSX, XLS, PPTX, images (PNG, JPG, TIFF, BMP)
  • 🔍 Built-in OCR: Bundled Tesseract for image text extraction
  • ⚡ Native Performance: Rust-powered parsing with Ruby convenience
  • 📦 Zero Dependencies: Everything bundled - just gem install and go
  • 🛡️ Cross-Platform: Works on Linux, macOS, and Windows

📚 Supported Formats

| Format | Extensions | Method | Features |
|---|---|---|---|
| PDF | .pdf | parse_pdf | Text extraction via MuPDF |
| Word | .docx | parse_docx | Office Open XML format |
| PowerPoint | .pptx | parse_pptx | Text from slides and notes |
| Excel | .xlsx, .xls | parse_xlsx | Both modern and legacy formats |
| Images | .png, .jpg, .jpeg, .tiff, .bmp | ocr_image | OCR via bundled Tesseract |
| JSON | .json | parse_json | Pretty-printed output |
| XML/HTML | .xml, .html | parse_xml | Text content extraction |
| Text | .txt, .csv, .md | parse_text | With encoding detection |
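
To make the table concrete, here is a hedged sketch of how a file extension could be routed to the method listed in the Method column. `METHOD_BY_EXTENSION` and `parse_with` are hypothetical names, not ParseKit API. The method names come from the table, and the sketch assumes that every one of them accepts an array of bytes, which these notes only show for `parse_pdf`, `parse_pptx`, `parse_xlsx`, and `ocr_image`. In practice `parse_file` already auto-detects the format, so this is purely an illustration.

```ruby
# Hypothetical lookup table built from the Supported Formats table above.
METHOD_BY_EXTENSION = {
  ".pdf"  => :parse_pdf,
  ".docx" => :parse_docx,
  ".pptx" => :parse_pptx,
  ".xlsx" => :parse_xlsx, ".xls"  => :parse_xlsx,
  ".png"  => :ocr_image,  ".jpg"  => :ocr_image, ".jpeg" => :ocr_image,
  ".tiff" => :ocr_image,  ".bmp"  => :ocr_image,
  ".json" => :parse_json,
  ".xml"  => :parse_xml,  ".html" => :parse_xml,
  ".txt"  => :parse_text, ".csv"  => :parse_text, ".md" => :parse_text
}.freeze

# Looks up the method for the file's extension and calls it on the given
# parser with the file contents as an array of bytes.
# Assumption: every listed method takes a byte array.
def parse_with(parser, path)
  method_name = METHOD_BY_EXTENSION.fetch(File.extname(path).downcase) do |ext|
    raise ArgumentError, "unsupported extension: #{ext.inspect}"
  end
  parser.public_send(method_name, File.binread(path).bytes)
end

# Usage (requires the parsekit gem):
#   require "parsekit"
#   puts parse_with(ParseKit::Parser.new, "document.pdf")
```

The parser is passed in as an argument rather than created inside the helper, which keeps the sketch independent of the gem and makes it easy to substitute a test double.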

🚀 Quick Start

Installation

gem install parsekit

Or add to your Gemfile:

gem 'parsekit', '~> 0.1.0'

Basic Usage

require 'parsekit'

# Simple file parsing - format auto-detected
text = ParseKit.parse_file("document.pdf")
puts text

# Parse binary data directly  
file_data = File.binread("document.docx")
text = ParseKit.parse_bytes(file_data.bytes)
puts text

# Use parser instance for multiple files
parser = ParseKit::Parser.new
text = parser.parse_file("report.xlsx")
puts text

Advanced Usage

# Direct format-specific parsing
parser = ParseKit::Parser.new

# PDF text extraction
pdf_data = File.read('document.pdf', mode: 'rb').bytes
pdf_text = parser.parse_pdf(pdf_data)

# OCR on images
image_data = File.read('scan.png', mode: 'rb').bytes
ocr_text = parser.ocr_image(image_data)

# PowerPoint presentations  
pptx_data = File.read('slides.pptx', mode: 'rb').bytes
slide_text = parser.parse_pptx(pptx_data)

# Excel spreadsheets
xlsx_data = File.read('data.xlsx', mode: 'rb').bytes
sheet_text = parser.parse_xlsx(xlsx_data)

Configuration Options

# Create parser with options
parser = ParseKit::Parser.new(
  strict_mode: true,
  max_size: 50 * 1024 * 1024,  # 50MB limit
  encoding: 'UTF-8'
)

# Or use the strict convenience method
parser = ParseKit::Parser.strict
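
As an illustration of what a size limit guards against, the following sketch rejects oversized files before they are read into memory. It is independent of ParseKit: `read_within_limit`, `FileTooLargeError`, and `DEFAULT_MAX_SIZE` are hypothetical names, and how the gem itself enforces `max_size` is not described in these notes.

```ruby
# Hypothetical error class for this sketch only.
class FileTooLargeError < StandardError; end

DEFAULT_MAX_SIZE = 50 * 1024 * 1024 # 50MB, matching the example above

# Checks the on-disk size first, then returns the file contents as an
# array of bytes, the input form used by the byte-oriented examples in
# these notes.
def read_within_limit(path, max_size: DEFAULT_MAX_SIZE)
  size = File.size(path)
  if size > max_size
    raise FileTooLargeError, "#{path} is #{size} bytes (limit #{max_size})"
  end
  File.binread(path).bytes
end

# Usage (requires the parsekit gem):
#   require "parsekit"
#   text = ParseKit.parse_bytes(read_within_limit("document.docx"))
```

Checking `File.size` before reading avoids allocating memory for a file that would be rejected anyway.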

🔧 Technical Architecture

ParseKit uses a hybrid Ruby/Rust architecture:

  • Ruby Layer: Provides convenient API and format detection
  • Rust Layer: High-performance parsing using:
    • MuPDF for PDF text extraction (statically linked)
    • tesseract-rs for OCR (bundled Tesseract by default)
    • docx-rs for Word document parsing
    • calamine for Excel parsing
    • zip + quick-xml for PowerPoint parsing
    • Magnus for Ruby-Rust FFI bindings

🎨 Zero-Dependency Philosophy

Traditional Ruby document parsing requires complex system dependencies:

  • Tesseract OCR installation
  • Poppler for PDF handling
  • ImageMagick for image processing
  • Platform-specific libraries

ParseKit eliminates all of this by bundling everything needed:

# Traditional approach
brew install tesseract poppler imagemagick  # macOS
sudo apt-get install tesseract-ocr poppler-utils imagemagick  # Ubuntu
gem install some-parsing-gem

# ParseKit approach  
gem install parsekit  # Done! 

⚡ Performance Features

  • Native Rust Speed: Core parsing implemented in Rust for maximum performance
  • Statically Linked Libraries: MuPDF and Tesseract compiled with optimizations
  • Efficient Memory Usage: Streaming where possible, configurable size limits
  • Smart Format Detection: Magic number detection with filename fallback
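
Magic number detection with a filename fallback can be sketched as follows. This is an illustration of the general technique, not ParseKit's implementation: `sniff_format`, the constants, and the returned symbols are all hypothetical and may not match what the gem returns. The byte signatures themselves are the standard ones for each format. DOCX, XLSX, and PPTX are all ZIP containers and share one signature, which is why the filename is still needed to tell them apart.

```ruby
# Leading-byte signatures for formats identifiable from content alone.
# Note: the BMP signature is only two bytes, so it can produce false
# positives on arbitrary data that happens to start with "BM".
MAGIC_SIGNATURES = {
  "%PDF-".b             => :pdf,
  "\x89PNG\r\n\x1A\n".b => :png,
  "\xFF\xD8\xFF".b      => :jpeg,
  "II*\x00".b           => :tiff, # little-endian TIFF
  "MM\x00*".b           => :tiff, # big-endian TIFF
  "BM".b                => :bmp
}.freeze

ZIP_SIGNATURE = "PK\x03\x04".b
ZIP_FORMATS = { ".docx" => :docx, ".xlsx" => :xlsx, ".pptx" => :pptx }.freeze

# Used only when the content carries no recognizable signature.
EXTENSION_FALLBACK = {
  ".txt" => :text, ".csv" => :text, ".md" => :text,
  ".json" => :json, ".xml" => :xml, ".html" => :xml, ".xls" => :xls
}.freeze

# data: a String of file contents; filename: optional hint for the fallback.
def sniff_format(data, filename = nil)
  head = data.to_s.byteslice(0, 8).b
  MAGIC_SIGNATURES.each do |signature, format|
    return format if head.start_with?(signature)
  end
  extension = filename ? File.extname(filename).downcase : ""
  # ZIP containers: the signature proves it is a ZIP, the name picks the kind.
  return ZIP_FORMATS.fetch(extension, :zip) if head.start_with?(ZIP_SIGNATURE)
  EXTENSION_FALLBACK.fetch(extension, :unknown)
end

# Usage:
#   sniff_format(File.binread("scan.png"), "scan.png")
```

Checking content before the filename means a mislabeled file (for example, a PDF saved with a .txt extension) is still identified by what it contains.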

🛠️ Advanced OCR Configuration

ParseKit includes two OCR modes for maximum flexibility:

Bundled Mode (Default)

# Zero setup - works out of the box
parser = ParseKit::Parser.new
text = parser.ocr_image(image_data)

System Mode (Advanced Users)

For developers who want faster gem installation and already have Tesseract:

# Install without bundled features
gem install parsekit -- --no-default-features

# For development
rake compile CARGO_FEATURES=""  # Disables bundled-tesseract

🧪 Real-World Examples

Batch Document Processing

require 'parsekit'

parser = ParseKit::Parser.new
documents_dir = "path/to/documents"

Dir.glob("#{documents_dir}/*.{pdf,docx,xlsx,pptx,png,jpg}").each do |file|
  begin
    text = parser.parse_file(file)
    
    # Process extracted text
    puts "#{file}: #{text.length} characters extracted"
    
    # Save to text file
    output_file = file.gsub(/\.[^.]+$/, '.txt')
    File.write(output_file, text)
  rescue => e
    puts "Error processing #{file}: #{e.message}"
  end
end

OCR Pipeline

require 'parsekit'

def extract_text_from_images(image_dir)
  parser = ParseKit::Parser.new
  results = {}
  
  Dir.glob("#{image_dir}/*.{png,jpg,jpeg,tiff,bmp}").each do |image_file|
    puts "Processing #{image_file}..."
    
    image_data = File.read(image_file, mode: 'rb').bytes
    text = parser.ocr_image(image_data)
    
    results[image_file] = {
      text: text,
      length: text.length,
      processed_at: Time.now
    }
  end
  
  results
end

# Process all images
results = extract_text_from_images("scanned_documents/")
results.each do |file, data|
  puts "#{file}: #{data[:length]} chars - #{data[:text][0..100]}..."
end

Document Classification

require 'fileutils'  # needed for FileUtils.mkdir_p and FileUtils.mv below
require 'parsekit'

class DocumentClassifier
  def initialize
    @parser = ParseKit::Parser.new
  end
  
  def classify(file_path)
    text = @parser.parse_file(file_path)
    
    case text
    when /\b(invoice|bill|payment)\b/i
      :invoice
    when /\b(resume|curriculum vitae|cv)\b/i  
      :resume
    when /\b(contract|agreement|terms)\b/i
      :contract
    when /\b(report|analysis|summary)\b/i
      :report
    else
      :unknown
    end
  end
end

classifier = DocumentClassifier.new

Dir.glob("uploads/*.{pdf,docx}").each do |file|
  category = classifier.classify(file)
  puts "#{file} -> #{category}"
  
  # Move to appropriate directory
  FileUtils.mkdir_p("sorted/#{category}")
  FileUtils.mv(file, "sorted/#{category}/")
end

🔄 Migration Guide

Coming from other Ruby document parsing gems? Here's how ParseKit compares:

From pdf-reader

# Before (pdf-reader)
require 'pdf-reader'
text = PDF::Reader.new('document.pdf').pages.map(&:text).join

# After (ParseKit)  
require 'parsekit'
text = ParseKit.parse_file('document.pdf')

From docx gem

# Before (docx)
require 'docx'
doc = Docx::Document.open('document.docx')
text = doc.paragraphs.map(&:text).join

# After (ParseKit)
require 'parsekit' 
text = ParseKit.parse_file('document.docx')

From RTesseract

# Before (RTesseract - requires system tesseract)
require 'rtesseract'  
text = RTesseract.new('image.png').to_s

# After (ParseKit - zero dependencies)
require 'parsekit'
text = ParseKit.parse_file('image.png')

📦 Installation Requirements

  • Ruby: >= 3.0.0
  • Rust: Automatically handled during gem installation
  • System Dependencies: None! Everything is bundled

🙏 Acknowledgments

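ParseKit builds on excellent Rust crates, including those named under Technical Architecture above (MuPDF, tesseract-rs, docx-rs, calamine, zip, quick-xml, and Magnus).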

🚀 Ready to Parse?

Install ParseKit 0.1.0 today and start extracting text from any document format with zero hassle:

gem install parsekit

No system dependencies. No complex setup. Just install and parse! 🎯✨


For documentation, examples, and source code, visit: github.com/cpetersen/parsekit