Releases: scientist-labs/parsekit
v0.2.0
ParseKit provides native Ruby bindings, built with Magnus, for the parser-core Rust crate. It extracts text from PDFs, Office documents (DOCX, XLSX), and images (via Tesseract OCR), and is part of the ruby-nlp ecosystem.
Headline: 0.2.0 is a packaging and tooling release with no functional or API changes. The story here is the build pipeline: ParseKit now ships precompiled native gems, so on supported platforms you no longer need a Rust toolchain or a system MuPDF/Tesseract install just to run bundle install.
This release adopts the shared scientist-labs/rust-gem-release workflow, refreshes several Rust dependencies, and bumps CI actions. Functionally, parsing behavior is identical to 0.1.3.
What ships
RubyGems automatically selects the matching binary for your platform and Ruby ABI, and falls back to the source ruby gem when no precompiled match exists.
| Platform | Install requirement |
|---|---|
| x86_64-linux | none (precompiled) |
| arm64-darwin | none (precompiled) |
| ruby (source) | Rust toolchain |
Precompiled binaries cover the Ruby 3.1, 3.2, 3.3, and 3.4 ABIs (the gem supports Ruby >= 3.0; a 3.0 install falls back to the source gem).
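As a rough way to predict which gem variant a machine would receive, the sketch below checks a platform string and Ruby version against the list above. The `precompiled_available?` helper and its constants are hypothetical names introduced for illustration, and RubyGems' real platform matching (libc variants on Linux, for example) is more involved than this comparison.

```ruby
require 'rubygems'

# Platforms and Ruby ABIs listed as precompiled in the 0.2.0 notes.
PRECOMPILED_PLATFORMS = [%w[x86_64 linux], %w[arm64 darwin]].freeze
PRECOMPILED_ABIS = %w[3.1 3.2 3.3 3.4].freeze

# Hypothetical helper: true when a precompiled gem should exist for the
# given platform string and Ruby version. Simplified on purpose: it ignores
# the libc variants and OS versions that RubyGems also considers.
def precompiled_available?(platform = Gem::Platform.local.to_s, ruby_version = RUBY_VERSION)
  plat = Gem::Platform.new(platform)
  abi = Gem::Version.new(ruby_version).segments.first(2).join('.')
  PRECOMPILED_PLATFORMS.include?([plat.cpu, plat.os]) && PRECOMPILED_ABIS.include?(abi)
end

if precompiled_available?
  puts 'A precompiled gem should be selected; no Rust toolchain needed.'
else
  puts 'Expect the source gem; a Rust toolchain will be required.'
end
```

Running this on a build host before bundle install gives an early hint about whether the Rust toolchain needs to be present.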
A note on aarch64-linux
aarch64-linux is not precompiled in this release. Its statically-linked MuPDF/Tesseract cross-build is heavy and is intentionally disabled for now, so ARM Linux installs fall back to the source ruby gem and will need a Rust toolchain plus the usual system build dependencies.
Changelog
- Adopt scientist-labs/rust-gem-release@v0 for releases (#35, @xrl)
- Add "type a version in a box" release dispatch via rust-gem-release@0.11.0 (#37)
- Fix stale linux-cross-image-repo comments (#36, @xrl)
- Bump tesseract-rs 0.1 → 0.2 (#31), calamine 0.34 → 0.35 (#32), quick-xml 0.39 → 0.40 (#33), mupdf 0.6 → 0.7 (#34)
- Bump actions/checkout 4 → 6 (#30)
Full Changelog: 0.1.3...0.2.0
v0.1.3
What's Changed
Features & Improvements
- Centralized format detection: New format_detector.rs module in Rust and ParseKit.detect_format method in Ruby for consistent file format identification across the library
- Simplified format dispatch: Refactored parser routing to reduce complexity in the core parser module
- Refactored validation helpers: Cleaner, more maintainable validation logic
- Improved Rust error handling: Refactored error handling code for better consistency and clarity
- Automated release workflow: Added GitHub Actions workflow (release.yml) with workflow_dispatch for automated gem publishing
- SUPPORTED_FORMATS constant: New module-level constant mapping format symbols to file extensions
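To illustrate how a constant that maps format symbols to extensions can drive extension-based detection, here is a minimal sketch. The `EXAMPLE_FORMATS` hash, its contents, and the `format_for` helper are assumptions made for illustration; the actual values of SUPPORTED_FORMATS and the behavior of ParseKit.detect_format are not shown in these notes.

```ruby
# Illustrative stand-in for a format => extensions constant (assumed values).
EXAMPLE_FORMATS = {
  pdf:   %w[.pdf],
  docx:  %w[.docx],
  xlsx:  %w[.xlsx .xls],
  pptx:  %w[.pptx],
  image: %w[.png .jpg .jpeg .tiff .bmp],
  text:  %w[.txt .csv .md]
}.freeze

# Invert the map once so a lookup by extension is a single hash access.
EXTENSION_INDEX = EXAMPLE_FORMATS.each_with_object({}) do |(format, exts), index|
  exts.each { |ext| index[ext] = format }
end.freeze

# Hypothetical helper: returns a format symbol, or :unknown when the
# extension is missing or not in the map. Matching is case-insensitive.
def format_for(path)
  EXTENSION_INDEX.fetch(File.extname(path).downcase, :unknown)
end

puts format_for('report.XLSX')
puts format_for('notes.rtf')
```

Keeping the mapping in one constant means detection, validation, and documentation can all read from the same source.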
Dependency Updates
- calamine (Excel parsing): 0.30 -> 0.34
- mupdf (PDF parsing): 0.5 -> 0.6
- quick-xml (XML parsing): 0.38 -> 0.39
- zip (PPTX handling): 2.1 -> 8.0
- actions/cache: 3 -> 5
- actions/checkout: 5 -> 6
Testing
- Added comprehensive test suites for format detection, dispatch logic, error consistency, and validation helpers (+1,441 lines of specs)
- Fixed a failing spec related to the refactors
Housekeeping
- Moved repository to scientist-labs organization
- Cleaned up dead code in parser module
- Streamlined Ruby-side documentation and class structure
Full Changelog: 0.1.0...0.1.3
ParseKit 0.1.0 Release 🚀
We're excited to announce the initial release of ParseKit, a Ruby document parsing toolkit that brings native performance to document text extraction with zero runtime dependencies!
🎯 What is ParseKit?
ParseKit is a native Ruby gem that extracts text from various document formats using high-performance Rust implementations. Unlike other Ruby document parsing solutions, ParseKit bundles all necessary libraries statically, making installation simple with no system dependencies required.
Key Features
- 📄 Multiple Format Support: PDF, DOCX, XLSX, XLS, PPTX, images (PNG, JPG, TIFF, BMP)
- 🔍 Built-in OCR: Bundled Tesseract for image text extraction
- ⚡ Native Performance: Rust-powered parsing with Ruby convenience
- 📦 Zero Dependencies: Everything bundled - just gem install and go
- 🛡️ Cross-Platform: Works on Linux, macOS, and Windows
📚 Supported Formats
| Format | Extensions | Method | Features |
|---|---|---|---|
| PDF | .pdf | parse_pdf | Text extraction via MuPDF |
| Word | .docx | parse_docx | Office Open XML format |
| PowerPoint | .pptx | parse_pptx | Text from slides and notes |
| Excel | .xlsx, .xls | parse_xlsx | Both modern and legacy formats |
| Images | .png, .jpg, .jpeg, .tiff, .bmp | ocr_image | OCR via bundled Tesseract |
| JSON | .json | parse_json | Pretty-printed output |
| XML/HTML | .xml, .html | parse_xml | Text content extraction |
| Text | .txt, .csv, .md | parse_text | With encoding detection |
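One way to read the table is as a dispatch map from a detected format to a parser method. The sketch below assumes only the method names listed above; the `METHOD_FOR_FORMAT` hash, the `dispatch` helper, and the `FakeParser` stand-in are hypothetical and do not reflect ParseKit's internal routing.

```ruby
# Method names come from the table; the format keys are illustrative.
METHOD_FOR_FORMAT = {
  pdf: :parse_pdf, docx: :parse_docx, pptx: :parse_pptx, xlsx: :parse_xlsx,
  image: :ocr_image, json: :parse_json, xml: :parse_xml, text: :parse_text
}.freeze

# Hypothetical helper: route raw bytes to the matching parser method,
# raising ArgumentError for formats that are not in the map.
def dispatch(parser, format, bytes)
  method_name = METHOD_FOR_FORMAT.fetch(format) do
    raise ArgumentError, "unsupported format: #{format.inspect}"
  end
  parser.public_send(method_name, bytes)
end

# Stand-in so the sketch runs without the gem installed; with ParseKit
# available, ParseKit::Parser.new would take its place.
class FakeParser
  def parse_pdf(bytes)
    "pdf text (#{bytes.length} bytes)"
  end
end

puts dispatch(FakeParser.new, :pdf, [0x25, 0x50, 0x44, 0x46])
```

Because the parser is passed in, the same helper works with any object that responds to the listed methods, which also makes it easy to test.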
🚀 Quick Start
Installation
gem install parsekit
Or add to your Gemfile:
gem 'parsekit', '~> 0.1.0'
Basic Usage
require 'parsekit'
# Simple file parsing - format auto-detected
text = ParseKit.parse_file("document.pdf")
puts text
# Parse binary data directly
file_data = File.binread("document.docx")
text = ParseKit.parse_bytes(file_data.bytes)
puts text
# Use parser instance for multiple files
parser = ParseKit::Parser.new
text = parser.parse_file("report.xlsx")
puts text
Advanced Usage
# Direct format-specific parsing
parser = ParseKit::Parser.new
# PDF text extraction
pdf_data = File.read('document.pdf', mode: 'rb').bytes
pdf_text = parser.parse_pdf(pdf_data)
# OCR on images
image_data = File.read('scan.png', mode: 'rb').bytes
ocr_text = parser.ocr_image(image_data)
# PowerPoint presentations
pptx_data = File.read('slides.pptx', mode: 'rb').bytes
slide_text = parser.parse_pptx(pptx_data)
# Excel spreadsheets
xlsx_data = File.read('data.xlsx', mode: 'rb').bytes
sheet_text = parser.parse_xlsx(xlsx_data)
Configuration Options
# Create parser with options
parser = ParseKit::Parser.new(
  strict_mode: true,
  max_size: 50 * 1024 * 1024, # 50MB limit
  encoding: 'UTF-8'
)
# Or use the strict convenience method
parser = ParseKit::Parser.strict
🔧 Technical Architecture
ParseKit uses a hybrid Ruby/Rust architecture:
- Ruby Layer: Provides convenient API and format detection
- Rust Layer: High-performance parsing using:
  - MuPDF for PDF text extraction (statically linked)
  - tesseract-rs for OCR (bundled Tesseract by default)
  - docx-rs for Word document parsing
  - calamine for Excel parsing
  - zip + quick-xml for PowerPoint parsing
  - Magnus for Ruby-Rust FFI bindings
🎨 Zero-Dependency Philosophy
Traditional Ruby document parsing requires complex system dependencies:
- Tesseract OCR installation
- Poppler for PDF handling
- ImageMagick for image processing
- Platform-specific libraries
ParseKit eliminates all of this by bundling everything needed:
# Traditional approach
brew install tesseract poppler imagemagick # macOS
sudo apt-get install tesseract-ocr poppler-utils imagemagick # Ubuntu
gem install some-parsing-gem
# ParseKit approach
gem install parsekit # Done!
⚡ Performance Features
- Native Rust Speed: Core parsing implemented in Rust for maximum performance
- Statically Linked Libraries: MuPDF and Tesseract compiled with optimizations
- Efficient Memory Usage: Streaming where possible, configurable size limits
- Smart Format Detection: Magic number detection with filename fallback
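Magic number detection with a filename fallback can be sketched as follows. The signatures are the well-known leading bytes for each format; the `sniff_format` helper and its return symbols are hypothetical, and ParseKit's own detector may behave differently. ZIP-based Office files share one signature, so the sketch uses the extension to tell them apart.

```ruby
# Well-known leading byte signatures, stored as binary strings.
MAGIC_NUMBERS = {
  pdf:  ['%PDF'.b],
  png:  ["\x89PNG\r\n\x1A\n".b],
  jpeg: ["\xFF\xD8\xFF".b],
  tiff: ["II*\x00".b, "MM\x00*".b],
  bmp:  ['BM'.b], # short signature, so false positives are possible
  zip:  ["PK\x03\x04".b]
}.freeze

OFFICE_EXTENSIONS = { '.docx' => :docx, '.xlsx' => :xlsx, '.pptx' => :pptx }.freeze
TEXT_EXTENSIONS = { '.txt' => :text, '.csv' => :text, '.md' => :text }.freeze

# Hypothetical helper: inspect the first bytes, then fall back to the
# filename when the content alone is ambiguous or unrecognized.
def sniff_format(data, filename = nil)
  head = data.b[0, 8]
  ext = filename ? File.extname(filename).downcase : ''
  magic = MAGIC_NUMBERS.find { |_, sigs| sigs.any? { |sig| head.start_with?(sig) } }&.first

  # DOCX, XLSX, and PPTX are all ZIP containers; the extension disambiguates.
  return OFFICE_EXTENSIONS.fetch(ext, :zip) if magic == :zip
  return magic if magic

  TEXT_EXTENSIONS.fetch(ext, :unknown)
end

puts sniff_format('%PDF-1.7'.b, 'report.bin')
puts sniff_format("PK\x03\x04".b, 'slides.pptx')
```

Checking content first means a mislabeled file is still identified by what it contains, while the extension only breaks ties.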
🛠️ Advanced OCR Configuration
ParseKit includes two OCR modes for maximum flexibility:
Bundled Mode (Default)
# Zero setup - works out of the box
parser = ParseKit::Parser.new
text = parser.ocr_image(image_data)
System Mode (Advanced Users)
For developers who want faster gem installation and already have Tesseract:
# Install without bundled features
gem install parsekit -- --no-default-features
# For development
rake compile CARGO_FEATURES="" # Disables bundled-tesseract
🧪 Real-World Examples
Batch Document Processing
require 'parsekit'
parser = ParseKit::Parser.new
documents_dir = "path/to/documents"
Dir.glob("#{documents_dir}/*.{pdf,docx,xlsx,pptx,png,jpg}").each do |file|
  begin
    text = parser.parse_file(file)
    # Process extracted text
    puts "#{file}: #{text.length} characters extracted"
    # Save to text file
    output_file = file.gsub(/\.[^.]+$/, '.txt')
    File.write(output_file, text)
  rescue => e
    puts "Error processing #{file}: #{e.message}"
  end
end
OCR Pipeline
require 'parsekit'
def extract_text_from_images(image_dir)
  parser = ParseKit::Parser.new
  results = {}
  Dir.glob("#{image_dir}/*.{png,jpg,jpeg,tiff,bmp}").each do |image_file|
    puts "Processing #{image_file}..."
    image_data = File.read(image_file, mode: 'rb').bytes
    text = parser.ocr_image(image_data)
    results[image_file] = {
      text: text,
      length: text.length,
      processed_at: Time.now
    }
  end
  results
end
# Process all images
results = extract_text_from_images("scanned_documents/")
results.each do |file, data|
  puts "#{file}: #{data[:length]} chars - #{data[:text][0..100]}..."
end
Document Classification
require 'parsekit'
require 'fileutils' # needed for FileUtils.mkdir_p and FileUtils.mv below
class DocumentClassifier
  def initialize
    @parser = ParseKit::Parser.new
  end
  # Returns a category symbol based on the first keyword group that matches
  def classify(file_path)
    text = @parser.parse_file(file_path)
    case text
    when /\b(invoice|bill|payment)\b/i
      :invoice
    when /\b(resume|curriculum vitae|cv)\b/i
      :resume
    when /\b(contract|agreement|terms)\b/i
      :contract
    when /\b(report|analysis|summary)\b/i
      :report
    else
      :unknown
    end
  end
end
classifier = DocumentClassifier.new
Dir.glob("uploads/*.{pdf,docx}").each do |file|
  category = classifier.classify(file)
  puts "#{file} -> #{category}"
  # Move to appropriate directory
  FileUtils.mkdir_p("sorted/#{category}")
  FileUtils.mv(file, "sorted/#{category}/")
end
🔄 Migration Guide
Coming from other Ruby document parsing gems? Here's how ParseKit compares:
From pdf-reader
# Before (pdf-reader)
require 'pdf-reader'
text = PDF::Reader.new('document.pdf').pages.map(&:text).join
# After (ParseKit)
require 'parsekit'
text = ParseKit.parse_file('document.pdf')
From docx gem
# Before (docx)
require 'docx'
doc = Docx::Document.open('document.docx')
text = doc.paragraphs.map(&:text).join
# After (ParseKit)
require 'parsekit'
text = ParseKit.parse_file('document.docx')
From RTesseract
# Before (RTesseract - requires system tesseract)
require 'rtesseract'
text = RTesseract.new('image.png').to_s
# After (ParseKit - zero dependencies)
require 'parsekit'
text = ParseKit.parse_file('image.png')
📦 Installation Requirements
- Ruby: >= 3.0.0
- Rust: Automatically handled during gem installation
- System Dependencies: None! Everything is bundled
🙏 Acknowledgments
ParseKit builds on excellent Rust crates:
- mupdf for PDF parsing
- tesseract-rs for OCR
- docx-rs for Word documents
- calamine for Excel files
- quick-xml and zip for PowerPoint
- magnus for Ruby-Rust integration
🚀 Ready to Parse?
Install ParseKit 0.1.0 today and start extracting text from any document format with zero hassle:
gem install parsekit
No system dependencies. No complex setup. Just install and parse! 🎯✨
For documentation, examples, and source code, visit: github.com/cpetersen/parsekit