Support embedded images in PDFs #3

@xavriley

I'm putting this here because GitHub issues aren't enabled on the https://github.com/mattfullerton/tika-tesseract-docker repo.

I've worked on a fork that enables Tika to run OCR on embedded images within PDFs: https://github.com/xavriley/tika-tesseract-docker. It involves a fairly minimal find-and-replace in one of the .properties files in the Tika source, done as part of the install.sh script. I think that approach should be fairly robust.
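To give a rough idea of what that kind of find-and-replace step can look like, here is a minimal sketch. It is not copied from the fork's install.sh: the function name `enable_inline_images` is made up for illustration, and the property line `extractInlineImages false` is my assumption about which setting gets flipped, so check the fork for the actual file and key.

```shell
#!/bin/sh
# Sketch only. The property name (extractInlineImages) and the function name
# are assumptions for illustration, not taken from the fork's install.sh.

# enable_inline_images FILE
# Rewrites a line "extractInlineImages false" to "extractInlineImages true"
# in FILE and leaves every other line untouched.
enable_inline_images() {
    props_file="$1"
    tmp_file="${props_file}.tmp"
    # Write to a temp file rather than using `sed -i`, whose syntax differs
    # between GNU and BSD sed.
    sed 's/^extractInlineImages[[:space:]][[:space:]]*false[[:space:]]*$/extractInlineImages true/' \
        "$props_file" > "$tmp_file" && mv "$tmp_file" "$props_file"
}
```

In an install script this would be called once on the relevant .properties file after fetching the Tika source and before building it, e.g. `enable_inline_images path/to/PDFParser.properties` (the path here is a placeholder).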

With the ability to OCR embedded images, you no longer have to split documents up into images first, which might be useful for less technical users.

I was working on this as a pet project to kick the tyres on Heroku's new Docker support (details in the README), and it works well. However, Heroku have a 30-second time limit on web requests, which means it's not a practical host if you have larger files. I'm going to try building a proxy to work around that in time. I haven't shaped this as a pull request because the Dockerfile has changes specifically for Heroku, but I thought the approach might be useful for the OKFN project.
