Support embedded images in PDFs

I'm putting this here because GitHub issues aren't enabled on the https://github.com/mattfullerton/tika-tesseract-docker repo.

I've worked on a fork that enables Tika to run OCR on embedded images within PDFs: https://github.com/xavriley/tika-tesseract-docker. This involves a fairly minimal find and replace in one of the `.properties` files in the Tika source, done as part of the `install.sh` script, and I think that approach should be fairly robust.
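To illustrate the kind of find and replace involved, here is a minimal sketch in Python. It assumes the setting being flipped is Tika's `extractInlineImages` PDF parser property, which the text above does not name, so treat that as a guess and check the fork's `install.sh` for the real change. The helper `enable_property` is hypothetical.

```python
import re

# Assumption: the text above only says "one of the .properties files";
# extractInlineImages is a guess at the property involved.
PROPERTY = "extractInlineImages"


def enable_property(properties_text: str, name: str = PROPERTY) -> str:
    """Flip `name` from false to true in Java .properties text.

    Handles the three key/value separators allowed in .properties files
    ('=', ':' or whitespace) and leaves every other line untouched.
    """
    pattern = re.compile(
        rf"^([ \t]*{re.escape(name)}[ \t]*[=: \t][ \t]*)false[ \t]*$",
        re.MULTILINE,
    )
    return pattern.sub(r"\g<1>true", properties_text)
```

A build script could read the properties file, pass its contents through `enable_property`, and write the result back before compiling Tika. Matching on the whole key rather than a bare `false` keeps the replacement from touching unrelated settings.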

Being able to OCR embedded images means you no longer have to split documents up into images first, which might be useful for less technical users.
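As a rough sketch of what that looks like from the client side, the whole PDF can be sent in a single request. This assumes the container exposes Tika server's `PUT /tika` endpoint at `http://localhost:9998`; the host, port, and the helper name `build_tika_request` are assumptions for illustration, not details taken from either repo.

```python
import urllib.request


def build_tika_request(pdf_bytes: bytes,
                       base_url: str = "http://localhost:9998") -> urllib.request.Request:
    """Build a request that sends a whole PDF to Tika for text extraction.

    base_url is an assumed address; adjust it to wherever the container runs.
    """
    return urllib.request.Request(
        url=f"{base_url}/tika",
        data=pdf_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/pdf",  # tell Tika what is being sent
            "Accept": "text/plain",             # ask for plain extracted text
        },
    )


def extract_text(pdf_path: str, base_url: str = "http://localhost:9998") -> str:
    """Read a PDF from disk, send it to Tika, and return the extracted text."""
    with open(pdf_path, "rb") as f:
        request = build_tika_request(f.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("scan.pdf")` against a running container should return the document text, including OCR output for embedded images if the server is configured for it.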

I was working on this as a pet project to kick the tyres on Heroku's new Docker support (details in the `README`), and it works well. However, Heroku have a 30 second time limit on web requests, which means it's not a practical host if you have larger files. I'm going to try building a proxy to work around that in time. I haven't shaped this as a pull request because the `Dockerfile` has changes specifically for Heroku, but I thought the approach might be useful for the OKFN project.

