I'm posting this here because there is no GitHub issue tracker on the https://github.com/mattfullerton/tika-tesseract-docker repo.
I've worked on a fork that enables Tika to run OCR on images embedded in PDFs (https://github.com/xavriley/tika-tesseract-docker). It involves a fairly minimal find-and-replace in one of the .properties files in the Tika source, done as part of the install.sh script, and I think that approach should be fairly robust.
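To give a rough idea of what that kind of find-and-replace does, here is a minimal Python sketch. It is not the code from the fork, which does this in the install.sh shell script. The key name extractInlineImages and the sample file contents are assumptions for illustration (I believe that is the relevant flag in Tika's PDFParser.properties, but check the fork for the actual file and key), and enable_flag is a hypothetical helper.

```python
import re


def enable_flag(properties_text: str, key: str) -> str:
    """Return properties_text with `key false` switched to `key true`.

    Java .properties files allow '=', ':' or whitespace between key and
    value, so all three separators are accepted. Other lines are untouched.
    """
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*[=: \t][ \t]*)false[ \t]*$",
        re.MULTILINE,
    )
    return pattern.sub(lambda m: m.group(1) + "true", properties_text)


# Assumed sample contents; the real file in the Tika source may differ.
sample = (
    "enableAutoSpace true\n"
    "extractInlineImages false\n"
    "sortByPosition false\n"
)

patched = enable_flag(sample, "extractInlineImages")
print(patched)
```

Anchoring the pattern on the whole line and on the exact key is what keeps a replacement like this reasonably robust: other settings that happen to be false are left alone, and running it twice changes nothing.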
Because embedded images can be OCRed directly, you no longer have to split documents into images first, which might be useful for less technical users.
I was working on this as a pet project to kick the tyres on Heroku's new Docker support (details in the README), and it works well. However, Heroku has a 30-second time limit on web requests, so it's not a practical host for larger files; I'm going to try building a proxy to work around that in time. Because the Dockerfile has changes specific to Heroku, I haven't shaped this as a pull request, but I thought the approach might be useful for the OKFN project.