How to speed up OCR in Paperwork/OpenPaper Windows

I installed Paperwork about 3 weeks ago (Windows 10). I leave it running day and night. I have about 400 documents (none longer than 20 pages) to OCR and import from a local directory. The import/OCR task seems to run only sporadically and in the background. After 3 weeks, I have only around 100 documents imported. Is there something I can do to prioritise the import/OCR (perhaps bring the task into the foreground)?

Thank you very much!
Cam

Hello,

As is, you can’t speed up or prioritize OCR. Whether the OCR runs in the background or not doesn’t change its priority.

OCR doesn’t run sporadically. It runs immediately whenever you add or modify a document/page (or else it’s a bug).

However, workflows can be changed. For instance, if you want to import many documents:

  • you can disable the OCR in the settings (just select no languages at all)
  • import/scan the documents you want to add (Note: orientation detection doesn’t work without OCR)
  • re-enable the OCR in the settings
  • then, go over each document and select the option “Redo OCR on document”. You can do it immediately on many documents: OCR requests will be queued and run one after the other.

On my side, there are changes I’m going to make:

  • when selecting multiple documents (see option “Add to selection”), I will add an option to redo the OCR on all the selected documents
  • when running the OCR on a document with multiple pages, OCR could be parallelized (depending on your computer’s CPU)

However, I’m short on free time, so I’m not sure when I will be able to make those changes (I can’t say for sure whether it will be in Paperwork 2.1 or not).
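
For illustration only (this is not Paperwork’s actual code), the per-page parallelization mentioned above could be sketched in Python roughly as follows. The function ocr_page is a hypothetical stand-in for a real OCR call, and the sketch assumes that the real call spends its time in an external process or native library, so that threads can actually run side by side.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def ocr_page(page_path):
    """Hypothetical stand-in for a real OCR call on one page image."""
    # A real implementation would invoke an OCR engine here.
    return f"text of {page_path}"


def ocr_document(page_paths, max_workers=None):
    """Run OCR on every page of one document, several pages at a time."""
    if max_workers is None:
        # One worker per CPU core; os.cpu_count() may return None.
        max_workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in page order, whatever order pages finish in.
        return list(pool.map(ocr_page, page_paths))


# Example page names are made up for the sketch.
pages = ["page_1.png", "page_2.png", "page_3.png"]
texts = ocr_document(pages)
```

With a single worker this behaves like the current one-after-the-other queue; with more workers, the gain depends on how many CPU cores are available.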

That’s super: thanks for the reply! The “import without OCR” suggestion is helpful. Ultimately, the OCR parallelization will be the proper cure. We’ll look forward to it! Thanks again! Finally, thanks also for Paperwork.

when selecting multiple documents (see option “Add to selection”), I will add an option to redo the OCR on all the selected documents

Option has been added in branch ‘testing’ (build is running)

How can I export only the OCRed text from all pages of a PDF into a new DOCX or PDF?

Currently, you can’t. But I’m puzzled by your request. What would be the point of exporting only the text?

Well, I just need a GUI-based app that can OCR my PDFs on Linux (I just moved from Windows) and output all the OCRed text as a DOCX so I can edit, add, and remove sentences, etc.
I don’t know of any up-to-date GUI-based app that uses the latest Tesseract version. I am not sure whether yours does. If you can recommend a better app for this, please let me know.