Paperwork - test - my wishes - or any hint for me?

Hello,

Yesterday I tried out Paperwork. My concern is that I have a lot of old technical PDF documents (>10000) with only image content. My idea was and is to tag them.

No sooner said than done: I searched, found Paperwork, installed it and imported 300 PDF files.

The OCR took a few hours, but that wouldn’t be a problem for me.

The results are pretty good for documents in portrait format. They are not perfect, which was not to be expected anyway, but they are useful for me. The query language with its logical operators is excellent.

But now to the problems and my two questions: can I change something in the settings, or is further development of Paperwork being considered here?

A) Unfortunately, some pages in landscape format in the PDF files are not recognised at all, which limits the search in detail.

B) What I don’t like at all is the naming of the documents with the date. Usually, I have many search results. Keeping the original name of the PDF file and displaying it in the search results would help me a lot more to find the right document. I understand the thinking behind using the date for commercial matters, for example, but I certainly can’t handle >10000 technical documents with a date as their name.

Is there any way I can use Paperwork in a way that makes sense to me?

Thank you!

All the best!

Dening

Hello,

A) Unfortunately, some pages in landscape format in the PDF files are not recognised at all, which limits the search in detail.

Do you have a document you could provide as an example?
FYI, while Paperwork doesn’t run page orientation detection on PDFs, you can edit the PDF yourself in Paperwork (rotate pages, etc.). The original PDF won’t be modified. Paperwork will just create a PNG beside the original PDF for each modified page.

B) What I don’t like at all is the naming of the documents with the date. Usually, I have many search results. Keeping the original name of the PDF file and displaying it in the search results would help me a lot more to find the right document. I understand the thinking behind using the date for commercial matters, for example, but I certainly can’t handle >10000 technical documents with a date as their name.

Is there any way I can use Paperwork in a way that makes sense to me?

To be blunt, Paperwork is designed for recurring personal documents (bills, payslips, etc.), not really for technical documentation. I’m not sure that it fits your use case.

In the mockups made by Mathieu Jourdan a long time ago, he suggested that Paperwork display the PDF titles beside the labels. I’m not fond of the idea as it doesn’t fit the main goal of Paperwork, but I guess that it’s something I will have to consider.

Hello,

Thanks for your reply.

As an example, I took a detailed look at one document with the landscape ‘problems’. Well, that’s obviously difficult for Tesseract: every page of the PDF is stored in portrait orientation, but some pages of the original paper were in landscape format and were scanned in portrait anyway, just as the feeder scanner took them. Yes, I am aware of the page rotation, but manually flipping through more than 10,000 PDFs is a lot of work. However, I now think that the problem is more on my side than on the application side.

Showing the PDF file name / title as well would be a wonderful solution.

There’s no rush :wink:

Thank you and best regards,

Dening

Maybe have a look at ocrmypdf. I have used it to batch convert a lot of old PDFs in my archives folder before feeding them into Paperwork. Automatic rotation and parallel OCR save a lot of time and effort.
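
In case it helps, here is a rough sketch of how such a batch run could look in Python. It is not the exact script I used: the function names and the folder names are made up for illustration, and it assumes the `ocrmypdf` command is installed and on your PATH. `--rotate-pages` and `--jobs` are the ocrmypdf options for automatic page rotation and parallel workers. The original file names are kept, and the source PDFs are not touched.

```python
import subprocess
from pathlib import Path


def build_command(src: Path, dst: Path, jobs: int = 4) -> list[str]:
    """Build the ocrmypdf command line for one PDF (hypothetical helper)."""
    return [
        "ocrmypdf",
        "--rotate-pages",      # detect page orientation and rotate before OCR
        "--jobs", str(jobs),   # number of parallel worker processes
        str(src),
        str(dst),
    ]


def plan_batch(in_dir: Path, out_dir: Path) -> list[tuple[Path, Path]]:
    """Pair every *.pdf below in_dir with an output path of the same name."""
    return [
        (src, out_dir / src.relative_to(in_dir))
        for src in sorted(in_dir.rglob("*.pdf"))
    ]


def convert_all(in_dir: Path, out_dir: Path, jobs: int = 4) -> None:
    """Run ocrmypdf on each planned file; report failures and keep going."""
    for src, dst in plan_batch(in_dir, out_dir):
        dst.parent.mkdir(parents=True, exist_ok=True)
        result = subprocess.run(build_command(src, dst, jobs))
        if result.returncode != 0:
            print(f"ocrmypdf failed for {src} (exit code {result.returncode})")
```

You would call it as something like `convert_all(Path("archive"), Path("archive_ocr"))` and then import the output folder into Paperwork. Note that the `*.pdf` pattern is case-sensitive on Linux, so files ending in `.PDF` would need their own pattern.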