What does "Already found in the index" mean?

JoshuaPK · March 3, 2020, 1:11am

Hello! When importing a large number of documents, what does the following line mean?

INFO paperwork_backend.docimport Document file:///home/josh/Documents/IncomingScanned/9.06.pdf already found in the index. Skipped

Does it mean that a document with the filename 9.06.pdf is already in the index? Or does Paperwork compute a checksum of the PDF file when it imports it, and then check whether that checksum is already in the database?

If it’s a filename rather than a checksum then I may be in trouble. I do a huge number of scans in batches, so I may have files named 01.01.pdf to 12.01.pdf to 01.15.pdf to 10.15.pdf. I import those and delete them from the IncomingScanned directory, so the next time I do batch scanning I’ll have repeats of the same filenames. Maybe that wasn’t such a good idea…

JoshuaPK · March 3, 2020, 1:22am

After looking at the backend I see it looks for the file hash, so that is good.

jflesch · March 3, 2020, 8:44am

Yep, it uses the hash of the PDF to detect duplicates.

It is a contribution from Josselin Jacquard, and I must say it’s very handy for me (and for others I assume): I place all the PDFs I want to import in ~/tmp/pdf. After importing them I delete them, but sometimes I forget. So the next time I import PDFs, it makes sure I don’t import those files again.

Using the file name wouldn’t work for me either. Some companies always send me bills named “bill.pdf”.
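
For readers curious what content-based duplicate detection looks like in general, here is a minimal sketch. It is not Paperwork's actual code: the function names `file_digest` and `import_new_files`, the choice of SHA-256, and the in-memory set of known hashes are all assumptions made for illustration.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file's contents (hash choice is an assumption)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fd:
        # Read in chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: fd.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def import_new_files(paths, known_hashes):
    """Return the paths whose content has not been seen before.

    known_hashes is a set of digests of already-imported files; it is
    updated in place. File names play no role in the comparison.
    """
    imported = []
    for path in paths:
        content_hash = file_digest(path)
        if content_hash in known_hashes:
            # Same bytes as an earlier import: skip, whatever the name is.
            continue
        known_hashes.add(content_hash)
        imported.append(path)
    return imported
```

With this approach, two files with different names but identical bytes count as one document, while a new “bill.pdf” with different content is imported even though an older “bill.pdf” was imported before.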