Backlink to original PDF files used for import

tomf · January 4, 2021, 3:31pm

Many people in the forum have asked for a feature that remembers
and shows the original file name used when a PDF file is imported
(rather than scanned directly). My workflow is to first scan the
paper to an SMB share, because my scanners can scan to SMB, and
then to import one or more PDFs into Paperwork.

Luckily, Paperwork copies the original PDF file into its
internal directory structure, and:
“Paperwork always keeps the original PDF file as is, even if
you edit some of its pages: the edited pages are stored
beside the PDF file.” [usage.tex]

So doc.pdf is just a (renamed) copy with identical date and
size. I experimented with finding PDF files with exactly the
same date on the disk, but it seems easier to look for the
identical size: on Linux, just go to the internal Paperwork
directory of the wanted document and search using:

find ~ -size `du -b doc.pdf|awk '{print $1"c";}'`

(Or replace ~ with whatever other root or network share you need.)

This gives you a list of all files with the same size as the
(renamed) copy “doc.pdf”. That is not too many files, even in
my 180G home directory (surprisingly, the sizes are mostly
unique). Luckily, the scanned and imported PDF files all have
different sizes due to encoding and compression.
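
For anyone who prefers a script, here is a rough Python sketch of the same idea. It is only an illustration, not something Paperwork provides: the function names find_same_size and find_originals are made up for this post. It looks for regular files with the same byte size as doc.pdf and then, as an optional extra step beyond the find command above, keeps only those whose content is byte-identical.

```python
import filecmp
import os
from pathlib import Path


def find_same_size(reference, root):
    """Return regular files under root with the same byte size as reference.

    Hypothetical helper; roughly what `find ROOT -size Nc` does, but
    limited to regular files and skipping the reference file itself.
    """
    reference = Path(reference)
    wanted = reference.stat().st_size
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            candidate = Path(dirpath) / name
            try:
                if candidate.is_symlink() or not candidate.is_file():
                    continue
                if candidate.stat().st_size != wanted:
                    continue
                if candidate.samefile(reference):
                    continue
            except OSError:
                # Unreadable or vanished file: ignore it and carry on.
                continue
            matches.append(candidate)
    return sorted(matches)


def find_originals(reference, root):
    """Keep only the same-size files whose content is byte-identical."""
    return [path for path in find_same_size(reference, root)
            if filecmp.cmp(reference, path, shallow=False)]
```

Calling find_originals("doc.pdf", Path.home()) from inside the document's directory should then list the likely source files; pass the mount point of the network share instead of the home directory if the scans live there.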

For Windows users I recommend TotalCommander: press ALT-F7 and
use the extended search for size and/or date.

BR
Tom

BTW: Another way would be to parse all of Paperwork's log
files externally and extract all the sections that look
like

[INFO ] [paperwork_gtk.docimport ] - Imported files: {'file:///****/originalDocument.pdf'}
[INFO ] [paperwork_gtk.docimport ] - Non-imported files: []
[INFO ] [paperwork_gtk.docimport ] - New documents: {'20201227_2354_23'}

to a file or database with two columns:
/****/originalDocument.pdf 20201227_2354_23
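
A minimal sketch of such an external parser in Python might look like this. It assumes the log lines look exactly like the excerpt above (I have not checked other Paperwork versions), and the name import_pairs is made up for this post. It accepts both straight and typographic quotes, because forum software tends to convert them.

```python
import re
from urllib.parse import unquote, urlparse

# Matches one quoted item; file names containing quote characters
# themselves are not handled by this sketch.
QUOTED = re.compile(r"['‘’\"]([^'‘’\"]+)['‘’\"]")


def import_pairs(lines):
    """Yield (original_path, document_id) pairs from docimport log lines.

    Assumes an "Imported files:" line is followed later by the matching
    "New documents:" line. If several files and several documents appear
    in one import, the log does not say which belongs to which, so every
    combination is emitted.
    """
    pending = []
    for line in lines:
        # "Non-imported files:" has a lowercase "i", so it does not match.
        if "Imported files:" in line:
            uris = QUOTED.findall(line.split("Imported files:", 1)[1])
            pending = [unquote(urlparse(uri).path) for uri in uris]
        elif "New documents:" in line:
            doc_ids = QUOTED.findall(line.split("New documents:", 1)[1])
            for path in pending:
                for doc_id in doc_ids:
                    yield path, doc_id
            pending = []
```

Feeding it an open log file and writing each pair as one tab-separated line would produce the two-column file described above.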

This could perhaps be implemented in Paperwork's internal
routines: whenever something is imported, just append one line
to some import.log file. The “ignore everything that is already
imported” routine probably already has some checksummed/hashed
index?

But even with the file name known, the user still has to search
for the file with that name, which would be no easier than
looking for the exact file size (or date).

tomf · January 4, 2021, 3:33pm

find ~ -size `du -b doc.pdf|awk '{print $1"c";}'`