Hello,
I would like to migrate my custom document archive to Paperwork. I created the documents with my own script, scan2pdf (GitHub: glatzor/scan2pdf, a command-line tool to scan pages to PDF).
How could I perform a mass import that reuses the timestamp and tags from the file names? The files are named: YYYY-MM-DD_tag1_tag2_tag3.pdf
Is there any easy way? Or do I have to parse the output of the cli command manually for the new document id and apply date and labels in a second step?
Regards,
Sebastian
Hello,
My suggestion would be to make a shell script that would do the following:
- disable Paperwork’s automatic labeling of documents
- for each document:
  - import the document, reusing the date specified in its current file name
  - apply the tags specified in its current file name
- re-enable the automatic labeling
For that, you can use paperwork-cli/paperwork-json. While paperwork-cli is made to be human-friendly, paperwork-json is designed to be used in shell scripts with jq.
You can specify the Paperwork document ID you want when importing. Document IDs in Paperwork have the format YYYYMMDD_hhmm_ss[_nn] (ex: 20120628_1942_45; nn is used in case of collision).
Splitting strings with bash is tricky. I for one would use the powers of $IFS to split it.
I guess it would look something like:
# just because better safe than sorry
paperwork-cli sync
# disable automatic label guessing for paperwork-cli
paperwork-cli plugins remove paperwork_backend.guesswork.label.sklearn
backup_IFS="$IFS"
for doc_filepath in ${your_file_list}; do
    # reset the per-document fields on every iteration
    date=""
    year=""
    month=""
    day=""
    labels=""
    # parse your current filenames
    doc_filename=$(basename -- "$doc_filepath")
    doc_filename="${doc_filename%.*}"
    # first field is the date, all the others are labels
    IFS='_'
    for field in ${doc_filename} ; do
        if [ -z "${date}" ] ; then date="${field}"
        else labels="${labels} ${field}"
        fi
    done
    # split the date (YYYY-MM-DD) into year, month and day
    IFS='-'
    for field in ${date} ; do
        if [ -z "${year}" ] ; then year="${field}"
        elif [ -z "${month}" ] ; then month="${field}"
        elif [ -z "${day}" ] ; then day="${field}"
        fi
    done
    IFS="${backup_IFS}"
    # import in Paperwork (0000_00 is a placeholder for the time part)
    paperwork_doc_id="${year}${month}${day}_0000_00"
    paperwork-cli import "${doc_filepath}" --doc_id "${paperwork_doc_id}"
    # ${labels} is left unquoted on purpose, to split it on spaces
    for label in ${labels} ; do
        paperwork-cli label add "${paperwork_doc_id}" "${label}"
    done
done
# re-enable the automatic label guessing
paperwork-cli plugins reset
(I haven’t actually entirely tested it)
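If the $IFS part looks obscure, here is a minimal standalone sketch of that splitting step, which you can try without Paperwork installed. The helper name split_fields is made up for this example. It runs in a subshell, so there is no need to back up and restore $IFS by hand.

```shell
# Hypothetical helper: split a string on a separator, one field per line.
# Usage: split_fields SEPARATOR STRING
split_fields() (
    # The function body is in parentheses, so it runs in a subshell and
    # the IFS change below cannot leak into the calling script.
    IFS="$1"
    set -f       # disable globbing so a field like "*" stays literal
    set -- $2    # unquoted on purpose: this is where the split happens
    printf '%s\n' "$@"
)

# File name without its extension, as in the script above
split_fields _ "2021-03-14_invoice_car"
# Date alone, split into year, month and day
split_fields - "2021-03-14"
```

The first call prints the date and then one tag per line. The second prints the year, month and day.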
Label colors will be generated randomly (unless you use -c), but they can easily be changed in paperwork-gtk afterward.
Beware that unfortunately paperwork-XXX plugins remove was broken in the latest release of Paperwork (2.1.2) and was fixed right after the release. You can however get fresh builds including this fix as AppImage, Flatpak or from sources.
Assuming you have many documents from the same day (likely), there will be document ID collisions.
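One possible way around the collisions, as a rough sketch: instead of always using 0000_00, encode a per-day counter in the hhmm_ss part of the ID. The helper names make_doc_id and assign_ids are made up for this example, and it assumes Paperwork accepts any ID of the form YYYYMMDD_hhmm_ss passed at import time, which I have not verified. The input has to be sorted so that files from the same day are adjacent.

```shell
# Hypothetical helper: build an ID of the form YYYYMMDD_hhmm_ss from a
# date and a per-day counter. The counter is encoded as a time of day,
# so it must stay below 86400.
# Usage: make_doc_id YYYYMMDD COUNTER
make_doc_id() {
    printf '%s_%02d%02d_%02d\n' "$1" \
        $(( $2 / 3600 )) $(( ($2 / 60) % 60 )) $(( $2 % 60 ))
}

# Hypothetical helper: read sorted file names (without extension) on
# stdin and print "<doc_id> <name>" for each of them.
assign_ids() {
    last_date=""
    counter=0
    while IFS= read -r name; do
        date_part="${name%%_*}"                      # e.g. 2021-03-14
        ymd=$(printf '%s' "$date_part" | tr -d '-')  # e.g. 20210314
        if [ "$date_part" = "$last_date" ]; then
            counter=$(( counter + 1 ))   # same day: next free slot
        else
            last_date="$date_part"       # new day: start again at 0
            counter=0
        fi
        printf '%s %s\n' "$(make_doc_id "$ymd" "$counter")" "$name"
    done
}

printf '%s\n' 2021-03-14_invoice_car 2021-03-14_letter 2021-03-15_tax | assign_ids
```

The IDs printed this way could then replace the fixed ${year}${month}${day}_0000_00 in the import script.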