Importing existing archive

glatzor · January 16, 2023, 4:05pm

Hello,
I would like to migrate my custom document archive to Paperwork (I created the documents with my custom script GitHub - glatzor/scan2pdf: Command line tool to scan pages to PDF).
How could I perform a mass import that reuses the timestamp and tags from the file name? The files are named: YYYY-MM-DD_tag1_tag2_tag3.pdf
Is there an easy way? Or do I have to parse the output of the CLI command manually to get the new document ID, and then apply the date and labels in a second step?
Regards,
Sebastian

jflesch · January 16, 2023, 5:16pm

Hello,

My suggestion would be to make a shell script that would do the following:

- disable Paperwork’s automatic labeling of documents
- for each document:
  - import the document, reusing the date specified in its current file name
  - apply the tags specified in its current file name
- re-enable the automatic labeling

For that, you can use paperwork-cli/paperwork-json. While paperwork-cli is made to be human-friendly, paperwork-json is designed to be used in shell scripts with jq.

You can specify the Paperwork document ID you want when importing. Document IDs in Paperwork have the format YYYYMMDD_hhmm_ss[_nn] (e.g. 20120628_1942_45; nn is used in case of collision).

Splitting strings with bash is tricky. I for one would use the powers of $IFS to split it.

I guess it would look something like:

# just because better safe than sorry
paperwork-cli sync

# disable automatic label guessing for paperwork-cli
paperwork-cli plugins remove paperwork_backend.guesswork.label.sklearn

backup_IFS="$IFS"

for doc_filepath in ${your_file_list}; do
    date=""
    year=""
    month=""
    day=""
    labels=""

    # parse your current filenames
    doc_filename=$(basename -- "$doc_filepath")
    doc_filename="${doc_filename%.*}"

    IFS='_'
    for field in ${doc_filename} ; do
        if [ -z "${date}" ] ; then date="${field}"
        else labels="${labels} ${field}"
        fi
    done
    IFS='-'
    for field in ${date} ; do
        if [ -z "${year}" ] ; then year="${field}"
        elif [ -z "${month}" ] ; then month="${field}"
        elif [ -z "${day}" ] ; then day="${field}"
        fi
    done

    IFS="${backup_IFS}"

    # import in Paperwork
    paperwork_doc_id="${year}${month}${day}_0000_00"
    paperwork-cli import "${doc_filepath}" --doc_id "${paperwork_doc_id}"
    # ${labels} is left unquoted on purpose so that it splits on spaces
    for label in ${labels} ; do
        paperwork-cli label add "${paperwork_doc_id}" "${label}"
    done
done

# re-enable the automatic label guessing
paperwork-cli plugins reset

(I haven’t actually entirely tested it)
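As an alternative to switching $IFS back and forth, the file name parsing alone could be sketched like this. It is only a sketch: the helper name parse_doc_filename and the variables doc_date and doc_labels are made up for illustration, and it assumes bash (arrays and here-strings) and file names that strictly follow YYYY-MM-DD_tag1_tag2.pdf.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of Paperwork): splits a file name such as
# "2023-01-16_invoice_car.pdf" into a compact date and an array of tags.
# Results are stored in the global variables doc_date and doc_labels.
parse_doc_filename() {
    local name fields
    name=$(basename -- "$1")
    name="${name%.*}"                     # drop the extension

    # IFS is changed only for this single `read`, so there is nothing to restore
    IFS='_' read -r -a fields <<< "${name}"

    doc_date="${fields[0]//-/}"           # 2023-01-16 -> 20230116
    doc_labels=("${fields[@]:1}")         # every remaining field is a tag
}

parse_doc_filename "/tmp/2023-01-16_invoice_car.pdf"
echo "${doc_date}"         # 20230116
echo "${doc_labels[@]}"    # invoice car
```

Because the tags end up in an array, they can be looped over with for label in "${doc_labels[@]}" without relying on word splitting.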

Label colors will be generated randomly (unless you use -c), but they can easily be changed in paperwork-gtk afterward.

Beware that, unfortunately, paperwork-XXX plugins remove was broken in the latest release of Paperwork (2.1.2); it was fixed right after the release. You can however get fresh builds that include this fix as AppImage, Flatpak, or from sources.

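Assuming you have several documents from the same day (likely), there will be document ID collisions, since the script above gives every document of a given day the same ID (YYYYMMDD_0000_00).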
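One way to work around that could be to count the documents seen for each day and encode the counter in the hhmm_ss part of the ID. This is an untested sketch: next_doc_id, next_id and docs_per_day are hypothetical names, it needs bash 4+ for associative arrays, and it assumes that a made-up time of day is acceptable (the script above already uses 0000_00). It only avoids collisions within one run; it does not check which IDs already exist in your Paperwork work directory.

```shell
#!/usr/bin/env bash

# Number of documents already seen for each day, keyed by YYYYMMDD (bash 4+)
declare -A docs_per_day=()

# Hypothetical helper: sets the global variable next_id to an ID of the form
# YYYYMMDD_hhmm_ss that has not been handed out yet during this run.
# The result goes into a variable rather than stdout, because calling the
# function through $(...) would run it in a subshell and lose the counter.
next_doc_id() {
    local day="$1"
    local n="${docs_per_day[$day]:-0}"
    docs_per_day[$day]=$(( n + 1 ))

    # treat the counter as "seconds since midnight": 0 -> 0000_00, 61 -> 0001_01
    printf -v next_id '%s_%02d%02d_%02d' \
        "${day}" $(( n / 3600 )) $(( n % 3600 / 60 )) $(( n % 60 ))
}

next_doc_id 20230116 ; echo "${next_id}"    # 20230116_0000_00
next_doc_id 20230116 ; echo "${next_id}"    # 20230116_0000_01
next_doc_id 20230117 ; echo "${next_id}"    # 20230117_0000_00
```

In the import loop, paperwork_doc_id would then be taken from next_id instead of being built with a fixed _0000_00 suffix.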