Proposals for optimization

AForster · February 22, 2023, 8:50am

Hello,
I am currently using Paperwork 2.1.1. I see two issues that consume a lot of runtime and where an improvement would be useful:

1. When viewing an old document (e.g. from the year 2009), Paperwork spends a lot of time loading thumbnails. It seems to load all thumbnails from today back to 2009. It would be more efficient to load only the thumbnails that actually need to be displayed on the screen.
2. On startup, the identification of changes takes a long time, and a training of the label guesser is done. A generally faster startup would be helpful. For example, the software Syncthing is able to detect changes in huge directories within a second. It is open source, so maybe a “carry over” makes sense.

Kind regards,
Andreas

jflesch · March 1, 2023, 7:39pm

I’m going to have a look. It should be doable.
I would be surprised if there is any trick to what Syncthing does. Maybe it’s faster simply because it’s in Go instead of Python, maybe because it doesn’t use GLib’s GIO. Anyway, I’ll have a look whenever I have time.
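
For illustration only, here is a minimal sketch of one common way to detect changes in a directory tree: record each file's modification time and size, then compare two such snapshots. It is not a description of what Syncthing or Paperwork actually do; the function names `snapshot` and `diff_snapshots` are hypothetical, and it uses only Python's standard `os.scandir`.

```python
import os


def snapshot(root):
    """Map every regular file under root to its (mtime_ns, size)."""
    state = {}
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    # Descend into subdirectories without recursion.
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    st = entry.stat(follow_symlinks=False)
                    state[entry.path] = (st.st_mtime_ns, st.st_size)
    return state


def diff_snapshots(old, new):
    """Return sorted lists of (added, removed, changed) file paths."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(
        path for path in new.keys() & old.keys() if new[path] != old[path]
    )
    return added, removed, changed
```

A snapshot from the previous run would have to be stored somewhere and compared against a fresh one at startup. This approach never reads file contents, so an edit that keeps both the size and the modification time unchanged would go unnoticed.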

Regarding the label guesser training, I only learned a few months ago that sklearn models can actually be serialized and deserialized, which means Paperwork could save the trained label guesser to disk. That could help improve this part a little.
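
A minimal sketch of that idea, assuming the trained model is cached with Python's `pickle` module (the scikit-learn documentation describes persisting fitted estimators with `pickle` or `joblib`). Everything here is hypothetical: a plain dict stands in for the real trained model, and `train`, `save_model`, `load_model` and `load_or_train` are illustrative names, not Paperwork code.

```python
import os
import pickle
import tempfile


def train(samples):
    """Hypothetical stand-in for training: map each word to the last label seen with it."""
    model = {}
    for text, label in samples:
        for word in text.lower().split():
            model[word] = label
    return model


def save_model(model, path):
    """Write to a temporary file first so a crash never leaves a half-written cache."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as fh:
        pickle.dump(model, fh)
    os.replace(tmp_path, path)


def load_model(path):
    """Return the cached model, or None if the cache is missing or unreadable."""
    try:
        with open(path, "rb") as fh:
            return pickle.load(fh)
    except (OSError, EOFError, pickle.UnpicklingError):
        return None


def load_or_train(path, samples):
    """Reuse the cached model when there is one, otherwise train and cache it."""
    model = load_model(path)
    if model is None:
        model = train(samples)
        save_model(model, path)
    return model
```

Two caveats: the cache would have to be discarded whenever documents or labels change, and pickle files should only be loaded if they were written by the application itself, since unpickling untrusted data can execute arbitrary code.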

jflesch · March 1, 2023, 11:07pm

@AForster

For point 1, I’ve given it a try.

This change has been done on the branch ‘develop’. It will only be available in the build ‘master’ at the next minor release (Paperwork 2.2). In the meantime, if you want, you can already give it a try:

on Linux:
- Flatpak: flatpak --user install https://builder.openpaper.work/paperwork_develop.flatpakref
- AppImage: wget https://download.openpaper.work/linux/amd64/paperwork-gtk-develop-latest.appimage
on Windows: https://download.openpaper.work/windows/installer/paperwork_develop_installer.exe

It seems to improve things a bit, but it doesn’t fix everything. The root of the problem is that Paperwork uses a Gtk.ListBox to display the document list. Unfortunately, this widget doesn’t scale well (the Python Gtk.Layout that I have for each document row probably doesn’t help). So if you try to go to (let’s say) the 3000th document, things will still be really slow.
At some point, I will have to replace Gtk.ListBox with a custom implementation (Gtk.ListBox assumes each row has its own size, whereas I can assume they all have the same size).
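
To illustrate why a uniform row height helps: when every row has the same height, the rows that intersect the viewport can be computed directly from the scroll position, without measuring any of the rows above them. The sketch below is only that arithmetic, not actual Paperwork or GTK code; `visible_rows` and its parameters (integer pixel values, plus an `overscan` margin of extra rows) are hypothetical.

```python
def visible_rows(scroll_offset, viewport_height, row_height, row_count, overscan=2):
    """Return the range of row indexes that need to be rendered.

    All rows are assumed to be exactly row_height pixels tall, so the
    result is computed in constant time regardless of row_count.
    """
    if row_height <= 0:
        raise ValueError("row_height must be positive")
    if row_count <= 0 or viewport_height <= 0:
        return range(0)
    scroll_offset = max(0, scroll_offset)
    # First row whose area overlaps the top of the viewport.
    first = max(0, scroll_offset // row_height - overscan)
    # One past the last row overlapping the bottom (ceiling division).
    bottom = scroll_offset + viewport_height
    last = min(row_count, -(-bottom // row_height) + overscan)
    return range(first, max(first, last))
```

For example, with 100-pixel rows, a 600-pixel viewport and a scroll offset of 300000 pixels, only rows 2998 to 3007 would need their thumbnails loaded, however many documents come before them.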