Heads up: Paperwork 1.99 + Request for comments

jflesch · April 12, 2020, 8:17pm

Hello,

Good news everyone ! Today I’ve tagged Paperwork 1.99. It’s what-will-be-Paperwork-2.0-but-is-not-yet-finished. And I would like your comments.

I’ve tagged this version 1.99 because it has reached a point where it is “usable”: All of the main features of Paperwork 1.3.1 are available … they may just not be really stable yet :). It is now my main document manager (in other words, I don’t use Paperwork 1.3.1 anymore at all).

There are also a bunch of improvements compared to Paperwork 1.3.1 (updated 2020/7/3):

Entirely modular design:
- Pretty much any feature can be disabled
- Code is cleaner, therefore contributions will probably be easier (once the documentation has been written)
- You can expect faster and easier improvements later on
Better performance:
- https://peertube.video/videos/watch/ae542b4f-c18b-4758-b45b-0126d6a5691e
- https://peertube.video/videos/watch/5253f1f3-7347-49e9-b996-723b81ce69c0
Command-line interface that kicks ass.
Semi-automatic bug report system that kicks ass (I’m going to need it …)
Editable PDFs
Smaller exported PDFs
Automatic scan border guessing that … doesn’t really kick ass …

Still missing (updated 2020/7/5):

~~Advanced search dialog~~
~~Translations~~ (actual translations will be done in testing phase)
~~Welcome document~~
~~Help menu~~
~~Drag-n-drop of pages inside PDFs~~
~~Multiple document selection~~
~~Windows port~~
~~Commands to modify the plugin list don’t handle dependencies yet~~
~~Automatic reset of the plugin list when Paperwork is updated~~
~~New version detection + a welcome page/screen~~
~~Text copy~~
~~Keyboard shortcuts (Ctrl-F, etc)~~
~~Search keyword suggestions~~
~~Flatpak: Popup in the settings when no scanner is found regarding configuration~~
~~Import on file drop~~
~~Dialog ‘about’~~
~~Dialog in case of uncaught exception (“unexpected error”)~~
Code documentation (will be done during testing phase)
User documentation (will be done during testing phase)
Search in page: next/previous (#480) (maybe only for Paperwork 2.1)
Dialog for multiple-scans (maybe only for Paperwork 2.1)

I’ve also fixed a big bunch of minor defects along the way. For instance, drag-n-drop behavior is now a bit more consistent; you can even drop a Paperwork page in Firefox (fairly useless, but meh).

Anyway since it’s a full rewrite, I would appreciate it if some of you could give it a quick try, see if you can find any missing features or bugs.

To try it, the easiest solution is to use Flatpak. You can just follow the usual instructions and replace every occurrence of master with develop.
Please do back up your documents beforehand. It won’t ever delete them by accident, but it will downsize the thumbnails generated by Paperwork 1.3.1.
To uninstall it: flatpak uninstall work.openpaper.Paperwork//develop

Thanks in advance,

AForster · April 18, 2020, 8:30pm

Hello Jerome,

Thanks for your work and for improving the maturity of OpenPaperwork so much.

I managed to install it in a VirtualBox VM, because on my production system the GPG signature error showed up (we already had this in October/November last year).

To install it on Linux Mint, at least the following GNOME package must be installed first:
flatpak install flathub org.gnome.Platform//3.36

There are some findings:

Scanning produces an exception. I filed the bug report a few minutes ago. In the setup menu, Paperwork detects that I have the Samsung C48X connected, but scanning is not possible.
When editing the scanned image, the buttons are now on the left, including the zoom bar. This is slightly irritating, but maybe it is just a matter of getting used to it.
Going into the document properties requires one more mouse click compared to 1.3.0 - why not just use the known button? Having the printing and export options in this menu is good, but the properties should be accessible directly from the thumbnail.
The way to select between scanning and importing is new. It is good to have “import” there as a button when you select it in the dropdown menu. But in my opinion, there is enough space in the top bar to have both buttons for scanning and importing there all the time.
Question: What are the [i] button and the three-dots menu at the top for?
Is it intended or a mistake that the window does not show the minimize and restore icons at the top?

Regarding speed: congratulations, that is really well done. Although I am in a VirtualBox environment with limitations, searching was faster than on my production system. Also on import, the rotation and OCR are very fast.

Hoping for the next release candidate, and with greetings!
Kind regards,
Andreas

AForster · April 18, 2020, 8:38pm

Hi Jerome,

one more issue:
Although the drag-and-drop feature has improved in general, the pages are shown incorrectly after the drop. You need to re-open the document to see them correctly.
E.g.: I have a document with 4 pages. I drag page 4 and drop it to 2. Then the former page 4 is shown as page 2 (correct), the former page 3 is shown as page 4 (correct). But the page 3 is not updated. It should show the old page 2, but still shows the old page 3.
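
As a minimal sketch of the ordering described in this example (the helper `move_page` is hypothetical and not part of Paperwork), moving page 4 to position 2 in a 4-page document should give:

```python
def move_page(pages, src, dst):
    """Return a new list with the page at 1-based position `src`
    moved to 1-based position `dst`."""
    reordered = list(pages)
    page = reordered.pop(src - 1)
    reordered.insert(dst - 1, page)
    return reordered


pages = ["old page 1", "old page 2", "old page 3", "old page 4"]
print(move_page(pages, 4, 2))
# ['old page 1', 'old page 4', 'old page 2', 'old page 3']
```

So position 3 should show the old page 2, which is the slot reported as not being refreshed.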

And I did not manage to drop a page into a new document.

Kind regards,
Andreas

jflesch · April 19, 2020, 2:36pm

To install it on Linux Mint, at least the following GNOME package must be installed first:
flatpak install flathub org.gnome.Platform//3.36

Weird, Flatpak should have suggested it automatically. I’ll have a look at it later.

Scanning produces an exception. I filed the bug report a few minutes ago. In the setup menu Paperwork detects that I have the Samsung C48X connected, but scanning is not possible

It should be fixed now. You can run flatpak --user update.
The problem was that your scanner or Sane backend didn’t let Paperwork set the scan mode (Color/Grayscale/Black&White, etc)

Going into the document properties requires one more mouse click compared to 1.3.0 - why not just use the known button? Having the printing and export options in this menu is good, but the properties should be accessible directly from the thumbnail

I think this is something that still needs some discussion with Mathieu Jourdan. Worst case scenario, it remains as is for Paperwork 2.0, and I will change it later for Paperwork 2.1.

The way to select between scanning and importing is new. It is good to have “import” there as a button when you select it in the dropdown menu. But in my opinion, there is enough space in the top bar to have both buttons for scanning and importing there all the time

This is a long and still on-going debate: Replace the button "scan" (#537) · Issues · World / OpenPaperwork / paperwork · GitLab
For now, this is how it is going to be for Paperwork 2.0.
Later using the plugin system, I intend to let people try other setups. We will see if one of them eventually gets consensus.

Question: What are the [i] button and the three-dots menu at the top for?

[i]: Document information (for instance PDF metadata, etc). Not yet implemented. Not sure yet whether I will implement it for Paperwork 2.0 or just remove the button for now.
three-dots button: Will be used when selecting multiple documents. Not yet implemented either. And I’m not sure either whether I’ll implement it for Paperwork 2.0 or not.

Is it intended or a mistake that the window does not show the minimize and restore icons on the top?

That’s the default GNOME Shell configuration. To be honest, I haven’t checked yet whether they are correctly displayed with other window managers. It may be a bug. I’ll have a look later.

Regarding Speed: congratulations, that is really well done.

\o/

Although the drag-and-drop feature has improved in general, the pages are shown incorrectly after the drop. You need to re-open the document to see them correctly.

I’ll have a look. Looks like a bug.

mstein · April 19, 2020, 3:55pm

Hello Jerome,

very cool to see that the 2.0 is becoming reality! It’s soo much quicker! I don’t know what you did, it’s cool:-)
There are two small things where I think the workflow became a little more effort:

Editing labels/changing the document date: Compare the currently selected document on the left. In 1.3 there were two buttons, one for properties, one for deleting. I often use these properties. Now in 2.0 it takes one extra click on the three dots to open the menu. There I find things I personally rarely use: export, print, open folder, redo OCR. It would be great if the properties button found its way back into the main view. I edit nearly every date, so the extra click is a little painful for me.
Cosmetic: When cropping a scan, you need to click edit on the bottom right, click on the scissors on the top left and confirm on the top right. Can you please keep the buttons in one place, so I only need to look there?
The zooming with Ctrl + Mousewheel is doing very small steps. It feels that 1 scroll is doing like +2%. With the paperwork 2.0 I need to do around 5 scrolls (scroll ticks, how ever they are named^^) to zoom what I expect to happen when I do a single scroll.

Info: I installed paperwork via flatpak on Arch linux. Works out of the box, thanks for your documentation.

Don’t get me wrong: I loved the 1.3 already. The autotagging and fulltext search are very cool, and all that by not giving all data to evernote or similar. I’m a big fan of your project!

Manuel

jflesch · April 20, 2020, 5:44pm

@AForster

I had a look at the bug report you submitted. Looks like the root cause of your problem was a crash of Tesseract:

pyocr.error.TesseractError: (-11, b'Tesseract Open Source OCR Engine v4.0.0 with Leptonica\ncontains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 502\n')

I have to fix 2 things:

I’ll update the version of Tesseract that is currently provided with the Flatpak version of Paperwork
Paperwork must handle Tesseract crashes more gracefully (you won’t get the text anyway, but the GUI shouldn’t go haywire)
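
To illustrate the second point, here is a minimal sketch of what "handling the crash gracefully" could look like. It is not Paperwork's actual code: `OcrEngineError` is a hypothetical stand-in for an error such as `pyocr.error.TesseractError`, and `ocr_or_empty` and `crashing_engine` are made-up names.

```python
import logging

logger = logging.getLogger("ocr_sketch")


class OcrEngineError(Exception):
    """Hypothetical stand-in for an OCR engine error."""


def ocr_or_empty(run_ocr, image):
    """Call `run_ocr(image)`; if the engine fails, log it and return no text."""
    try:
        return run_ocr(image)
    except OcrEngineError as exc:
        # The page simply ends up without text; the caller can carry on.
        logger.warning("OCR failed, continuing without text: %s", exc)
        return ""


def crashing_engine(image):
    # Simulates an engine that always fails (assumed values).
    raise OcrEngineError(-11, "assert failed")


print(repr(ocr_or_empty(crashing_engine, "page.jpg")))  # ''
```

The idea is only that the failure is contained where OCR is invoked, so the rest of the application sees an empty result instead of an exception.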

(I guess I also need to add a way to reply to bug reports.)

@mstein

Editing Labels/change document date (…)

Yeah, you’re the third person to tell me something similar
It’s fixed. You can run flatpak --user update work.openpaper.Paperwork.

When cropping a scan, you need to click edit on the bottom right, click on the scissors on the top left and confirm on the top right

I’ve moved the editing toolbar to the right. I think that should fix this problem.

The zooming with Ctrl + Mousewheel is doing very small steps

I’ll have a look asap.

Don’t get me wrong: I loved the 1.3 already

Don’t worry, I did ask for feedback. This is the perfect time to be nice by being mean

sebmacc · April 21, 2020, 9:19am

Hello !

I’m currently trying Paperwork Pre-2.0

When I want to change the work directory, in the settings dialog, I click on the button in the “Work directory” box, the window to select the work directory doesn’t get the focus and cannot get it. I have to close the Settings window to be able to interact with the file selection dialog.

Now that I have changed this setting, paperwork is eating my CPU, I’m waiting for it to finish its indexing-or-something-else work, then I’ll test it and I’ll come back here afterwards

sebmacc · April 21, 2020, 9:59am

Performance is amazing

Globally it seems to work well, but I have only tested reading operations, no import…

jflesch · April 21, 2020, 5:51pm

I click on the button in the “Work directory” box, the window to select the work directory doesn’t get the focus and cannot get it.

Sorry. It’s a bug I introduced yesterday evening. It should be fixed now → flatpak --user update work.openpaper.Paperwork.

jflesch · April 22, 2020, 3:21pm

@AForster : Hm, yeah, drag-n-drop in PDF is not really supported at the moment (there is a missing piece). I’ll fix that later.

AForster · April 22, 2020, 7:17pm

Hi Jerome,

I see Paperwork making big steps ahead, very good progress.

Here are further comments, and I will elaborate a little bit longer about the drag&drop thing.

The automatic crop detection is very slow in my environment: it takes minutes, with unsatisfying results. I propose making it an option in the settings which is marked experimental and off by default. This opens the possibility to improve libpillowfight at a later time and then enable the feature.
You improved the scanning of the next page while the recently scanned page(s) is/are still being processed. That is very good, I like this feature because I can then scan pages one after another and let the computer do its job in the background. Thanks for this improvement. By the way, what happens if the user exits the SW while the background tasks are not yet finished?
Importing a folder with mixed content of PDFs and images: In version 1.x there was a dialog asking which action should be taken - this is missing in 2.0. By default I would expect to have one document per PDF and all images in one document. But this is because I know how Paperwork works. Other users might not understand this. There could be a guessing algorithm like foo1.jpg and foo2.jpg go into one document, while bar.jpg and bar_something.jpg go into another document.
Also for importing: when I view a document and import images, they will be appended at the end. When I import a PDF, it is put into a new document. Again, I understand that this is because of the way the SW works. But the user might not, so at least an explanatory message might help.
A really cool feature, and I like it, is that for imported PDFs the page orientation is corrected and editing is possible. Well done! The logic behind it is fantastic, but it also brings some quirks: As paper.x.jpg is overlaying the PDF page, this page is treated like an image. I can also drop it to another place. But this does not give the result the user expects, because page x will be displayed twice, while another page disappears.
Drag and drop for PDFs, as indicated above, is possible in the GUI, but causes no change in the backend. That is irritating for the user. Either you indicate, e.g. by a red bar, that it is not possible for PDFs, or you convert all relevant pages from PDF to jpg. But attention, then Paperwork leaves the philosophy of keeping the files in the original format, and it becomes more and more a PDF editing tool. That would be cool, but must be designed very well, and bears risks.
Another really cool feature is that you can scan additional pages into a document that was initially a PDF. Also dropping pages from other documents works fine. The inverse is not possible: appending pages from a PDF to a document that has jpgs for the first page(s). Again, I understand why, but for the user it will not be intuitive. To solve this, you would need a way to identify that the doc.pdf starts at page n. It could be something like: doc.n.pdf is a PDF starting at page n, and a doc.m.pdf in the same directory starts at page m. What then if the user drags and drops a page? Move the whole PDF? Inhibit moving pages of PDFs? The user might not understand…
When drag-and-dropping a page to another document, so that the current document becomes empty, it does not immediately disappear from the thumbnail bar. The folder on the disk seems to be deleted, but the thumbnail only disappears a few minutes later.
When importing or scanning, the newly created document appears in the thumbnail bar only after the OCR is finished. I think it should appear immediately.
Your change to Tesseract 4.11 gave improvements in stability. But I observed that it consumes a lot of CPU and apparently does not process the files while you view the document that is being processed. E.g. when redoing OCR, the background task only really runs after I view another document.
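
The filename-guessing idea from the import item above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `group_for_import` and the rule "stem up to the first underscore, without trailing digits" are made up here and are not how Paperwork imports files.

```python
import re
from collections import defaultdict
from pathlib import PurePath


def group_for_import(filenames):
    """Guess documents from file names: each PDF is its own document,
    images are grouped by a guessed base name."""
    documents = []
    images = defaultdict(list)
    for name in sorted(filenames):
        path = PurePath(name)
        if path.suffix.lower() == ".pdf":
            documents.append([name])
            continue
        # Guessed key: stem up to the first "_", without trailing digits.
        key = re.sub(r"\d+$", "", path.stem.split("_", 1)[0])
        images[key].append(name)
    documents.extend(images.values())
    return documents


print(group_for_import(
    ["foo1.jpg", "foo2.jpg", "bar.jpg", "bar_something.jpg", "doc.pdf"]
))
# [['doc.pdf'], ['bar.jpg', 'bar_something.jpg'], ['foo1.jpg', 'foo2.jpg']]
```

Any such heuristic will guess wrong for some naming schemes, which is an argument for still showing the user what will be created before importing.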

So, that was a really long list of comments and lots of findings. Please don’t feel discouraged. Paperwork is really in good shape. But we have to identify all the possible actions users will take. So I play around a lot with the tool (not in my production environment!!), because I like it and want to make it better.

Kind regards,

Andreas

jflesch · April 26, 2020, 10:04pm

@andreas PDF drag-n-drop should work once this build pipeline is finished. (The code for that is horribly complex, so I guess there will be bugs :/)

jflesch · April 27, 2020, 9:28am

The automatic Crop Detection is very slow in my environment, takes minutes, with unsatisfying results.

I’ll have a look to speed it up. In the meantime, what kind of CPU do you have ?

By the way, what happens if the user exits the SW while the background tasks are not yet finished?

For most background tasks, Paperwork main window will disappear but Paperwork will actually keep running until all the background tasks have finished.
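
A generic way to get this "window closes, process waits" behaviour is sketched below. It is a simplified illustration, not Paperwork's implementation; the names `worker` and `close_window_and_wait` and the two sample tasks are invented for the example.

```python
import queue
import threading

tasks = queue.Queue()
finished = []


def worker():
    """Run queued tasks one by one until a None sentinel arrives."""
    while True:
        task = tasks.get()
        if task is None:
            break
        finished.append(task())


def close_window_and_wait(thread):
    """Hypothetical shutdown: the window is gone, pending tasks still run."""
    tasks.put(None)  # nothing new is accepted after what is already queued
    thread.join()    # the process stays alive until the worker is done


thread = threading.Thread(target=worker)
thread.start()
tasks.put(lambda: "ocr page 1")
tasks.put(lambda: "index document")
close_window_and_wait(thread)
print(finished)  # ['ocr page 1', 'index document']
```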

Importing a folder with mixed content of pdf and images: In Version 1.x there was a dialog asking which action shall be taken - this is missing in 2.0.

Yeah, there is a big “TODO” in the code. To be honest, I didn’t think anyone would notice

when I view a document and import images , they will be appended at the end. When I import a pdf it is put into a new document.

This was the behavior in Paperwork 1.3 and it will be in Paperwork 2.0. We can discuss it later in a separate topic or Gitlab ticket if you want.

Note that it’s not just a limitation of the backend. PDFs are supposed to be whole documents, so it makes sense to add them to the work directory as separate documents.

As paper.x.jpg is overlaying the pdf page, this page is treated like an image. I can also drop it to another place. But this is not giving the result the user expects, because the page x will be displayed twice, while another page disappears.
(…)
Drag and Drop for PDFs as indicated above is possible in the GUI, but no change in the backend.

It should work now.

When drag-and-dropping a page to another document, so that the current document becomes empty, it does not immediately disappear from the thumbnail bar. The folder on the disk seems to be deleted, but the thumbnail only disappears a few minutes later.
(…)
When importing or scanning, the newly created document appears in the thumbnail bar only after the OCR is finished. I think it should appear immediately.

I’ll have a look. But basically it seems there are a bunch of operations where the document list should be reloaded sooner.

Tesseract 4.11 (…) But I observed that it consumes a lot of CPU

Well, that’s OCR for you …
More seriously, there is nothing I can do about that, except keeping Tesseract up-to-date and hoping they optimize things later.

apparently does not process the files while you view the document that is being processed.

I doubt it’s what happened. Viewing the document does not lock it in any way.
However, another background task may have been preventing the OCR task from running (those kinds of background tasks are queued, and not all background tasks are displayed yet).
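
As a tiny worked example of why a queued task can look stalled (the task names and durations are made up for illustration): if tasks run strictly one at a time in queue order, a task only starts once everything queued before it has finished.

```python
def start_times(tasks):
    """Given (name, duration_in_seconds) pairs that run one at a time,
    in order, return the time at which each task starts."""
    clock = 0
    starts = {}
    for name, duration in tasks:
        starts[name] = clock
        clock += duration
    return starts


print(start_times([("hidden task", 120), ("redo OCR", 30)]))
# {'hidden task': 0, 'redo OCR': 120}
```

Here the OCR job sits idle for two minutes regardless of which document is being viewed.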

AForster · May 2, 2020, 7:50am

I’ll have a look to speed it up. In the meantime, what kind of CPU do you have ?

I have an amd64 A8-6600K, 4 cores. But I am using it in VirtualBox, so the performance is reduced, although I give the VM all 4 cores and 4GB of memory.
Nevertheless, it seems that you have improved the performance very well. Importing PDFs worked with autocrop. However, scanning and importing jpegs ends in an exception; I filed a bug report.

This was the behavior in Paperwork 1.3 and it will be in Paperwork 2.0. We can discuss it later in a separate topic or Gitlab ticket if you want.

OK, that is accepted for the moment, as it was the behaviour in 1.3

Yes, great feature!

So, we have an overall great improvement and I like Paperwork 2.0. I am inclined to use it for my production environment. I am going to do further testing as soon as scanning works again (is it related to my scanner??)

Kind regards,

Andreas

jflesch · May 2, 2020, 9:24am

Nevertheless, it seems that you have increased the performance very well. Importing pdf worked with autocrop.

Hm, actually, PDFs are not cropped at all when imported. And on scanned images, there is no autocropping either: the scanner calibration is used instead.
The only time where there is autocropping is to preselect cropping area:

when you calibrate the scanner in the settings
when you edit a page and enable cropping

However, scanning and importing jpegs ends in an exception; I filed a bug report.

Yep, got it. Sorry, I was so focused on page order management in the PDF files that I didn’t see that I broke image document creation (image import is broken too).

Fix is being built and uploaded to the Flatpak repository.

AForster · May 2, 2020, 4:34pm

Hi Jerome,

I have observed two more bugs in my environment:
After cropping, the OCR is not redone, so that the words are highlighted on the wrong positions.
And when scanning, the background tasks get stuck with a task called “Scanning page 1” or similar, and the OCR is done only on the first page.

Kind regards,
Andreas

jflesch · May 3, 2020, 9:40am

After cropping, the OCR is not redone, so that the words are highlighted on the wrong positions.

Good catch. Fixed.

And when scanning, the background tasks get stuck with a task called “Scanning page 1” or similar, and the OCR is done only on the first page.

When a background task seems to be stuck, it’s usually that it ended on an unexpected error (uncaught exception).
Later I will add a popup when an unexpected error happens; that should make things clearer.
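
For illustration only, here is a generic Python sketch of how an exception in a background task can go unnoticed, and how a completion callback can surface it. It uses the standard `concurrent.futures` module rather than Paperwork's own task system; `scan_page`, `report_failure` and the error message are invented.

```python
from concurrent.futures import ThreadPoolExecutor

errors = []


def report_failure(future):
    """Record the exception of a finished task instead of losing it."""
    exc = future.exception()
    if exc is not None:
        errors.append(repr(exc))


def scan_page():
    # Made-up failure standing in for any unexpected error.
    raise RuntimeError("scanner stopped responding")


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(scan_page)
    # Without this callback (or a call to future.result()), the error
    # stays inside the future and the task just looks unfinished.
    future.add_done_callback(report_failure)

print(errors)  # ["RuntimeError('scanner stopped responding')"]
```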

Anyway, I’ll have a look. Batch scanning is something I haven’t tested much yet.

jflesch · May 3, 2020, 11:44am

And when scanning, the background tasks get stuck with a task called “Scanning page 1” or similar

Ok, this one was actually a timing issue :). It’s fixed.

mstein · May 17, 2020, 10:40pm

Hello Jerome,

Once more, thanks for your work. My roommate asked me how I organize my documents. So I just recommended Paperwork.

GUI detail: in the properties menu there is an edit field below the label “Additional keywords”. It is fully functional but with my settings (default I guess?) the borders of the edit field are not visible and has the same color as the background − so it is ‘invisible’. As soon as I type the font is visible but I’m missing a white rectangle on a gray ground maybe with some borders.
Bug: I’m mixing scans (jpg) and pdf: An insurance company sent me both a paper original and a pdf. I wrote notes on the first page and scanned it. I’m lazy and don’t want to scan all 20 pages, as they sent me the same as pdf, but without my notes of course.
1. So I import the pdf (DocumentA). That works.
2. Scan one page (DocumentB)
3. Move the page of DocumentB into the pdf doc (DocumentA). Now they are “merged”. There are 20 pages pdf + 1 page jpg. Works.
4. Restart paperwork. Works.
5. Move the 1 jpg to the front so that I have: 1 page jpg + 20 pages pdf. When I restart paperwork now, the document becomes unselectable, no page appears, only the correct number of pages is shown. All other documents run fine, only this jpeg + pdf combination is broken.

If I described it too complicated or if it is not reproducible on the first try just leave me a note. If you like I’ll file an issue and boil it down to a minimal working example with a folder where I have the bug.

I updated this evening (develop branch). Not sure if this bug has something to do with the folder import of mixed filetypes discussed above.

Kind Regards, Manuel

jflesch · May 18, 2020, 8:46pm

there is an edit field below the label “Additional keywords”. It is fully functional

Yes it is, but I still need to fine-tune the look-and-feel of it I guess.

When I restart paperwork now the document becomes unselectable, no page appears, only the correct number of pages is shown.

I’ll have a look, but later. Right now I’m working on another personal project of mine (I need a break from Paperwork for a few days).