I think I should change label_guessing_max_time to a higher value (e.g. 15 s / 15000 ms). How can this be done? The syntax of paperwork-gtk config put is unclear to me: what should I use as the type?
Should I also change other label_guessing_* values?
You can increase the training time if you want. It is not required but it will make label guessing more accurate.
It’s actually not related to the number of documents you have. There is a limit on the number of documents examined anyway: label_guessing_max_doc_backlog = 100 means that for each label, Paperwork will only train on the last 100 documents you have. The most likely explanation for the warning is just that your computer is too slow (CPU or disk).
Regarding the configuration, the thing that may be confusing is that values in the configuration have types (str, int, etc). So if you want to set a value, you have to specify the setting name, its type and the value. For example:
paperwork-cli config put label_guessing_max_doc_backlog int 1000
paperwork-cli config put label_guessing_max_time int 120
paperwork-cli config put log_level str error
Hello,
(No problem about the delays, I’m no better.)
In the meantime, I think I have been able to locate the bottleneck. It was the hard disk system. After migrating to an SSD, the warnings no longer come up.
I have also successfully adjusted the label_guessing_max_time and label_guessing_max_doc_backlog.
It would be helpful to be able to look up the type of a value, maybe with paperwork-gtk config show -? (which currently returns an error, so that option is unused), or with paperwork-gtk config list_types.
But I am still failing to set the log level.
I have now set this in the configuration:
Good point. Fixed in the branch ‘develop’ (–> next release). I’ve simply added the types to the output of paperwork-cli config show (output of paperwork-json config show remains unchanged however).
You’ve set it for paperwork-cli, not paperwork-gtk. Most settings are shared between paperwork-gtk, paperwork-cli and paperwork-json, but the logging settings are not.
The root problem is simple: most users are likely to want to share scanner settings (for instance) between paperwork-gtk and paperwork-cli. But at the same time, by default, I want the log level set to INFO for paperwork-gtk (in case it segfaults, so the user can easily get the last logs). But I cannot enable INFO logs for paperwork-cli by default (it would flood the output).
In other words, while paperwork-cli config put scanner_dev_id some_id is the same as paperwork-gtk config put scanner_dev_id some_id, paperwork-cli config put log_level warning is not the same as paperwork-gtk config put log_level warning.
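One way to see this difference is to compare the log-related settings reported by each command. The following is only a sketch: the helper name show_log_settings is made up for illustration, and it assumes that setting names such as log_level appear verbatim in the output of config show, which may not match the exact output format.

```shell
#!/bin/sh
# Print the log-related lines of `config show` for one Paperwork command.
# Assumption: setting names such as log_level appear verbatim in the output.
show_log_settings() {
    "$1" config show | grep 'log_'
}

for cmd in paperwork-gtk paperwork-cli paperwork-json; do
    # Skip commands that are not installed on this machine.
    command -v "$cmd" >/dev/null 2>&1 || continue
    echo "== $cmd =="
    show_log_settings "$cmd"
done
```

If the log settings really are per-command, the three sections can differ, while shared settings such as scanner_dev_id should read the same everywhere.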
Yeah, I know, this is a bit messy and confusing, sorry :/. That’s something that I will have to improve later (if you have a better idea, I’m interested)
Anyway, for now, your safest bet when modifying Paperwork configuration is to always use the command for which you want to modify the settings. For instance here:
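A minimal sketch of this advice, assuming the config put <name> <type> <value> syntax shown earlier works the same way for paperwork-gtk and paperwork-json as it does for paperwork-cli. The helper name set_log_level and the DRY_RUN variable are made up for illustration; by default the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Apply the same log_level to each Paperwork command, one command at a time,
# because logging settings are not shared between them.
# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 executes them.
set_log_level() {
    level="$1"
    for cmd in paperwork-gtk paperwork-cli paperwork-json; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$cmd config put log_level str $level"
        else
            "$cmd" config put log_level str "$level"
        fi
    done
}

set_log_level warning
```

To change the level for a single front end only, run just the matching line, e.g. the paperwork-gtk one.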
I think I have now understood the principle. After adjusting log_files and log_level in all three commands, it behaves as desired.
If some settings are shared between all three commands and all others apply only to the respective command, I would suggest marking the shared settings in the output of “config show”, e.g. with a “+” at the beginning of the line. (The plus sign “+” does not appear to be used in any setting; if it is, use another sign.)
I have a completely different request (if it makes sense, I would be happy to open a new topic for it): would it be possible to use SMP with reasonable effort? Documents with many pages in particular (e.g. 20 or more) block the front end for a very long time during import, while only one core of eight really has anything to do.