Change label_guessing_max_time and related parameters

eric · March 31, 2023, 2:07pm

Hello,

I’m getting these warnings in my log:

[WARNING] [paperwork_backend.guesswork.label.sklearn] Training is taking too long (10006ms > 10000ms). Interrupting

My current config is this:

$ paperwork-gtk config show
bug_report_protocol = https
bug_report_server = openpaper.work
check_for_update = False
label_guessing_batch_size = 200
label_guessing_max_doc_backlog = 100
label_guessing_max_time = 10
label_guessing_max_words = 15000
label_guessing_min_features = 10
last_update_found = 2.1.2
log_files = stderr
log_format = [%(levelname)-6s] [%(name)-30s] %(message)s
log_level = warning
main_window_size = [1652, 1013]
ocr_langs = ['deu', 'eng', 'hrv']
pageadd_active_source = feeder
pageadd_sources = ['flatbed', 'feeder']
scanner_calibration = None
scanner_dev_id = libinsane:sane:hpaio:/usb/Officejet_6700?serial=***
scanner_mode = Color
scanner_resolution = 300
scanner_source_id = None
send_statistics = True
settings_scanner_name = HP Officejet 6700
statistics_last_run = 2023-03-28
statistics_protocol = https
statistics_server = openpaper.work
sync_on_start = True
update_last_run = 1970-01-01
update_protocol = https
update_server = openpaper.work
uuid = ***
workdir = file:///home/***/papers

Yes, I have a lot of documents in the archive.

I think I should change label_guessing_max_time to a higher value (e.g. 15 s / 15000 ms). How can this be done? The syntax of paperwork-gtk config put is unclear to me: what should I use as the type?
Should I also change other label_guessing_* values?

Thank you in advance.
Eric.

jflesch · April 13, 2023, 5:41pm

Hello,

(sorry for the late reply)

You can increase the training time if you want. It is not required but it will make label guessing more accurate.

It’s actually not related to the number of documents you have. There is a limit on the number of documents examined anyway: label_guessing_max_doc_backlog = 100 means that for each label, Paperwork will only train on the last 100 documents you have. The most likely explanation for the warning is simply that your computer is too slow (CPU or disk).

Regarding the configuration, the thing that may be confusing is that values in the configuration have types (str, int, etc). So if you want to set a value, you have to specify the setting name, its type and the value. For example:

paperwork-cli config put label_guessing_max_doc_backlog int 1000
paperwork-cli config put label_guessing_max_time int 120
paperwork-cli config put log_level str error

eric · May 7, 2023, 12:12pm

Hello,
(no problem about the delay, I’m no better)

In the meantime, I think I have located the bottleneck: it was the hard disk. After migrating to an SSD, the warnings no longer come up.

I have also successfully adjusted the label_guessing_max_time and label_guessing_max_doc_backlog.

It would be helpful for me to be able to look up the type of each value. Maybe with paperwork-gtk config show -?, which currently returns an error and so is apparently unused, or with paperwork-gtk config list_types.

But I am still failing to set the log level.
I have now set this in the configuration:

log_files = /home/eric/.log/paperwork/paperwork-gtk.log
log_format = [%(levelname)-6s] [%(name)-30s] %(message)s
log_level = warning

The file paperwork-gtk.log was created in that directory and is being written to. However, [INFO] messages are still written.

When starting via CLI I see

$ paperwork-gtk 
[INFO ] [openpaperwork_core.config ] Loading configuration for paperwork2
[INFO ] [openpaperwork_core.config.backend.configparser] Loading configuration 'file:///home/eric/.config/paperwork2.conf' ...

and in /home/eric/.config/paperwork2.conf

...
[logging:paperwork-cli]
level = str:warning

[logging:paperwork-gtk]
files = str:/home/eric/.log/paperwork/paperwork-gtk.log

The level seems to have been set correctly. What am I doing wrong?

jflesch · May 10, 2023, 9:52pm

Good point about the types. Fixed in the branch ‘develop’ (so it will be in the next release). I’ve simply added the types to the output of paperwork-cli config show (the output of paperwork-json config show remains unchanged, however).

You’ve set it for paperwork-cli, not paperwork-gtk. Most settings are shared between paperwork-gtk, paperwork-cli and paperwork-json, but the logging settings are not.
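To see which command a logging setting is actually stored for, the configuration file can be inspected directly. The following is only a minimal sketch: it assumes the file layout matches the excerpt quoted earlier in this thread (sections named logging:<command>, values written as type:value). The function name logging_settings is made up for this illustration and is not part of Paperwork.

```python
import configparser

# Excerpt of paperwork2.conf as quoted in this thread (other sections omitted).
SAMPLE = """\
[logging:paperwork-cli]
level = str:warning

[logging:paperwork-gtk]
files = str:/home/eric/.log/paperwork/paperwork-gtk.log
"""


def logging_settings(conf_text):
    """Return {command: {option: (type, value)}} for each [logging:<command>] section.

    Hypothetical helper; the "type:value" layout is inferred from the excerpt only.
    """
    # Only "=" separates key and value here; ":" is part of the value.
    # Interpolation is disabled because log formats contain "%(...)s".
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(conf_text)
    result = {}
    for section in parser.sections():
        prefix, _, command = section.partition(":")
        if prefix != "logging":
            continue
        options = {}
        for option, raw in parser.items(section):
            value_type, _, value = raw.partition(":")
            options[option] = (value_type, value)
        result[command] = options
    return result


settings = logging_settings(SAMPLE)
for command in ("paperwork-gtk", "paperwork-cli"):
    level = settings.get(command, {}).get("level")
    print(command, "level:", level if level else "not set in this file")
```

With the excerpt above, the level shows up under paperwork-cli only, which is why paperwork-gtk keeps using its default level.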

The root problem is simple: most users are likely to want to share scanner settings (for instance) between paperwork-gtk and paperwork-cli. But at the same time, by default, I want the log level set to INFO for paperwork-gtk (in case it segfaults, so the user can easily get the last logs). But I cannot enable INFO logs for paperwork-cli by default (it would flood the output).

In other words, while paperwork-cli config put scanner_dev_id str some_id is the same as paperwork-gtk config put scanner_dev_id str some_id, paperwork-cli config put log_level str warning is not the same as paperwork-gtk config put log_level str warning.

Yeah, I know, this is a bit messy and confusing, sorry :/. That’s something that I will have to improve later (if you have a better idea, I’m interested)

Anyway, for now, your safest bet when modifying Paperwork configuration is to always use the command for which you want to modify the settings. For instance here:

paperwork-gtk config put log_level str warning

eric · May 15, 2023, 6:02pm

I think I have now understood the principle. After adjusting log_files and log_level for all three commands, it behaves as desired.

If some settings are shared between all three commands and all others apply only to the respective command, I would suggest marking the shared settings in the output of “config show”, e.g. with a “+” at the beginning of the line. (The plus sign “+” does not seem to be used in any setting; if it is, use another sign.)
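For anyone wanting to do the same, here is a small sketch that prints the command line for each of the three programs so they can be reviewed and then run by hand. It only builds strings and does not call Paperwork. The helper name config_put_commands is made up for this post, and it assumes that all three programs accept the same "config put <name> <type> <value>" syntax shown above.

```python
import shlex

# The three front ends mentioned in this thread.
COMMANDS = ("paperwork-gtk", "paperwork-cli", "paperwork-json")


def config_put_commands(key, value_type, value, commands=COMMANDS):
    """Build one "<command> config put <key> <type> <value>" argument list per command.

    Hypothetical helper: it only assembles the arguments, it does not run anything.
    """
    return [[cmd, "config", "put", key, value_type, str(value)] for cmd in commands]


# Print the commands so they can be checked before being run manually.
for argv in config_put_commands("log_level", "str", "warning"):
    print(shlex.join(argv))
```

The same helper can be used for log_files by passing the path as the value.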

I have a completely different request (if it makes sense, I would be happy to open a new topic for it): would it be possible to make use of multiple cores (SMP) with reasonable effort? Documents with many pages in particular (e.g. 20 or more) block the front end for a very long time during import, while only one core of eight really has anything to do.