Encoding Error when running train-curve

lnatprodigy · July 4, 2021, 10:27am

Hi,

I wanted to try the train-curve recipe, but I get the following error when I run it:

Traceback (most recent call last):
File "C:\Users\x\AppData\Local\Programs\Python\Python37\Lib\runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "C:\Users\x\AppData\Local\Programs\Python\Python37\Lib\runpy.py", line 85, in run_code
exec(code, run_globals)
File "C:\Users\x.virtualenvs\prodigy_nightly_v3-x0wIMKXr\lib\site-packages\prodigy_main.py", line 54, in
controller = recipe(args, use_plac=True)
File "cython_src\prodigy\core.pyx", line 329, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File "C:\Users\x.virtualenvs\prodigy_nightly_v3-x0wIMKXr\lib\site-packages\plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "C:\Users\x.virtualenvs\prodigy_nightly_v3-x0wIMKXr\lib\site-packages\plac_core.py", line 232, in consume
return cmd, self.func((args + varargs + extraopts), **kwargs)
File "C:\Users\x.virtualenvs\prodigy_nightly_v3-x0wIMKXr\lib\site-packages\prodigy\recipes\train.py", line 331, in train_curve
config, gpu_id=gpu_id, overrides=overrides, silent=True
File "C:\Users\x.virtualenvs\prodigy_nightly_v3-x0wIMKXr\lib\site-packages\prodigy\recipes\train.py", line 172, in _train
spacy_train(nlp, output_path, use_gpu=gpu_id, stdout=stdout)
File "C:\Users\x.virtualenvs\prodigy_nightly_v3-x0wIMKXr\lib\site-packages\spacy\training\loop.py", line 91, in train
stdout.write(msg.info(f"Pipeline: {nlp.pipe_names}") + "\n")
File "C:\Users\x.virtualenvs\prodigy_nightly_v3-x0wIMKXr\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2139' in position 0: character maps to <undefined>

I'm mostly posting this to report the issue. I'm not getting this error anywhere else (training, eval, etc. all run fine) with the exact same dataset, so I think it might be a bug, or at least something that is handled more gracefully in other places.

Since it contains proprietary data, I won't be able to provide the dataset.

ines · July 5, 2021, 12:55am

Hi! Thanks for the report. From looking at the traceback, the problem seems to come down to an encoding issue when printing the formatted logging messages to stdout, so it doesn't seem to be related to the data. It's surprising, though, that it only happens in train-curve and not anywhere else, because train-curve really just calls into train under the hood.

What operating system and terminal are you using? And does setting the environment variable WASABI_LOG_FRIENDLY=1 help? It disables any pretty-printing, icons and colours in the terminal output.

lnatprodigy · July 5, 2021, 4:48am

I'm on Windows 10 and using PowerShell. Setting $env:WASABI_LOG_FRIENDLY=1 doesn't seem to help.
I also tried it in the "normal" Windows command line, but I get the same result.

ines · July 5, 2021, 5:42am

Thanks for the update! I wonder if this is related to this line we use in the train-curve workflow to write the output to devnull so the detailed logs aren't showing up for every run:

stdout = sys.stdout if not silent else open(os.devnull, "w")

Maybe there's something about writing to os.devnull on Windows that's different and needs special handling.

Could you check whether one of these (or both) fail for you with a similar error? If so, this is likely what's going on here.

import os
from wasabi import msg

stdout = open(os.devnull, "w")

# Test 1: just write something
stdout.write("hello world")
# Test 2: write formatted output
stdout.write(msg.info("hello world"))
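As background for the tests above: a file opened in text mode without an explicit encoding takes its encoding from the platform or locale settings, not from the terminal. The following is a minimal diagnostic sketch using only the standard library; the variable names are illustrative and not taken from Prodigy or spaCy. It reports which encoding the devnull handle ends up with and whether the info icon from the traceback (U+2139) can be represented in it.

```python
import locale
import os

# Open the null device the same way as the line quoted above (no explicit encoding).
with open(os.devnull, "w") as default_handle:
    default_encoding = default_handle.encoding

print("encoding of the devnull handle:", default_encoding)
print("locale preferred encoding:", locale.getpreferredencoding(False))

# Check whether the info icon from the traceback can be represented in that encoding.
try:
    "\u2139".encode(default_encoding)
    icon_encodable = True
except UnicodeEncodeError:
    icon_encodable = False

print("can encode U+2139:", icon_encodable)
```

If the last line prints False, any write of that icon to the handle would be expected to fail with a UnicodeEncodeError like the one in the traceback.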

lnatprodigy · July 5, 2021, 5:47am

First one works (prints 11), second one I can't run because I don't have msg. Is that a package I need to install?

ines · July 5, 2021, 5:49am

Ah, that was supposed to be wasabi.msg, sorry! I'll edit my comment above.

lnatprodigy · July 5, 2021, 5:55am

Since wasabi.msg.info seems to return None, I can only run the line without stdout.write, which works and prints "? hello world"

ines · July 5, 2021, 6:05am

Sorry, I should have written this more carefully. This is the correct test:

import os
from wasabi import Printer

stdout = open(os.devnull, "w")

msg = Printer(no_print=True)
stdout.write(msg.info("hello world"))

lnatprodigy · July 5, 2021, 6:15am

Yep, that reproduced the error.

Changing stdout to

stdout = open(os.devnull, "w", encoding="utf-8")

appears to resolve the issue for me.
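To illustrate why the explicit encoding helps, here is a small sketch that should behave the same on any operating system. It assumes nothing about Prodigy internals: the cp1252 stream is simulated with an in-memory io.TextIOWrapper, and can_write_icon is a hypothetical helper written for this example.

```python
import io
import os

INFO_ICON = "\u2139"  # the character named in the traceback


def can_write_icon(stream):
    """Return True if the text stream accepts the icon, False on an encoding error."""
    try:
        stream.write(f"{INFO_ICON} hello world\n")
        stream.flush()
        return True
    except UnicodeEncodeError:
        return False


# Simulate a Windows-style stream: an in-memory text wrapper that encodes as cp1252.
cp1252_stream = io.TextIOWrapper(io.BytesIO(), encoding="cp1252")
# The change described above: the null device opened with an explicit encoding.
utf8_devnull = open(os.devnull, "w", encoding="utf-8")

cp1252_result = can_write_icon(cp1252_stream)
utf8_result = can_write_icon(utf8_devnull)
utf8_devnull.close()

print("cp1252 stream accepts the icon:", cp1252_result)
print("utf-8 devnull accepts the icon:", utf8_result)
```

Passing errors="replace" to open() would be another way to avoid the exception, but since the output is discarded anyway, an explicit UTF-8 encoding is the simpler choice.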

ines · July 5, 2021, 7:09am

Ahh of course, that makes a lot of sense! Thanks for helping with debugging. I just fixed this for the next nightly release.

Edit: Fixed in v1.11.0a10: ✨ Prodigy nightly: spaCy v3 support, UI for overlapping spans & more - #85 by ines

ines · July 6, 2021, 12:25am

A post was split to a new topic: Misaligned entities only in train-curve
