Encoding Error when running train-curve

Hi,

I wanted to try the train-curve recipe, but I get the following error when I run it:

Traceback (most recent call last):
File "C:\Users\x\AppData\Local\Programs\Python\Python37\Lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\x\AppData\Local\Programs\Python\Python37\Lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\x.virtualenvs\prodigy_nightly_v3-x0wIMKXr\lib\site-packages\prodigy\__main__.py", line 54, in <module>
controller = recipe(args, use_plac=True)
File "cython_src\prodigy\core.pyx", line 329, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File "C:\Users\x.virtualenvs\prodigy_nightly_v3-x0wIMKXr\lib\site-packages\plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "C:\Users\x.virtualenvs\prodigy_nightly_v3-x0wIMKXr\lib\site-packages\plac_core.py", line 232, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "C:\Users\x.virtualenvs\prodigy_nightly_v3-x0wIMKXr\lib\site-packages\prodigy\recipes\train.py", line 331, in train_curve
config, gpu_id=gpu_id, overrides=overrides, silent=True
File "C:\Users\x.virtualenvs\prodigy_nightly_v3-x0wIMKXr\lib\site-packages\prodigy\recipes\train.py", line 172, in _train
spacy_train(nlp, output_path, use_gpu=gpu_id, stdout=stdout)
File "C:\Users\x.virtualenvs\prodigy_nightly_v3-x0wIMKXr\lib\site-packages\spacy\training\loop.py", line 91, in train
stdout.write(msg.info(f"Pipeline: {nlp.pipe_names}") + "\n")
File "C:\Users\x.virtualenvs\prodigy_nightly_v3-x0wIMKXr\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2139' in position 0: character maps to <undefined>

I'm mostly posting this to report the issue. I don't get this error anywhere else (training, eval, etc. all run fine) with the exact same dataset, so I think it might be a bug, or at least something that is handled more gracefully in other places.

Since it contains proprietary data, I won't be able to provide the dataset.

Hi! Thanks for the report. From looking at the traceback, it seems like the problem comes down to an encoding issue when printing the formatted logging messages to stdout, so it doesn't seem to be related to the data or anything like that. It's surprising, though, that it only happens in the train-curve and not anywhere else (because train-curve really just calls into train under the hood) :thinking:

What operating system and terminal are you using? And does setting the environment variable WASABI_LOG_FRIENDLY=1 help (which will disable any pretty-printing, icons and colours in the terminal output)?

I'm on Windows 10 and using PowerShell. Setting $env:WASABI_LOG_FRIENDLY=1 doesn't seem to help.
I also tried it in the regular Windows command prompt, but I get the same result.

Thanks for the update! I wonder if this is related to the following line, which we use in the train-curve workflow to write the output to devnull so the detailed logs don't show up for every run:

stdout = sys.stdout if not silent else open(os.devnull, "w")

Maybe there's something about writing to os.devnull in Windows that's different and needs special handling :thinking:
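As an aside, here is a small sketch for checking which encoding a handle opened this way ends up with. It assumes the standard CPython behaviour that a text-mode open() without an encoding argument falls back to a locale-dependent default; the cp1252.py frame in the traceback suggests that default was cp1252 on the reporter's machine. The variable names are only illustrative.

```python
import locale
import os

# Encoding the locale reports as the preferred one for text files.
default_encoding = locale.getpreferredencoding(False)

# A handle opened like the one in the line above, without an encoding argument.
with open(os.devnull, "w") as handle:
    handle_encoding = handle.encoding

print("locale default:", default_encoding)
print("devnull handle:", handle_encoding)
```

If the second line prints cp1252 (or another legacy code page), any character outside that code page will fail on write, no matter which terminal is in use.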

Could you check whether one of these (or both) fail for you with a similar error? If so, this is likely what's going on here.

import os
from wasabi import msg

stdout = open(os.devnull, "w")

# Test 1: just write something
stdout.write("hello world")
# Test 2: write formatted output
stdout.write(msg.info("hello world"))

First one works (prints 11), second one I can't run because I don't have msg. Is that a package I need to install?

Ah, that was supposed to be wasabi.msg, sorry! I'll edit my comment above.

Since wasabi.msg.info seems to return None, I can only run the line without stdout.write, which works and prints "? hello world"

Sorry, I should have written this more carefully. This is the correct test:

import os
from wasabi import Printer

stdout = open(os.devnull, "w")

msg = Printer(no_print=True)
stdout.write(msg.info("hello world"))

Yep, that recreated the error :slight_smile:

Changing stdout to

stdout = open(os.devnull, "w", encoding="utf-8")

appears to resolve the issue for me.
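For anyone who wants to see the difference without installing wasabi, the sketch below writes the character from the traceback ('\u2139', the info icon) to os.devnull with two explicit encodings. The helper name can_write is made up for this illustration. It should behave the same on any platform, because the encoding is passed explicitly instead of being taken from the locale.

```python
import os

INFO_ICON = "\u2139"  # the character named in the UnicodeEncodeError above


def can_write(text, encoding):
    """Return True if `text` can be written to os.devnull using `encoding`."""
    try:
        with open(os.devnull, "w", encoding=encoding) as handle:
            handle.write(text)
        return True
    except UnicodeEncodeError:
        # The text wrapper encodes on write, so an unmappable character fails here.
        return False


print("cp1252:", can_write(INFO_ICON + " hello world", "cp1252"))
print("utf-8:", can_write(INFO_ICON + " hello world", "utf-8"))
```

The cp1252 case fails because that code page has no mapping for U+2139, while UTF-8 can encode every Unicode character.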

Ahh of course, that makes a lot of sense! Thanks for helping with debugging. Just fixed this for the next nightly release :+1:
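The actual patch isn't shown in this thread. As a rough sketch only, a fix along the lines suggested above could look like the following, where get_stdout is a hypothetical helper name and not Prodigy's real code:

```python
import os
import sys


def get_stdout(silent):
    """Hypothetical helper: the real stdout, or a UTF-8 sink when silent."""
    if not silent:
        return sys.stdout
    # Explicit encoding so the sink accepts any Unicode character,
    # regardless of the platform's locale default.
    return open(os.devnull, "w", encoding="utf-8")


stdout = get_stdout(silent=True)
try:
    # Write a message that starts with the icon from the traceback.
    written = stdout.write("\u2139 hello world\n")
finally:
    stdout.close()

print(written)
```

Passing the encoding explicitly keeps the silent path independent of the machine's locale, which is where the Windows-only failure came from.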

Edit: Fixed in v1.11.0a10 (see "✨ Prodigy nightly: spaCy v3 support, UI for overlapping spans & more", post #85 by ines).
