Encoding Error while running NER Correct on v1.13 on Windows 11

Hi,

I'm experiencing an issue similar to the one discussed in the thread "Encoding Error when running train-curve".

I get the error below when running ner.correct with Prodigy v1.13 on Windows 11.

================================= Traceback =================================

File "C:\Users\fabio\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
File "C:\Users\fabio\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
File "C:\Users\fabio\.virtualenvs\dsar-wm91-sfZ\lib\site-packages\prodigy\__main__.py", line 63, in <module>
    controller = recipe(*args, use_plac=True)
File "C:\Users\fabio\.virtualenvs\dsar-wm91-sfZ\lib\site-packages\plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
File "C:\Users\fabio\.virtualenvs\dsar-wm91-sfZ\lib\site-packages\plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)

============================== Warning message ==============================

✘ 'charmap' codec can't decode byte 0x9d in position 6990: character
maps to <undefined>

I ran the test suggested in that thread and got the same result: specifying encoding="utf-8" didn't raise the error, but without it I get

UnicodeEncodeError: 'charmap' codec can't encode character '\u2139' in position 9: character maps to <undefined>
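
For readers landing here, both messages can be reproduced outside Prodigy. The sketch below is only an illustration with made-up sample strings (not data from this thread): it assumes the input contains a right double quotation mark (U+201D), whose UTF-8 encoding ends in byte 0x9d, a byte that Python's cp1252 codec leaves undefined.

```python
# Illustration only: the sample strings are hypothetical.
text = "He said \u201chello\u201d"   # contains U+201C and U+201D
raw = text.encode("utf-8")           # U+201D becomes the bytes E2 80 9D


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        return f"{type(err).__name__}: {err}"


def try_encode(value: str, encoding: str) -> str:
    """Return repr of the encoded bytes, or the error message if encoding fails."""
    try:
        return repr(value.encode(encoding))
    except UnicodeEncodeError as err:
        return f"{type(err).__name__}: {err}"


decode_utf8 = try_decode(raw, "utf-8")              # succeeds
decode_cp1252 = try_decode(raw, "cp1252")           # fails on byte 0x9d
encode_cp1252 = try_encode("\u2139 info", "cp1252")  # U+2139 is not in cp1252

# ascii() keeps the output printable even on a cp1252 console.
print(ascii(decode_utf8))
print(decode_cp1252)
print(encode_cp1252)
```

The decode failure corresponds to reading UTF-8 bytes as cp1252, and the encode failure to writing a character that cp1252 cannot represent.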

Is there something I can do on my end to work around this issue?

Thanks
Fabio

I upgraded to Prodigy 1.13.1, but the issue remains.

Hi Fabio,

Sorry for the late reply. We haven't been able to reproduce this locally, possibly because the Prodigy development team doesn't use Windows machines. However, you're not the first user with an issue that might be specific to Windows, so we're exploring a way to work on Windows bugfixes internally.

Will report back soon on this!

Thanks Vincent,

Could it be a case of not explicitly specifying encoding="utf-8" while opening the jsonl file?
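
To make the question concrete, here is a minimal sketch of what an explicit encoding looks like when reading a JSONL file. `read_jsonl` is a hypothetical helper written for this illustration; it is not Prodigy's loader, and the thread does not confirm how Prodigy opens the file.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path, encoding="utf-8"):
    """Hypothetical helper: parse one JSON object per non-empty line."""
    with open(path, encoding=encoding) as f:
        return [json.loads(line) for line in f if line.strip()]


# Write a small UTF-8 sample file so the sketch is self-contained.
sample = Path(tempfile.mkdtemp()) / "sample.jsonl"
sample.write_text('{"text": "He said \u201chello\u201d"}\n', encoding="utf-8")

# Explicit UTF-8 works regardless of the system's preferred encoding.
records = read_jsonl(sample)
print(ascii(records[0]["text"]))

# Forcing cp1252 imitates what open() would do by default on a system
# whose preferred encoding is cp1252.
try:
    read_jsonl(sample, encoding="cp1252")
    cp1252_failed = False
except UnicodeDecodeError as err:
    cp1252_failed = True
    print(err)
```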

PEP 686 – Make UTF-8 mode default (peps.python.org)

Initially I converted my file from UTF-8 to cp1252, and that worked, but some characters would have been lost.
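
As a rough way to see which characters such a conversion cannot keep, the sketch below lists the characters of a string that have no cp1252 equivalent. The sample text is made up, and it assumes the conversion replaced unmappable characters (the thread does not say which tool or error handling was used).

```python
def unrepresentable(text, encoding="cp1252"):
    """Return the characters in `text` that cannot be encoded in `encoding`."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# Hypothetical sample: U+2139 is not in cp1252, the curly quotes are.
original = "\u2139 Note: \u201cquoted\u201d text"

# errors="replace" substitutes "?" for anything cp1252 cannot represent.
lossy = original.encode("cp1252", errors="replace").decode("cp1252")

print(ascii(unrepresentable(original)))
print(ascii(lossy))
```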

I wasn't happy with that, so I did some more digging. I'm using Python 3.8.5 on my system, and locale.getpreferredencoding() returns 'cp1252'.
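
For anyone who wants to run the same check, a short snippet along these lines prints the preferred encoding and whether Python's UTF-8 mode is active. Both calls are standard library; the variable names are just for illustration.

```python
import locale
import sys

# The encoding open() falls back to when no encoding= is given.
preferred = locale.getpreferredencoding(False)

# 1 when Python runs in UTF-8 mode (-X utf8 or PYTHONUTF8=1), else 0.
utf8_mode = sys.flags.utf8_mode

print("preferred encoding:", preferred)
print("UTF-8 mode enabled:", bool(utf8_mode))
```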

Based on these:

I tried setting the default encoding used by Python to UTF-8. It worked when I specified -X utf8 after python, but not when I tried to set the environment variable using

set PYTHONUTF8=1 

The call to ner.correct below worked for me:

python -X utf8 -m prodigy ner.correct [..other arguments]
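
A small sketch for checking whether UTF-8 mode actually reaches a Python process, using only the standard library. `utf8_mode_of_child` is a hypothetical helper name. It does not explain why `set PYTHONUTF8=1` had no effect here; one thing that may be worth checking (an assumption, since the shell isn't mentioned) is that `set VAR=value` is cmd.exe syntax, while PowerShell uses `$env:VAR = "value"`.

```python
import os
import subprocess
import sys

# Code run in the child interpreter: report whether UTF-8 mode is on.
PROBE = "import sys; print(sys.flags.utf8_mode)"


def utf8_mode_of_child(extra_args=(), extra_env=None):
    """Start a child Python and return its sys.flags.utf8_mode (0 or 1)."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a state without the variable
    if extra_env:
        env.update(extra_env)
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", PROBE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


with_flag = utf8_mode_of_child(extra_args=["-X", "utf8"])
with_env = utf8_mode_of_child(extra_env={"PYTHONUTF8": "1"})

print("-X utf8      ->", with_flag)
print("PYTHONUTF8=1 ->", with_env)
```

If the second line reports 0 on a given setup, the variable most likely never reached the Python process.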

I'm not sure whether this might break something else, but for the purpose of just running ner.correct it works. I'll update the thread if I run into any issues.

Thanks
Fabio

Hi Fabio,

that seems like a solid solution for now, nice find!

It's still hard for us to fully replicate the issue, but I can't think of a reason why your approach wouldn't work for the short-medium term.

In the long term we're interested in replacing plac with radicli in our codebase. I'd imagine that once that change is in, this issue should also go away, but it's something we'll keep in the back of our minds.

@fabiolus Thank you for this. I had the same issue and error message with input that worked flawlessly in older versions of Prodigy.
Using "python -X utf8" did the trick.
