Encoding Error while running NER Correct on v1.13 on Windows 11

fabiolus · August 30, 2023, 4:53am

Hi,

I'm experiencing an issue similar to the one discussed in the thread "Encoding Error when running train-curve".

I get the error below when running ner.correct with Prodigy v1.13 on Windows 11.

================================= Traceback =================================

File "C:\Users\fabio\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
File "C:\Users\fabio\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
File "C:\Users\fabio\.virtualenvs\dsar-wm91-sfZ\lib\site-packages\prodigy\__main__.py", line 63, in <module>
    controller = recipe(*args, use_plac=True)
File "C:\Users\fabio\.virtualenvs\dsar-wm91-sfZ\lib\site-packages\plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
File "C:\Users\fabio\.virtualenvs\dsar-wm91-sfZ\lib\site-packages\plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)

============================== Warning message ==============================

✘ 'charmap' codec can't decode byte 0x9d in position 6990: character
maps to <undefined>

I ran the test suggested in that thread and got the same result: specifying encoding="utf-8" didn't raise the error, but without it I get a

UnicodeEncodeError: 'charmap' codec can't encode character '\u2139' in position 9: character maps to <undefined>
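For readers who want to reproduce the decode side of this outside Prodigy, here is a minimal sketch. It assumes the input is a UTF-8 JSONL file containing a right double quote (U+201D), whose UTF-8 bytes include 0x9D, a byte that cp1252 leaves undefined. The helper name `read_jsonl` and the sample record are made up for illustration; this is not Prodigy's own loader.

```python
import json
import os
import tempfile


def read_jsonl(path, encoding="utf-8"):
    """Hypothetical helper: parse one JSON object per non-empty line."""
    with open(path, encoding=encoding) as f:
        return [json.loads(line) for line in f if line.strip()]


# Write a tiny JSONL file as UTF-8. U+201D is stored as the bytes E2 80 9D,
# and 0x9D has no mapping in cp1252.
record = {"text": "She said \u201chello\u201d"}
fd, path = tempfile.mkstemp(suffix=".jsonl")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Reading with the Windows legacy code page fails on the 0x9D byte.
try:
    read_jsonl(path, encoding="cp1252")
    cp1252_error = None
except UnicodeDecodeError as err:
    cp1252_error = err

# Reading with an explicit UTF-8 encoding succeeds.
utf8_records = read_jsonl(path, encoding="utf-8")
os.remove(path)

print("cp1252:", cp1252_error)
print("utf-8: ", utf8_records)
```

If the cp1252 read fails and the UTF-8 read succeeds on your machine too, that points at the default encoding used to open the file rather than at the data itself.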

Is there something I can do on my end to work around this issue?

Thanks
Fabio

fabiolus · September 5, 2023, 7:06am

I upgraded to Prodigy 1.13.1, but the issue remains.

koaning · September 6, 2023, 8:21am

Hi Fabio,

sorry for the late reply. We've not been able to reproduce this locally, but that's possibly related to the fact that the Prodigy development team doesn't use Windows machines. However, you're not the first user with an issue that might be specific to Windows, so we're exploring a way to work on Windows bugfixes internally.

Will report back soon on this!

fabiolus · September 11, 2023, 3:59pm

Thanks Vincent,

Could it be a case of not explicitly specifying encoding="utf-8" while opening the jsonl file?

PEP 686 – Make UTF-8 mode default | peps.python.org.

Initially I converted my file from UTF-8 to cp1252, and that worked, but some characters would have been lost.
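To see which characters such a conversion would drop, something like the sketch below can help. The helper `unencodable_chars` and the sample string are hypothetical, chosen only to illustrate the check; the sample includes U+2139, the character from the error message earlier in the thread.

```python
def unencodable_chars(text, encoding="cp1252"):
    """Hypothetical helper: distinct characters in `text` that `encoding` cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


# Sample text: the accented letter and curly quotes exist in cp1252, U+2139 does not.
sample = "caf\u00e9 \u2139 info \u201cquoted\u201d"

lost = unencodable_chars(sample)

# errors="replace" substitutes "?" for anything cp1252 cannot store,
# which is the kind of silent loss a forced conversion causes.
lossy = sample.encode("cp1252", errors="replace").decode("cp1252")

print("characters that would be lost:", [hex(ord(c)) for c in lost])
print("after a lossy round trip:", lossy)
```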

I wasn't happy with the above, so I did some more digging. I'm using Python 3.8.5 on my system, and locale.getpreferredencoding() returns 'cp1252'.
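A quick way to check the same settings on another machine is sketched below. The function name `encoding_report` is made up for this example; the calls it wraps are standard library.

```python
import locale
import sys


def encoding_report():
    """Hypothetical helper: collect the settings that decide Python's default text encoding."""
    return {
        "python_version": sys.version.split()[0],
        # Encoding used by open() when no encoding= argument is given.
        "preferred_encoding": locale.getpreferredencoding(False),
        # True when Python runs in UTF-8 mode (-X utf8 or PYTHONUTF8=1).
        "utf8_mode": bool(sys.flags.utf8_mode),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


report = encoding_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

On a Windows setup like the one described here, `preferred_encoding` would be expected to show cp1252 unless UTF-8 mode is on.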

Based on these:

I tried setting the default encoding used by Python to UTF-8. It worked when I specified -X utf8 after python, but not when I tried to set the environment variable using

set PYTHONUTF8=1

The call to ner.correct below worked for me:

python -X utf8 -m prodigy ner.correct [..other arguments]
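The sketch below checks whether each way of enabling UTF-8 mode actually reaches a Python process. It starts a child interpreter once with `-X utf8` and once with `PYTHONUTF8=1` in its environment, and prints what the child reports. The `probe` helper is hypothetical. One possible reason the `set` command had no effect, not confirmed in this thread, is the shell: in cmd.exe `set PYTHONUTF8=1` only applies to that session, and in PowerShell the equivalent is `$env:PYTHONUTF8 = "1"`.

```python
import os
import subprocess
import sys

# Code run in the child interpreter: print the UTF-8 mode flag and the default encoding.
PROBE = "import locale, sys; print(sys.flags.utf8_mode, locale.getpreferredencoding(False))"


def probe(extra_args=(), extra_env=None):
    """Hypothetical helper: run PROBE in a fresh interpreter and return its output."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a known state
    if extra_env:
        env.update(extra_env)
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", PROBE],
        capture_output=True,
        text=True,
        env=env,
        check=True,
    )
    return result.stdout.strip()


baseline = probe()
with_flag = probe(extra_args=["-X", "utf8"])
with_env = probe(extra_env={"PYTHONUTF8": "1"})

print("baseline:      ", baseline)
print("-X utf8:       ", with_flag)
print("PYTHONUTF8=1:  ", with_env)
```

If the last two lines both start with `1`, both mechanisms work for the interpreter itself, and any difference seen on the command line comes from how the variable was set in the shell.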

I'm not sure whether this may break something else, but for the purpose of just running ner.correct it works. I will update the thread if I run into any issues.

Thanks
Fabio

koaning · September 12, 2023, 9:25am

Hi Fabio,

that seems like a solid solution for now, nice find!

It's still hard for us to fully replicate the issue, but I can't think of a reason why your approach wouldn't work for the short-medium term.

In the long term we're interested in replacing plac with radicli in our codebase. I'd imagine that once that change is in, this issue should also go away, but it's something we'll keep in the back of our minds.

Balo · September 19, 2023, 9:07am

@fabiolus Thank you for this. I had the same issue and error message with input that worked flawlessly in older versions of Prodigy.
Using "python -X utf8" did the trick.
