UnicodeEncodeError during training

I started with an already trained NER model with 6 entity types. The model was trained on a large data set of about 9 million words and over 250,000 entities. To check how much Prodigy could improve the model, and how good its annotation suggestions would be, we ran a test with ner.teach. Just running ner.teach fails with the error below. Keep in mind that I checked the spaCy model locally and it was fine: it loaded in Python 2.7.13 without any problem and found all the entities. We tested it on different texts and loaded it several times without any problem. I also tried loading a spaCy model with an empty NER in Prodigy and, surprisingly, that works, so the problem seems to be specific to my model inside Prodigy. Since the model loads and works fine locally, I am confused about where the problem might be. Could you please take a look? Here is the error:

Using 6 labels: .....
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 90, in teach
    nlp = spacy.load(spacy_model)
  File "/usr/local/lib/python3.6/site-packages/spacy/__init__.py", line 15, in load
    return util.load_model(name, **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 653, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 641, in <lambda>
    self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
  File "vocab.pyx", line 376, in spacy.vocab.Vocab.from_disk
  File "strings.pyx", line 215, in spacy.strings.StringStore.from_disk
  File "strings.pyx", line 248, in spacy.strings.StringStore._reset_and_load
  File "strings.pyx", line 130, in spacy.strings.StringStore.add
  File "strings.pyx", line 21, in spacy.strings.hash_string
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf2' in position 0: surrogates not allowed
root@39396ecbf955:/prodigy/data#
root@39396ecbf955:/prodigy/data# prodigy ner.teach swedish_ner_finance_train NER data/text/ner-swedish-finance.txt --label EVENT
Using 1 labels: EVENT
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 90, in teach
    nlp = spacy.load(spacy_model)
  File "/usr/local/lib/python3.6/site-packages/spacy/__init__.py", line 15, in load
    return util.load_model(name, **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 653, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 641, in <lambda>
    self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
  File "vocab.pyx", line 376, in spacy.vocab.Vocab.from_disk
  File "strings.pyx", line 215, in spacy.strings.StringStore.from_disk
  File "strings.pyx", line 248, in spacy.strings.StringStore._reset_and_load
  File "strings.pyx", line 130, in spacy.strings.StringStore.add
  File "strings.pyx", line 21, in spacy.strings.hash_string
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf2' in position 0: surrogates not allowed
root@39396ecbf955:/prodigy/data# cd NER/
root@39396ecbf955:/prodigy/data/NER# ls
accuracy.json meta.json ner tokenizer vocab
root@39396ecbf955:/prodigy/data/NER# cd ..
root@39396ecbf955:/prodigy/data# prodigy ner.teach swedish_ner_finance_train NER data/text/ner-swedish-finance.txt --patterns data/pattern-files/patterns.jsonl --label 
Using 6 labels: EVENT, MORTGAGE_LOAN, CUSTOMER_SITUATION, PRODUCT, COMPANY, PERSON
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 90, in teach
    nlp = spacy.load(spacy_model)
  File "/usr/local/lib/python3.6/site-packages/spacy/__init__.py", line 15, in load
    return util.load_model(name, **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 653, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 641, in <lambda>
    self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
  File "vocab.pyx", line 376, in spacy.vocab.Vocab.from_disk
  File "strings.pyx", line 215, in spacy.strings.StringStore.from_disk
  File "strings.pyx", line 248, in spacy.strings.StringStore._reset_and_load
  File "strings.pyx", line 130, in spacy.strings.StringStore.add
  File "strings.pyx", line 21, in spacy.strings.hash_string
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf2' in position 0: surrogates not allowed
root@39396ecbf955:/prodigy/data# ls
NER pattern-files prodigy.db	text
root@39396ecbf955:/prodigy/data# prodigy ner.teach swedish_ner_finance_train NER/ data/text/ner-swedish-finance.txt --patterns data/pattern-files/patterns.jsonl --label EVENT,MORTGAGE_LOAN,CUSTOMER_SITUATION,PRODUCT,COMPANY,PERSON
Using 6 labels: EVENT, MORTGAGE_LOAN, CUSTOMER_SITUATION, PRODUCT, COMPANY, PERSON
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 90, in teach
    nlp = spacy.load(spacy_model)
  File "/usr/local/lib/python3.6/site-packages/spacy/__init__.py", line 15, in load
    return util.load_model(name, **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 653, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 641, in <lambda>
    self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
  File "vocab.pyx", line 376, in spacy.vocab.Vocab.from_disk
  File "strings.pyx", line 215, in spacy.strings.StringStore.from_disk
  File "strings.pyx", line 248, in spacy.strings.StringStore._reset_and_load
  File "strings.pyx", line 130, in spacy.strings.StringStore.add
  File "strings.pyx", line 21, in spacy.strings.hash_string
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf2' in position 0: surrogates not allowed

thanks

Hi @kasra,

It looks like there are some encoding issues. Could you check your system locale with the locale command and paste the output? I suspect you might not have UTF-8 encoding by default, which can mess things up if things are being piped through stdin or stdout.
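As a complement to the locale command, it can help to see which encodings the Python interpreter itself derives from the environment. The snippet below is only a small sketch using the standard library; the helper name encoding_report is made up for illustration and is not part of spaCy or Prodigy.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings this Python process derived from its environment."""
    return {
        # Encoding used by default for text files opened without encoding=...
        "preferred": locale.getpreferredencoding(False),
        # Encoding used to decode file names and command-line arguments
        "filesystem": sys.getfilesystemencoding(),
        # Encoding of the standard output stream (may be None if redirected oddly)
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print(f"{name}: {value}")
```

Run it with the same interpreter and in the same shell or container that runs Prodigy. If any value comes back as ascii or ANSI_X3.4-1968 rather than UTF-8, that would be consistent with a locale problem, although it does not prove that the locale is the cause.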

Best,
Matt

Hi Matt,
this is the response:
root@39396ecbf955:/prodigy# locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=

Yeah, I think LANGUAGE and LC_ALL need to be set. If you do export LC_ALL=C.UTF-8 and rerun, does it work?

I came across this thread with suggested solutions, especially editing /etc/environment or /etc/default/locale: https://askubuntu.com/questions/162391/how-do-i-fix-my-locale-issue . I’m not sure if the advice is still current, or whether it’s appropriate for your distribution.
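For background on the error message itself: the character '\udcf2' is a lone surrogate, which is what Python produces when a byte that is not valid UTF-8 (here 0xF2) is decoded with the surrogateescape error handler. Python uses that handler in several places, for example for file names and command-line arguments. The sketch below is not taken from spaCy or Prodigy and does not show where the bad byte entered in this particular setup; it only reproduces the same kind of failure in isolation.

```python
# A single byte that is not valid UTF-8 on its own (it would be "ò" in Latin-1).
raw = b"\xf2"

# surrogateescape carries the undecodable byte through as the lone surrogate U+DCF2.
text = raw.decode("utf-8", errors="surrogateescape")


def try_encode(value):
    """Return the UTF-8 bytes for value, or the UnicodeEncodeError that was raised."""
    try:
        return value.encode("utf-8")
    except UnicodeEncodeError as err:
        return err


# Strict UTF-8 encoding refuses lone surrogates, which is the failure in the traceback.
result = try_encode(text)

# Encoding with the same handler restores the original byte.
restored = text.encode("utf-8", errors="surrogateescape")

if __name__ == "__main__":
    print(repr(text))
    print(result)
    print(restored)
```

So a string containing such a character usually means that some bytes were decoded under the wrong encoding assumption earlier on, which is why checking the locale is a reasonable first step.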

The following should work; see this thread for details:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

Hi Ines and Matt

Thanks for your prompt reply, the issue is solved. Just one question: the spaCy model I trained and used does not include a tagger. Does this affect the quality of Prodigy's results or not?

Thanks a lot

No, in spaCy v2, the pipeline components are independent and can be trained separately. So whether the pipeline has a tagger or not doesn't matter for training the entity recognizer.