unicode issue during the training

Kasra · December 4, 2018, 4:01pm

Hi,
I got the same error again. I followed your previous comment and ran the "locale" command, and everything seems fine. I also checked this:
"import locale
print(locale.getlocale())"
and this is also fine. Everything is set to "en_US.UTF-8", but I still get the following error.

root@8bc7577bb360:/prodigy# locale
LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
root@8bc7577bb360:/prodigy# python
Python 3.6.7 (default, Nov 16 2018, 22:39:40)
[GCC 4.9.2] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import spacy  # we use spaCy 2.0.12
>>> nlp = spacy.load('/prodigy/data/NER')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.6/site-packages/spacy/__init__.py", line 15, in load
return util.load_model(name, **overrides)
File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 116, in load_model
return load_model_from_path(Path(name), **overrides)
File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
return nlp.from_disk(model_path)
File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 653, in from_disk
util.from_disk(path, deserializers, exclude)
File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
reader(path / key)
File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 641, in <lambda>
self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
File "vocab.pyx", line 376, in spacy.vocab.Vocab.from_disk
File "strings.pyx", line 215, in spacy.strings.StringStore.from_disk
File "strings.pyx", line 248, in spacy.strings.StringStore._reset_and_load
File "strings.pyx", line 130, in spacy.strings.StringStore.add
File "strings.pyx", line 21, in spacy.strings.hash_string
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf2' in position 0: surrogates not allowed

honnibal · December 5, 2018, 1:54am

It seems that's a UTF-16 representation of an invalid character marker: https://www.charbase.com/dcf2-unicode-invalid-character

So, the question is how that’s ended up in the model. Could it be that when the model was trained, you had a locale issue, and that meant the data was corrupted somewhere, leading to bad data being saved?
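
As a rough illustration of how such a character can appear: in Python 3, decoding bytes with the "surrogateescape" error handler maps an undecodable byte 0xF2 to the lone surrogate '\udcf2', and encoding that string strictly as UTF-8 then fails with the same message as in the traceback. This is only a sketch of the mechanism. That this is what happened to the model data is an assumption, not something the traceback confirms.

```python
# A raw byte that is not valid UTF-8 on its own (0xF2 is "ò" in Latin-1).
raw = b"\xf2"

# "surrogateescape" does not raise; it smuggles the bad byte through
# as a lone low surrogate in the range U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Strict UTF-8 encoding refuses lone surrogates.
message = None
try:
    text.encode("utf-8")
except UnicodeEncodeError as err:
    message = str(err)
    print(message)

# The same handler reverses the mapping and restores the original byte.
restored = text.encode("utf-8", errors="surrogateescape")
```

If a string like this was produced while the training data was read and then saved with the model, any later strict UTF-8 encoding step would fail on it.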

Kasra · December 5, 2018, 6:33am

Hi Matt,

The strange thing is that when I load the same model locally within spaCy, everything is OK.
How could that happen? I checked, and I have been using the same trained Swedish model within spaCy for about two months now.
I dug into the problem for a while and wonder: if you train the model on Python 2.7 (which I did) and then use it within Prodigy on Python 3, is that fine? We trained the model on Python 2.7 and are trying to run it on Python 3.6. Could this be an issue?
The error seems to occur when spaCy tries to read from the vectors.
When I try to install Prodigy on Python 2.7, I get this error:

pip install prodigy-1.5.1-cp35.cp36-cp35m.cp36m-linux_x86_64.whl

prodigy-1.5.1-cp35.cp36-cp35m.cp36m-linux_x86_64.whl is not a supported wheel on this platform.

So Prodigy is probably not even supported on Python 2.7.

Thanks

ines · December 5, 2018, 12:57pm

Is it possible that there's an invalid character in the vectors you've added? This could explain why you don't come across this problem when you just use the model in spaCy, but it does occur when the vectors are loaded during training?

Yes, training the model on Python 2.7 and then using it on Python 3 should be no problem at all. We train all the spaCy models we distribute on Python 3, but they're still compatible with Python 2.

Yes, Prodigy only supports Python 3.5+. Since it's a developer tool and the models you train are still cross-compatible, there's not really a need for it to support (pretty much legacy) Python 2.7.
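
To follow up on the question of an invalid character in the model data, here is a minimal sketch for finding strings that contain surrogate code points such as '\udcf2'. The helper names `find_surrogate_strings` and `load_strings` are made up for this example. It is also an assumption that the model's string store is saved as a JSON list of strings (for example a `vocab/strings.json` file inside the model directory), so check your own model's layout first.

```python
import json


def find_surrogate_strings(strings):
    """Return the strings that contain a code point in U+D800..U+DFFF."""
    return [
        s for s in strings
        if any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)
    ]


def load_strings(path):
    """Load a JSON list of strings (assumed file format, see above).

    errors="surrogatepass" lets surrogates that were written as raw
    bytes through, so the file can be read and inspected at all.
    """
    with open(path, encoding="utf-8", errors="surrogatepass") as f:
        return json.load(f)


# Small in-memory example; the third entry is the character from the error.
sample = ["hej", "v\u00e4rlden", "\udcf2", "caf\u00e9"]
bad = find_surrogate_strings(sample)

# ascii() keeps the output printable even though the string itself
# cannot be encoded as UTF-8.
print([ascii(s) for s in bad])
```

On a real model you would pass the result of `load_strings(...)` for your own strings file to `find_surrogate_strings`. Any hits point to the entries that need to be cleaned or regenerated from the original data.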
