Unicode issue during training

usage
spacy

(Kasra) #1

Hi
I got the same error again. I followed your previous comment and ran the "locale" command, and everything seems fine. I also checked this:

import locale
print(locale.getlocale())

and this is fine too. Everything is set to "en_US.UTF-8", but I still get the following error.

root@8bc7577bb360:/prodigy# locale
LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
root@8bc7577bb360:/prodigy# python
Python 3.6.7 (default, Nov 16 2018, 22:39:40)
[GCC 4.9.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacy  # we use spaCy 2.0.12
>>> nlp = spacy.load('/prodigy/data/NER')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/spacy/__init__.py", line 15, in load
    return util.load_model(name, **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 653, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
    reader(path / key)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 641, in <lambda>
    self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self))),
  File "vocab.pyx", line 376, in spacy.vocab.Vocab.from_disk
  File "strings.pyx", line 215, in spacy.strings.StringStore.from_disk
  File "strings.pyx", line 248, in spacy.strings.StringStore._reset_and_load
  File "strings.pyx", line 130, in spacy.strings.StringStore.add
  File "strings.pyx", line 21, in spacy.strings.hash_string
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf2' in position 0: surrogates not allowed
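
Since locale.getlocale() only reports part of the picture, it may help to print the other encodings the interpreter has actually picked up. The following is a minimal diagnostic sketch using only the standard library; the helper name encoding_report is made up for illustration, and a clean report here does not rule out that the data on disk is already corrupted.

```python
import locale
import sys


def encoding_report():
    """Collect the encoding-related settings this Python process is using."""
    return {
        "locale": locale.getlocale(),
        # False: report the preferred encoding without calling setlocale()
        "preferred_encoding": locale.getpreferredencoding(False),
        # Used when decoding file names and environment variables
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Always 'utf-8' on Python 3, but 'ascii' on Python 2
        "default_encoding": sys.getdefaultencoding(),
        # May be None when stdout is replaced or not a text stream
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


report = encoding_report()
for name, value in report.items():
    print(f"{name}: {value}")
```

Running the same script inside the container and on the local machine where the model loads correctly would show whether the two environments really agree.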


(Matthew Honnibal) #2

It seems that’s a UTF-16 surrogate code point, which on its own is an invalid character marker and cannot be encoded as UTF-8: https://www.charbase.com/dcf2-unicode-invalid-character

So, the question is how that’s ended up in the model. Could it be that when the model was trained, you had a locale issue, and that meant the data was corrupted somewhere, leading to bad data being saved?
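
One way a lone surrogate such as '\udcf2' can appear in a Python 3 string is the "surrogateescape" error handler, which maps an undecodable byte 0xNN to U+DCNN. This is an assumption about the mechanism, not something confirmed in this thread, but it would be consistent with a byte 0xF2 having been read under the wrong encoding at some point. A small sketch that reproduces the same error message:

```python
# A single byte that is not valid UTF-8 on its own.
raw = b"\xf2"

# With surrogateescape, the bad byte is smuggled into the string
# as the lone low surrogate U+DCF2 instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Strict UTF-8 encoding refuses lone surrogates.
try:
    text.encode("utf-8")
    strict_error = None
except UnicodeEncodeError as err:
    strict_error = str(err)
    print(strict_error)

# Encoding with the same handler restores the original byte.
round_trip = text.encode("utf-8", errors="surrogateescape")
```

If this is what happened, the string was already damaged when it was written to the model, and the strict encode in hash_string is only where it finally surfaces.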


(Kasra) #3

Hi, Matt

The strange thing is that when I load the same model locally in spaCy, everything is OK.
How could that happen? I checked, and I have been using the same trained Swedish model in spaCy for about two months now.
I dug into the problem for a while and wonder: if you train the model in Python 2.7 (which I did) and then use it in Prodigy with Python 3, is that fine? We trained the model on Python 2.7 and are trying to run it on Python 3.6. Could this be an issue?
The error seems to occur when spaCy tries to read the vectors.
When I try to install Prodigy on Python 2.7, I get this error:

pip install prodigy-1.5.1-cp35.cp36-cp35m.cp36m-linux_x86_64.whl

prodigy-1.5.1-cp35.cp36-cp35m.cp36m-linux_x86_64.whl is not a supported wheel on this platform.

So Prodigy is probably not even supported on Python 2.7.
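
The wheel filename itself lists which interpreters it targets. As a rough sketch, assuming the standard wheel naming scheme (name-version-python-abi-platform, with an optional build tag after the version), the tags can be split out like this; parse_wheel_tags is a made-up helper, not part of pip:

```python
def parse_wheel_tags(filename):
    """Split a wheel filename into name, version and compatibility tags."""
    stem = filename[: -len(".whl")]
    parts = stem.split("-")
    # The last three fields are always python tag, ABI tag, platform tag;
    # each may hold several alternatives separated by dots.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag.split("."),
        "abi": abi_tag.split("."),
        "platform": platform_tag.split("."),
    }


tags = parse_wheel_tags("prodigy-1.5.1-cp35.cp36-cp35m.cp36m-linux_x86_64.whl")
print(tags["python"])

# No CPython 2.7 or generic Python 2 tag is present in this wheel.
supports_py27 = any(t in ("cp27", "py2", "py27") for t in tags["python"])
print(supports_py27)
```

The python tags here cover only CPython 3.5 and 3.6, which matches the "not a supported wheel" message from pip on 2.7.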

Thanks


(Ines Montani) #4

Is it possible that there’s an invalid character in the vectors you’ve added? This could explain why you don’t come across this problem when you just use the model in spaCy, but it does occur when the vectors are loaded during training?
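
To check for this, one option is to scan the model's strings for lone surrogates. The sketch below is only an illustration: it assumes the strings are stored as a JSON list in a file such as vocab/strings.json inside the model directory, which may not match your spaCy version, and the helper names are made up.

```python
import json


def has_surrogate(text):
    """True if the string contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def find_bad_strings(strings):
    """Return (index, escaped string) for every entry with a surrogate."""
    return [(i, ascii(s)) for i, s in enumerate(strings) if has_surrogate(s)]


def load_strings(path):
    """Load a JSON list of strings; the file layout is an assumption."""
    # surrogateescape keeps invalid bytes visible instead of failing on read
    with open(path, encoding="utf-8", errors="surrogateescape") as f:
        return json.load(f)


# Hypothetical usage against the model from this thread:
#   bad = find_bad_strings(load_strings("/prodigy/data/NER/vocab/strings.json"))

# Self-contained demo with one deliberately broken entry.
sample = ["hej", "v\udcf2rld", "Stockholm"]
bad = find_bad_strings(sample)
print(bad)
```

Any entries this reports would point to where the invalid character entered the data, for example a vectors file that was read with the wrong encoding.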

As for training on Python 2.7 and loading on Python 3: yes, this should be no problem at all. We train all the spaCy models we distribute on Python 3, but they’re still compatible with Python 2.

And yes, Prodigy only supports Python 3.5+. Since it’s a developer tool and the models you train are still cross-compatible, there’s not really a need for it to support (pretty much legacy) Python 2.7.