UnicodeDecodeError while training Japanese model

(Kavya Gujjala) #1

Hi,

I want to train a blank Japanese model using the ner.manual command, but I am getting an encoding error. Does anything have to be exported, like you mentioned for the English language model?

Command used:

prodigy ner.manual test jap_model_vm_1 out1.txt --label ORG

The out1.txt file contains Japanese text.

The error looks like this:

Using 1 labels: ORG
Traceback (most recent call last):
  File "/usr/local/conda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/conda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/conda3/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 178, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/core.pyx", line 55, in prodigy.core.Controller.__init__
  File "/usr/local/conda3/lib/python3.6/site-packages/toolz/itertoolz.py", line 368, in first
    return next(iter(seq))
  File "cython_src/prodigy/core.pyx", line 84, in iter_tasks
  File "cython_src/prodigy/components/preprocess.pyx", line 107, in add_tokens
  File "/usr/local/conda3/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 117, in make_doc
    return self.tokenizer(text)
  File "/usr/local/conda3/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 79, in __call__
    dtokens = detailed_tokens(self.tokenizer, text)
  File "/usr/local/conda3/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 60, in detailed_tokens
    parts = node.feature.split(',')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 0: invalid start byte

What could be the reason?


(Ines Montani) #2

This usually indicates that the text file isn’t encoded as valid UTF-8 (Unicode). Could you try explicitly changing the encoding to UTF-8? For example, from the command line using a tool like iconv, or in your text editor (e.g. in Visual Studio Code: Change encoding > UTF-8).
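
If it helps, here is a minimal Python sketch of the same idea: check whether the file's raw bytes decode as UTF-8 and, if not, re-encode them. The function names are made up for illustration, and the source encoding (Shift JIS here) is only an assumption; you would need to substitute whatever encoding your file actually uses (EUC-JP is another common one for Japanese text).

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def to_utf8_bytes(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def convert_file(src: str, dst: str, source_encoding: str = "shift_jis") -> None:
    """Write a UTF-8 copy of src to dst.

    source_encoding is an assumption about the input file; it is only
    used when the file is not already valid UTF-8.
    """
    raw = Path(src).read_bytes()
    if not is_valid_utf8(raw):
        raw = to_utf8_bytes(raw, source_encoding)
    Path(dst).write_bytes(raw)
```

For example, `convert_file("out1.txt", "out1_utf8.txt")` would write a UTF-8 copy under a new (hypothetical) file name. If `is_valid_utf8` already returns True for the file's bytes and the error still occurs, the input file is probably not the cause.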


(Kavya Gujjala) #3

Hi,
Thanks for the reply.

I tried using iconv as you suggested, but the same error still comes up.

I have trained a blank model on 1500 sentences of Japanese text and am using it as the base model in Prodigy.

Command used:

prodigy ner.teach jap_dataset_try_1 japanese/models/jap_model_5_iter_drop_0.1 japanese/train/jap_sentences.txt

Traceback (most recent call last):
  File "/usr/local/conda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/conda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/conda3/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 178, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/core.pyx", line 55, in prodigy.core.Controller.__init__
  File "/usr/local/conda3/lib/python3.6/site-packages/toolz/itertoolz.py", line 368, in first
    return next(iter(seq))
  File "cython_src/prodigy/core.pyx", line 84, in iter_tasks
  File "cython_src/prodigy/components/sorters.pyx", line 136, in __iter__
  File "cython_src/prodigy/components/sorters.pyx", line 51, in genexpr
  File "cython_src/prodigy/models/ner.pyx", line 265, in __call__
  File "cython_src/prodigy/models/ner.pyx", line 233, in get_tasks
  File "cytoolz/itertoolz.pyx", line 1047, in cytoolz.itertoolz.partition_all.__next__
  File "cython_src/prodigy/models/ner.pyx", line 192, in predict_spans
  File "cytoolz/itertoolz.pyx", line 1047, in cytoolz.itertoolz.partition_all.__next__
  File "cython_src/prodigy/components/preprocess.pyx", line 36, in split_sentences
  File "/usr/local/conda3/lib/python3.6/site-packages/spacy/language.py", line 548, in pipe
    for doc, context in izip(docs, contexts):
  File "/usr/local/conda3/lib/python3.6/site-packages/spacy/language.py", line 572, in pipe
    for doc in docs:
  File "nn_parser.pyx", line 367, in pipe
  File "cytoolz/itertoolz.pyx", line 1047, in cytoolz.itertoolz.partition_all.__next__
  File "/usr/local/conda3/lib/python3.6/site-packages/spacy/language.py", line 746, in _pipe
    for doc in docs:
  File "/usr/local/conda3/lib/python3.6/site-packages/spacy/language.py", line 551, in <genexpr>
    docs = (self.make_doc(text) for text in texts)
  File "/usr/local/conda3/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 117, in make_doc
    return self.tokenizer(text)
  File "/usr/local/conda3/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 79, in __call__
    dtokens = detailed_tokens(self.tokenizer, text)
  File "/usr/local/conda3/lib/python3.6/site-packages/spacy/lang/ja/__init__.py", line 60, in detailed_tokens
    parts = node.feature.split(',')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 2: invalid start byte
Exception ignored in: <generator object at 0x7fdd2f97b288>
SystemError: <built-in function delete_Tagger> returned a result with an error set

I got this error.

Can you please help me with this?


(Ines Montani) #4

It looks like there’s a problem with pickling the optional external library used for Japanese tokenization, so this is more of a spaCy issue. See here for details: