Problem with Japanese NER

Hi, I just started using Prodigy and am now working on processing Japanese sentences.
I was able to generate a Japanese spaCy model and successfully ran text classification.
However, named entity recognition doesn't work well.

I made a simple Japanese spaCy model based on this document.


The only thing I changed was the name of the language:
(nlp = spacy.blank('en') to nlp = spacy.blank('ja'))

The model was exported successfully.
And the text classification task works well.

prodigy textcat.teach my_db ja_spacy_model text_data.jsonl --label TEST
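For context, I'm assuming text_data.jsonl follows Prodigy's standard JSONL input format: one JSON object per line with a "text" key. A minimal sketch (the Japanese sentences here are made up for illustration):

```python
import json

# Hypothetical contents of text_data.jsonl: one JSON object per line,
# each carrying the raw text under the "text" key.
lines = [
    '{"text": "これは最初の文です。"}',
    '{"text": "これは二番目の文です。"}',
]

# Each line parses independently, which is what the JSONL loader relies on.
for line in lines:
    task = json.loads(line)
    print(task["text"])
```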

However, the named entity task shut down with an error.

prodigy ner.teach my_db ja_spacy_model text_data.jsonl
TypeError: can't pickle _thread.lock objects

These are the verbose logs.

21:19:58 - RECIPE: Calling recipe 'ner.teach'
21:19:58 - RECIPE: Starting recipe ner.teach
{'unsegmented': False, 'exclude': None, 'patterns': None, 'label': None, 'loader': None, 'api': None, 'source': 'text_data.jsonl', 'spacy_model': 'ja_spacy_model', 'dataset': 'my_db'}

21:19:58 - LOADER: Using file extension 'jsonl' to find loader
text_data.jsonl

21:19:58 - LOADER: Loading stream from jsonl
21:19:58 - LOADER: Rehashing stream
21:19:59 - RECIPE: Creating EntityRecognizer using model ja_spacy_model
Traceback (most recent call last):
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 92, in teach
    model = EntityRecognizer(nlp, label=label)
  File "cython_src/prodigy/models/ner.pyx", line 165, in prodigy.models.ner.EntityRecognizer.__init__
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/copy.py", line 169, in deepcopy
    rv = reductor(4)

I searched for this kind of error in the forum but couldn't find it.
Since textcat.teach works well, I think this error is rooted not in spaCy but in Prodigy, and might only occur for Japanese NER.
Am I doing something wrong? What should I do to fix it?

Thanks in advance.

Thanks for reporting this. I’m sorry for the delay in replying.

The bug is occurring because the EntityRecognizer tries to make a copy of the nlp object, which ends up calling into the pickle module. It appears that the Janome tokenizer which is being used for Japanese doesn’t support the Pickle protocol well, triggering the error.
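To illustrate the mechanism (this is a minimal stand-in, not Prodigy's or Janome's actual code): deepcopy falls back to the pickle protocol, and any object holding a thread lock fails there with exactly this TypeError.

```python
import copy
import threading

# Stand-in for a tokenizer that keeps an internal lock, as the
# Janome-based Japanese tokenizer apparently does.
class LockedTokenizer:
    def __init__(self):
        self._lock = threading.Lock()

tok = LockedTokenizer()
try:
    # deepcopy falls back to __reduce_ex__ (the pickle protocol),
    # which cannot handle _thread.lock objects.
    copy.deepcopy(tok)
except TypeError as err:
    print("deepcopy failed:", err)
```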

As a temporary workaround, could you try making the following modification to the /Users/eqsuke/.pyenv/versions/3.6.2/lib/python3.6/site-packages/prodigy/recipes/ner.py file?

On line 92, replace the call to model = EntityRecognizer(nlp, label=label) with:

# Detach the tokenizer before copying, since it can't be pickled
tokenizer = nlp.tokenizer
nlp.tokenizer = None
model = EntityRecognizer(nlp, label=label)
# Reattach the original tokenizer to both copies of the pipeline
model.nlp.tokenizer = tokenizer
model.orig_nlp.tokenizer = tokenizer
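The workaround follows a general detach/copy/reattach pattern. A self-contained sketch with stand-in names (not Prodigy's actual classes), using a raw lock to play the role of the unpicklable tokenizer:

```python
import copy
import threading

class Pipeline:
    """Stand-in for an nlp object whose tokenizer can't be pickled."""
    def __init__(self):
        self.tokenizer = threading.Lock()   # unpicklable component
        self.weights = {"ner": [0.1, 0.2]}  # picklable state

nlp = Pipeline()

# Detach the unpicklable component, copy the rest, then reattach
# the same tokenizer instance to both objects.
tokenizer = nlp.tokenizer
nlp.tokenizer = None
nlp_copy = copy.deepcopy(nlp)  # succeeds now that the lock is gone
nlp.tokenizer = tokenizer
nlp_copy.tokenizer = tokenizer
```

Note that both pipelines end up sharing the one tokenizer instance, which is fine for copying model weights but worth keeping in mind if the tokenizer carries mutable state.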

In the next release of spaCy we are switching from Janome to the MeCab tokenizer for Japanese, which reportedly has better results. I wonder whether the same bug occurs. We should have a test case in spaCy to check that the same error doesn’t occur.


Thank you very much.
The problem has been solved :grinning:

I think the same issue occurs in batch-train.
I'll try to apply the same replacement.

> In the next release of spaCy we are switching from Janome to the MeCab tokenizer for Japanese

Sounds great.
I hope this won't occur in the next release.

Thanks again.