Support for Japanese NER in spaCy!

Hi,

Do we have Japanese NER support in spaCy?

Yes – see here for the list of supported languages:

Japanese is supported via a third-party dependency, mecab. So you need to have that installed as well. But once you're set up, you'll be able to save out a blank model and add components to it – for example, train a named entity recognizer or text classifier using Prodigy. (We actually have several users training Chinese models with Prodigy, so the process for Japanese should be similar.)
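
If you want to confirm that the dependency is actually importable before building the model, a quick stdlib-only check like the following works. This is just an illustrative sketch; it assumes you installed the mecab-python3 package, which exposes its bindings as the `MeCab` module:

```python
import importlib.util

# mecab-python3 exposes its bindings as the "MeCab" module, so checking
# for that spec tells us whether spacy.blank('ja') will be able to load.
mecab_available = importlib.util.find_spec("MeCab") is not None
print("MeCab installed:", mecab_available)
```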

To save out a blank base model, you can run the following:

import spacy

nlp = spacy.blank('ja')
nlp.to_disk('/path/to/model')

You can then use the model directory as the base model in Prodigy :slightly_smiling_face:

Hi,

I made the blank Japanese model… but how do we tag via the ner.teach or ner.make-gold recipes? We are not familiar with the Japanese language.

Well… there's really no solution there! You definitely need Japanese knowledge to create Japanese annotations.

Hi,
I have saved the blank Japanese model and am trying to use the ner.teach recipe with it.

prodigy ner.teach skn_jap_1 j-model2 skincare_reviews_jap/skincare_reviews_ar_jap_1.txt --label SKINCARE --patterns skincare_ner_new2.jsonl

But this command is throwing this error:
TypeError: can't pickle SwigPyObject objects
Verbose Log

  File "/usr/local/conda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/conda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/conda3/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/conda3/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/conda3/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/conda3/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 92, in teach
    model = EntityRecognizer(nlp, label=label)
  File "cython_src/prodigy/models/ner.pyx", line 165, in prodigy.models.ner.EntityRecognizer.__init__
  File "/usr/local/conda3/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/local/conda3/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/local/conda3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/local/conda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/local/conda3/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/local/conda3/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/local/conda3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/local/conda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/local/conda3/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/local/conda3/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/local/conda3/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/local/conda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/local/conda3/lib/python3.6/copy.py", line 169, in deepcopy
    rv = reductor(4)
TypeError: can't pickle SwigPyObject objects

It looks like there’s a problem with pickling the external library that’s optional for Japanese tokenization. We hadn’t seen this before, but will definitely look into it.

In the meantime, I think the following workaround should avoid the problem. I haven't tested it myself, as I'm currently on a machine where it's hard to install mecab, so apologies if any detail of this is incorrect.

from spacy.lang.ja import JapaneseTokenizer
import copyreg

def pickle_ja_tokenizer(instance):
    # Rebuild the tokenizer from scratch instead of trying to serialize
    # its unpicklable SWIG-wrapped mecab state.
    return JapaneseTokenizer, tuple()

# Register the reducer so copy/pickle know how to handle the tokenizer.
copyreg.pickle(JapaneseTokenizer, pickle_ja_tokenizer)

The idea here is to use the copyreg module to instruct Python on how to copy the object.
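
To illustrate the mechanism with only the standard library, here's a self-contained sketch using a hypothetical `NativeWrapper` class as a stand-in for the tokenizer. Without the registered reducer, deepcopy fails on the unpicklable attribute (just like the traceback above); with it, the object is simply reconstructed:

```python
import copy
import copyreg

class NativeWrapper:
    """Hypothetical stand-in for a tokenizer holding an unpicklable
    native handle (like the SWIG-wrapped mecab Tagger)."""
    def __init__(self):
        # Generators can't be pickled, so a plain deepcopy of this
        # object would raise a TypeError when it reaches this attribute.
        self.handle = (i for i in range(3))

def pickle_native_wrapper(instance):
    # Tell Python to "copy" the object by constructing a fresh one,
    # so the native handle is rebuilt rather than serialized.
    return NativeWrapper, tuple()

copyreg.pickle(NativeWrapper, pickle_native_wrapper)

wrapper = NativeWrapper()
clone = copy.deepcopy(wrapper)  # now uses the registered reducer
print(type(clone).__name__)     # NativeWrapper
```

deepcopy consults the same copyreg dispatch table as pickle, which is why one registration fixes both copying and serialization.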

I am using the blank Japanese model with MeCab.

I used the ner.manual recipe:

prodigy ner.manual dataset_test_1 path/to/txt/file --label NORP,GPE,ORG,MONEY,TIME

But after running this, I got the following error:

Using 5 labels: NORP,GPE,ORG,MONEY,TIME
Added dataset dataset_test_1 to database SQLite.
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nihar/local/conda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/nihar/local/conda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/nihar/local/conda3/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 178, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/core.pyx", line 55, in prodigy.core.Controller.__init__
  File "/nihar/local/conda3/lib/python3.6/site-packages/toolz/itertoolz.py", line 368, in first
    return next(iter(seq))
  File "cython_src/prodigy/core.pyx", line 84, in iter_tasks
SystemError: <built-in function delete_Tagger> returned a result with an error set

Swig == 3.0.12
mecab-python3 == 0.996.1

Hi! This already came up in another thread – see here for details and a possible solution:

It might be better to have an issue on this in the spaCy tracker, maybe there's something we can do in spaCy to prevent this.

Thank you for the reply

The link shows the error about the SwigPyObject.

I'm getting:

SystemError: <built-in function delete_Tagger> returned a result with an error set

Is this the same issue?

It would be of great help if you could assist me with this, because I'm stuck and unable to work on the Japanese model! :neutral_face: