How is the support for Languages other than English?

jiuren · July 24, 2018, 6:03am

Hi,
First of all, really nice work!

I am curious about the support for languages other than English, especially for CJK languages?

I couldn’t find any clue about that from the online demo.

Thanks in advance!

Best,
Wayne

honnibal · July 24, 2018, 8:46am

Prodigy uses spaCy for NLP by default, but you can change this and write recipes that use any other NLP library instead.

We don’t have pre-trained NER models for CJK languages in spaCy yet, but we have segmentation for Chinese and Japanese based on third-party libraries. For text classification, I would expect everything to work fine.

I would suggest giving the CJK support in spaCy a try. If you find that works OK, you’ll probably find Prodigy works well too.

ines · July 24, 2018, 11:02am

To add to @honnibal's comment above, here's a thread that shows an example of using Prodigy with languages that spaCy doesn't yet provide pre-trained models for (in this case, to train a Norwegian text classifier):

And this thread discusses using Prodigy to add NER and text classification capabilities to a Chinese spaCy model (which, according to the user, seems to have worked well):

chenrui2436 · March 17, 2020, 3:57am

About Prodigy, I want to know whether Prodigy supports Chinese NER the way it supports English. I found that if I apply Prodigy to Chinese the same way as to English, the results I get on port 8080 contain no Chinese sentences or words. The only thing I can do is give Prodigy a Chinese dataset, use the ner.manual recipe to annotate manually, and use the ner.teach recipe to accept or reject suggestions. Is there any other support that would help me get more intelligent NER?

ines · March 17, 2020, 9:58am

We have several users who work with Chinese in Prodigy, so yes! We just don't have a pretrained Chinese NER model for spaCy, so you need to do that part yourself.

If you're working with Chinese, you need to use a Chinese base model with a Chinese tokenizer. See the links posted above for an example of how to save out a base model and tokenizer for a different language. You can use spacy.blank("zh"), save it to disk and use that as the base model for annotation. Maybe you can also use some match patterns to make the initial annotation faster.
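
A minimal sketch of those two steps, in case it helps. It assumes spaCy is installed (depending on the spaCy version, `spacy.blank("zh")` may also need a third-party segmentation package such as jieba). The helper names, paths and labels are made up for illustration, and the patterns file uses the JSONL layout with one `label` and one exact-string `pattern` per line:

```python
import json
from pathlib import Path


def save_blank_chinese_model(output_dir):
    """Create a blank Chinese pipeline (tokenizer only) and save it to disk."""
    # Imported here so the pattern helper below can be used without spaCy.
    import spacy

    nlp = spacy.blank("zh")
    nlp.to_disk(output_dir)
    return output_dir


def write_patterns(terms_by_label, path):
    """Write one exact-string match pattern per line (JSONL) and return the count."""
    lines = []
    for label, terms in terms_by_label.items():
        for term in terms:
            # ensure_ascii=False keeps the Chinese text readable in the file.
            lines.append(json.dumps({"label": label, "pattern": term}, ensure_ascii=False))
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

For example, `save_blank_chinese_model("./zh_base_model")` followed by `write_patterns({"GPE": ["北京", "上海"]}, "zh_patterns.jsonl")` should leave you with a model directory and a patterns file that you can pass to a recipe like ner.manual as the base model and the patterns. String patterns are used here rather than token patterns because how a term is split into tokens depends on the Chinese tokenizer you end up with.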
