How is the support for languages other than English?

Hi,
First of all, really nice work!

I am curious about the support for languages other than English, especially CJK languages.

I couldn’t find anything about that in the online demo.

Thanks in advance!

Best,
Wayne

Prodigy uses spaCy for NLP by default, although you can also change this, and write recipes that use any other NLP library instead.

We don’t have pre-trained NER models for CJK languages in spaCy yet, but we have segmentation for Chinese and Japanese based on third-party libraries. For text classification, I would expect everything to work fine.

I would suggest giving the CJK support in spaCy a try. If you find that works OK, you’ll probably find Prodigy works well too.

To add to @honnibal’s comment above, here’s a thread that shows an example of using Prodigy with languages that spaCy doesn’t yet provide pre-trained models for (in this case, to train a Norwegian text classifier):

And this thread discusses using Prodigy to add NER and text classification capabilities to a Chinese spaCy model (which, according to the user, seems to have worked well):

About Prodigy: I want to know whether Prodigy supports Chinese NER the way it supports English. I found that if I apply Prodigy to Chinese the same way as to English, the results I get on port 8080 contain no Chinese sentences or words. It only works when I give Prodigy the Chinese dataset, annotate it by hand with the ner.manual recipe, and then use the ner.teach recipe to mark suggestions as right or wrong. Is there any other support that would make NER smarter for me?

We have several users who work with Chinese in Prodigy, so yes! We just don't have a pretrained Chinese NER model for spaCy, so you need to do that part yourself.

If you're working with Chinese, you need to use a Chinese base model with a Chinese tokenizer. See the links posted above for an example of how to save out a base model and tokenizer for a different language. You can use spacy.blank("zh"), save it to disk and use that as the base model for annotation. Maybe you can also use some match patterns to make the initial annotation faster.
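As a rough illustration of the match patterns idea, here is a minimal sketch that writes a patterns file in JSONL form, one `{"label": ..., "pattern": ...}` entry per line. The labels, example terms and file name are made up for illustration, and whether a plain string pattern matches your text depends on how the Chinese tokenizer splits it, so treat this as a starting point rather than a recommended setup.

```python
import json
from pathlib import Path

# Hypothetical seed terms, grouped by entity label (replace with your own).
TERMS = {
    "GPE": ["北京", "上海"],
    "ORG": ["阿里巴巴"],
}


def build_patterns(terms):
    """Turn {label: [phrase, ...]} into a flat list of pattern dicts."""
    return [
        {"label": label, "pattern": phrase}
        for label, phrases in terms.items()
        for phrase in phrases
    ]


def write_jsonl(path, records):
    """Write one JSON object per line, keeping Chinese characters readable."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


patterns = build_patterns(TERMS)

if __name__ == "__main__":
    # File name is an arbitrary choice for this example.
    write_jsonl("zh_patterns.jsonl", patterns)
```

Recipes such as ner.manual accept a patterns file through a `--patterns` argument, so a file like this could be used next to the blank Chinese model you saved to disk (for example with `nlp.to_disk(...)`). Check the recipe's help output for the exact arguments in your Prodigy version.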