NER on multilingual texts

Hi founders and community,

I have a text corpus (specifically emails) that contains text in multiple languages (e.g. English and Chinese). Two different languages may even appear within a single sentence. Even though we can manually annotate a substantial volume of examples, tokenization remains an issue because it differs significantly between these two languages.

My end goal is to extract useful entities regardless of language. For example, if a name appears in the corpus in both English and Chinese, I want both spans of text to be extracted. Is there an efficient way of performing NER in this scenario? Thanks!

In this case I think it would be easiest to have a single tokenizer that can segment both Chinese and English. Both Chinese segmenter options, jieba and pkuseg with the spacy_ontonotes model, seem to handle basic English spaces and punctuation fine.

You can create a base model that just contains a tokenizer like this:

cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
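Continuing from the snippet above, a quick sanity check on a made-up mixed-language sentence:

# both scripts should come out as sensible tokens
doc = nlp("请把这封 email 转发给 John Smith，谢谢。")
print([token.text for token in doc])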

This is the same tokenizer used in zh_core_web_sm. If it doesn't handle some cases correctly, you can add exceptions, fine-tune the model, or start from scratch and train your own model with spacy-pkuseg (the pkuseg toolkit for multi-domain Chinese word segmentation): https://github.com/explosion/spacy-pkuseg
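For small fixes, the pkuseg user dictionary is often the quickest route; a sketch with made-up domain terms:

# keep domain-specific terms as single tokens instead of letting pkuseg split them
nlp.tokenizer.pkuseg_update_user_dict(["电子邮件", "OntoNotes"])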

jieba will be a lot faster, but I had less success when I tried to customize it for a particular segmentation. You'll just have to evaluate them for your task.
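Switching between the two is just a config change, so it's cheap to try both; the jieba variant would look like this:

from spacy.lang.zh import Chinese

# jieba needs no separate initialization step
cfg = {"segmenter": "jieba"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})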

This does mean that your pipeline will have Chinese lexical attributes for things like like_num and stop_words. This shouldn't make any difference for a standard ner component, but you might have other components, like a Matcher, that use lexical attributes. To handle both languages there, you might need to implement a custom language, but if the only issue is segmentation, I'd give a base Chinese pipeline with jieba or pkuseg a try.
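To make the Matcher caveat concrete, here's a small sketch continuing with the pipeline above (hypothetical sentence): the Chinese like_num recognizes digits and Chinese numerals, but not English number words like "ten".

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# LIKE_NUM resolves through the Chinese lexical attributes here
matcher.add("NUMBERS", [[{"LIKE_NUM": True}]])
doc = nlp("我有 3 apples and ten oranges")
# "3" is matched; "ten" is not, since Chinese like_num doesn't know English number words
print([doc[start:end].text for _, start, end in matcher(doc)])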