I have a text corpus (specifically emails) that contains text from different language backgrounds (e.g. English and Chinese). It can even be the case that two different languages appear within a single sentence. Even though we can manually annotate a substantial volume of examples, this may still be a problem because tokenization differs significantly between these two languages.
My end goal is to extract useful entities regardless of language. For example, if a name appears in the corpus in both English and Chinese, I want both spans of text to be extracted. Is there an efficient way of performing NER in this scenario? Thanks!
In this case I think it would be easiest to have a single tokenizer that can segment both Chinese and English. Both Chinese segmenter options, jieba and pkuseg with the spacy_ontonotes model, seem to perform fine on basic English whitespace and punctuation.
jieba will be a lot faster, but I had less success when I tried to customize it for a particular segmentation. You'll just have to evaluate them for your task.
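Here's a minimal sketch of how you could set up both segmenters and compare them side by side, assuming spaCy v3 with the jieba and spacy-pkuseg packages installed. The sample_emails list is just a placeholder for text from your own corpus:

```python
from spacy.lang.zh import Chinese

# Blank Chinese pipeline with jieba segmentation (requires the jieba package)
nlp_jieba = Chinese.from_config({"nlp": {"tokenizer": {"segmenter": "jieba"}}})

# Blank Chinese pipeline with pkuseg segmentation (requires spacy-pkuseg);
# "spacy_ontonotes" is the pretrained pkuseg model distributed for spaCy
nlp_pkuseg = Chinese.from_config({"nlp": {"tokenizer": {"segmenter": "pkuseg"}}})
nlp_pkuseg.tokenizer.initialize(pkuseg_model="spacy_ontonotes")

# Placeholder mixed-language examples; swap in a sample of your own emails
sample_emails = [
    "Hi 张伟, can we move the meeting to 下周三?",
    "请把 invoice #4521 发给 finance team。",
]

# Print the token boundaries from both segmenters to eyeball the differences
for text in sample_emails:
    print(text)
    print("  jieba: ", [t.text for t in nlp_jieba(text)])
    print("  pkuseg:", [t.text for t in nlp_pkuseg(text)])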
This does mean that your pipeline will have Chinese lexical attributes for things like like_num and stop_words. This shouldn't make any difference for a standard ner component, but you might have other components, like a Matcher, that rely on lexical attributes. To handle lexical attributes for both languages, you might need to implement a custom language, but if the only issue is segmentation, I'd give a base Chinese pipeline with jieba or pkuseg a try.
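To illustrate the lexical-attribute point, here's a small sketch of a Matcher pattern using LIKE_NUM. The attribute values come from the Chinese language data, so digits and Chinese numerals should match, but English number words like "ten" may not; the example text is made up, and the exact behaviour is worth verifying against your spaCy version:

```python
from spacy.lang.zh import Chinese
from spacy.matcher import Matcher

# A jieba-segmented Chinese pipeline, as in the earlier snippet
nlp = Chinese.from_config({"nlp": {"tokenizer": {"segmenter": "jieba"}}})

# LIKE_NUM is populated from the Chinese lexical attributes, not English ones
matcher = Matcher(nlp.vocab)
matcher.add("NUMBER", [[{"LIKE_NUM": True}]])

doc = nlp("我们收到了 10 份 report，还有 三 份在路上，plus ten more")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```

If patterns like this need to treat English and Chinese number words or stop words the same way, that's the point where a custom language with merged lexical attributes starts to pay off.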