NER tagging in a non-alphabetic language

Hi, I've been trying Prodigy this afternoon on an NER annotation task in Chinese. I find it a bit confusing and haven't figured out how the token and span boundaries work yet.

In alphabetic languages like English, the smallest unit of a sentence is the letter, and words are naturally separated by spaces. Tokenizing English is therefore different from tokenizing Chinese, which has no spaces between words.

I'm trying to use the rel.manual recipe for joint NER and relation extraction annotation. The English annotation works fine, but for Chinese texts I can't figure out the span/token configuration that makes the pre-annotated entity boundaries display correctly. I tried making each Chinese character its own token, and I tried setting the boundary to end = start and to end = start + 1, while keeping the token end inclusive in the span's token settings. The entities never display properly, and whenever I toggle on the spans to annotate them, some tokens disappear as if the token indices were messed up.

I suspect the source code assumes spaces between tokens, as in alphabetic languages, or that this has something to do with tokenization in spaCy. Could that be why it doesn't work in Chinese?

I've also noticed that the project info at the top left of the page says the language is English. Is there a language parameter I can set to Chinese to make it work?

Many thanks.

Hi @psychedelicactus, welcome to Prodigy!

The tokenization in the Prodigy UI depends on the spaCy model you pass to the recipe. If you pass an English model, like en_core_web_sm, it will follow English tokenization rules. You can try passing one of the Chinese models and see if it helps.
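If you'd rather keep supplying your own character-level tokens, here is a minimal sketch of what a pre-tokenized task could look like. It assumes Prodigy's usual task format: character offsets where "end" is exclusive, a "token_end" index that is inclusive, and a "ws" flag on each token saying whether whitespace follows it. The helper name make_char_task and the example sentence are made up for illustration, and the sketch assumes the text contains no whitespace. Leaving out "ws" (or setting it to true) may be one reason the rendered tokens look shifted, so it is set to false explicitly here.

```python
import json


def make_char_task(text, entities):
    """Build a task dict with one token per character (hypothetical helper).

    entities: list of (char_start, char_end, label), with char_end exclusive.
    """
    tokens = []
    for i, ch in enumerate(text):
        tokens.append({
            "text": ch,
            "start": i,        # character offset where the token starts
            "end": i + 1,      # exclusive character offset
            "id": i,           # token index
            "ws": False,       # no whitespace follows a Chinese character
        })

    spans = []
    for start, end, label in entities:
        spans.append({
            "start": start,          # character offset, inclusive
            "end": end,              # character offset, exclusive
            "token_start": start,    # one token per character, so indices match
            "token_end": end - 1,    # assumed to be an inclusive token index
            "label": label,
        })

    return {"text": text, "tokens": tokens, "spans": spans}


# Example: mark the two characters of "北京" as a single entity.
task = make_char_task("我住在北京", [(3, 5, "GPE")])
print(json.dumps(task, ensure_ascii=False))
```

With one token per character, a span's character offsets and token indices line up, so the only difference is that "end" points one past the last character while "token_end" points at the last token itself.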