These sentences are from user inputs and may contain any language (e.g. CJK).
Language detection doesn't seem to be required: since NER uses the surrounding words to extract entities, it works just fine on Spanish and other similar languages, given enough training data.
But for CJK languages, the entire sentence is treated as one word, so no entities can be extracted.
For the same reason, it is also not possible to annotate training labels for these sentences.
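For example, here's roughly what I'm seeing (a minimal sketch, using a blank English pipeline just for illustration):

```python
import spacy

# A blank English pipeline: the tokenizer relies on whitespace and punctuation
# rules, so a CJK sentence with no spaces typically comes back as a single token.
nlp = spacy.blank("en")

doc = nlp("東京は日本の首都です")  # "Tokyo is the capital of Japan"
print([token.text for token in doc])  # -> ['東京は日本の首都です'], i.e. one "word"
```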
Of course, I could special-case these languages and preprocess them with a segmentation step, use dictionaries, etc. But that would be a lot of additional work, and it would have to be maintained separately (i.e. not automatically trained from data).
Is there a way to annotate individual characters in a sentence, so the model learns how to segment the sentence, regardless of language?
What are you using as the base model? CJK languages definitely require different tokenization, since a "word" is not defined as a whitespace-delimited unit. spaCy currently supports Chinese and Japanese via third-party libraries (see here), so you can use those language classes as the base model. See this thread for more details:
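For example, something like this should work (a rough sketch; it assumes the relevant third-party tokenizer, e.g. Jieba for Chinese, is installed, and the exact segmentation depends on that backend and your spaCy version):

```python
import spacy

# spaCy's Chinese language class delegates word segmentation to a third-party
# library (e.g. Jieba), which needs to be installed separately.
nlp = spacy.blank("zh")

doc = nlp("北京是中国的首都")  # "Beijing is the capital of China"
# The sentence comes back as word-level tokens instead of one long string,
# e.g. something like ['北京', '是', '中国', '的', '首都'].
print([token.text for token in doc])
```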
If you use a base model that supports tokenization for the given language, you'll be able to annotate the tokens accordingly. (This is also one of the reasons Prodigy always asks for a base model – it lets you supply language-specific or even your own custom tokenization rules.)
Btw, speaking of learning tokenization: in the upcoming version of spaCy, the parser will be able to learn to merge tokens, which will be very useful for training CJK models. The rule-based tokenizer can then simply split on characters, and the model will predict whether characters should be merged into one token. Depending on the data, this can improve accuracy by a lot.
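Just to illustrate what "merging characters into one token" means, here's a sketch using the existing retokenizer API (not the new learned component, just the merge operation it would produce):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Pretend the rule-based tokenizer has split the text into one token per character:
doc = Doc(nlp.vocab, words=list("北京很大"))  # ['北', '京', '很', '大']

# A trained model could then predict that the first two characters form one word,
# and applying that prediction is just a merge:
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

print([token.text for token in doc])  # -> ['北京', '很', '大']
```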
I used a blank model, and bootstrapped it with manual annotations.
I was hoping the model could be multilingual, and it did support all whitespace-delimited languages without problems.
If I have to specify a pre-trained model as the base model, does it still work with other languages?
If not, I think it may not scale well, as there are languages other than CJK that are not whitespace-delimited. That would mean a model for each of them, plus language-detection code (which is also not trained from data).
Honestly, it’s pretty normal that different languages need different models! The information required to process one language is usually quite separate from the information needed to process another. For languages which are similar and share a writing system, there can be some overlap, so there’s an active research area on cross-lingual models.
A small example of why separate models for different languages are a good thing: English and German are two closely related major languages, much more closely related than most other pairs you'd hope to work with. However, in English, capitalisation is a vital clue for NER, since only proper nouns and sentence-initial words are capitalised. This isn't so in German, which capitalises common nouns as well. This means that a combined English+German NER model is solving a much harder problem: it needs to learn that capitalisation matters in English, while in German it doesn't.