My task is NER on short sentences.
These sentences come from user input and may be in any language, including CJK languages.
Language detection does not seem to be required: NER relies on the surrounding words to extract entities, so it works fine on Spanish and similar languages given enough training data.
But for CJK text written without spaces between words, the entire sentence is treated as one word, so no entity can be extracted.
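A small illustration of the problem, assuming the tokenizer simply splits on whitespace (the example sentences are made up):

```python
# Assumption: tokens are produced by splitting on whitespace.
spanish = "Vivo en Madrid desde 2010"
chinese = "我住在北京"  # "I live in Beijing"

spanish_tokens = spanish.split()
chinese_tokens = chinese.split()

print(spanish_tokens)  # five separate tokens, "Madrid" can be labeled
print(chinese_tokens)  # a single token holding the whole sentence
```

With only one token, there is no way to mark "北京" as an entity on its own.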
For the same reason, it is also not possible to annotate training labels for these sentences.
Of course I can special-case these languages and preprocess them with a segmentation step, dictionaries, etc. But that would be a lot of additional work, and it would have to be maintained separately (i.e. not trained automatically from data).
Is there a way to annotate individual characters in a sentence, so the model learns how to segment the sentence, regardless of language?
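To make the question concrete, here is a minimal sketch of the kind of character-level annotation I have in mind. It assumes entities are given as character offsets and uses a BIO scheme with one tag per character. The helper `char_bio_tags` and the `LOC` label are hypothetical names for illustration, not part of any library.

```python
from typing import List, Tuple

def char_bio_tags(text: str, spans: List[Tuple[int, int, str]]) -> List[Tuple[str, str]]:
    """Turn (start, end, label) character spans into one BIO tag per character.

    `end` is exclusive. Characters outside every span get the tag "O".
    """
    tags = ["O"] * len(text)
    for start, end, label in spans:
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span ({start}, {end}) is outside the text")
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters of the entity
    return list(zip(text, tags))

# The same annotation format works with or without spaces between words.
chinese = char_bio_tags("我住在北京", [(3, 5, "LOC")])
english = char_bio_tags("I live in Paris", [(10, 15, "LOC")])

for char, tag in chinese:
    print(char, tag)
```

A sequence model trained on such labels would see characters rather than words as its units, so the annotation itself does not depend on a language-specific segmentation step.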