Could Prodigy work for detecting code-switching in text?

I have a problem of detecting code-switched spans in text. These are often named entities, but could also be phrases. I've considered a few approaches: one would be to simply use the default spaCy NER model followed by text classification, while another would be to train an NER model with languages as the labels. In either case, I wonder how effective the neural networks in Prodigy are at capturing subword information, for example the presence and order of certain characters, which is highly language-discriminative. I hope you have some thoughts, as I'd love to use this tool to speed up my work!


I think detecting whether code-switching is occurring in a sentence should be pretty easy as a text classification task, and the annotations would also be quick to create that way. Perhaps you could try that?

Hi Honnibal,

Thank you for your answer. I think using text classification to determine whether code-switching occurs in a sentence is a good idea as a first pass, but I still need to find exactly where in the sentence it occurs.

That's why I was thinking about treating it as an NER problem. What do you think about that idea? The languages I need to detect may be low-resource, and I expect many OOV tokens (or at least ones that don't have very meaningful vectors) to show up. Besides, detecting language is less a semantic issue than one of allowable letter sequences, though I definitely expect context to help.
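To illustrate the "allowable letter sequences" point: even a crude character trigram overlap can separate languages without any semantics. This is only a toy sketch — the sentences used to build the profiles and the `trigram_similarity` helper are made up for the example, not anything from spaCy or Prodigy:

```python
from collections import Counter

def char_trigrams(text):
    """Extract character trigrams, padding word boundaries with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def trigram_similarity(text, profile):
    """Fraction of the text's trigrams that also occur in a language profile."""
    grams = char_trigrams(text)
    if not grams:
        return 0.0
    shared = sum(count for gram, count in grams.items() if gram in profile)
    return shared / sum(grams.values())

# Hypothetical language "profiles" built from a handful of words each
english_profile = char_trigrams("the quick brown fox jumps over the lazy dog")
german_profile = char_trigrams("der schnelle braune fuchs springt über den faulen hund")

token = "springt"
scores = {
    "en": trigram_similarity(token, english_profile),
    "de": trigram_similarity(token, german_profile),
}
print(max(scores, key=scores.get))  # prints "de"
```

A real model would learn these character patterns with far more nuance, but this is the kind of signal I mean by letter sequences being discriminative.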

I suppose my question is whether Prodigy has a way of processing text at the character level for NER?

Thank you again for your time.

Prodigy does support subword annotation, but I don't think that should be necessary for your problem. You should be fine doing the annotation at the token level, right?

spaCy's models do have features that consider subword information, so I think you might be able to train an entity recognition model to do the code-switching task. It does make sense.
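For concreteness, here's one way the training data for that framing could look: ordinary NER examples where the entity label names the language of the code-switched span, using character offsets. The example sentences and label names are hypothetical, and the offsets format shown is the common `(start, end, label)` convention:

```python
# Hypothetical training examples: the entity label is the language of the
# code-switched span, given as (start, end, label) character offsets.
TRAIN_DATA = [
    (
        "I watched the show Forbrydelsen last night",
        {"entities": [(19, 31, "DANISH")]},
    ),
    (
        "She said gracias and walked away",
        {"entities": [(9, 16, "SPANISH")]},
    ),
]

# Sanity-check that the offsets actually cover the intended spans
for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        print(label, repr(text[start:end]))
```

Annotating this way also keeps the token-level workflow intact: the spans align to token boundaries, and the subword features do the character-level work inside the model.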

That's right, annotating at the token level should be fine. It's good to hear that subword information is considered by the models. I'm also glad to hear that you find the NER approach to my problem sensible.

I'll go ahead and try this out in the very near future, and will probably report back my results here just in case someone else is interested in this task or a similar one.

Thanks again for your time!


Just wanted to say that this worked out well for me. With relatively few training examples, my models were able to identify and discriminate between named entities in different languages, performing a combination of NER and language ID with a single model. Posting this in case someone is looking to solve a similar problem.



Hey Martin,

That's great to hear, thanks for letting us (and future readers) know! :slight_smile: