Could Prodigy work for detecting code-switching in text?

I'm trying to detect code-switched spans in text. These are often named entities, but they can also be phrases. I've considered a few approaches: one would be to simply run the default spaCy NER model followed by text classification, another would be to train an NER model with languages as the labels. In any case, I wonder how effective the neural networks in Prodigy are at capturing subword information, for example the presence and order of certain characters, which is highly language-discriminative. Hope you have some thoughts, as I'd love to use this tool to speed up my work!

Martin

I think detecting whether code-switching is occurring in a sentence should be pretty easy as a classification task, and it would also be easy to create the annotations this way. Perhaps you could try that?
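For the annotation, something like this would get you a binary accept/reject stream over your sentences (the dataset, model and label names here are just placeholders):

```
prodigy textcat.teach code_switching en_core_web_sm texts.jsonl --label CODE_SWITCHED
```

Once you've collected a few hundred decisions, you can train a first classifier with textcat.batch-train to gauge how learnable the task is.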

Hi Honnibal,

Thank you for your answer. I think using text classification to determine whether code-switching occurs in a sentence is a good idea as a first pass, but I still need to find exactly where in the sentence it occurs.

That's why I was thinking about treating it as an NER problem. What do you think about that idea? The languages I need to detect may be low-resource, and I expect many OOV tokens (or at least tokens without very meaningful vectors) to show up. Besides, detecting language is less a semantic issue than one of allowable letter sequences, though I definitely expect context to help.
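To illustrate what I mean about letter sequences: even a simple character n-gram profile separates languages reasonably well. A toy sketch (the "corpus" and tokens here are made up for illustration, a real profile would come from proper corpora):

```python
from collections import Counter

def char_ngrams(text, n=3):
    # Character trigrams, padded so word boundaries are captured too.
    padded = " %s " % text.lower()
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def overlap(token, profile):
    # Fraction of the token's trigrams that the profile has seen.
    grams = list(char_ngrams(token))
    return sum(profile[g] > 0 for g in grams) / max(len(grams), 1)

# Toy English profile built from a tiny snippet.
english = char_ngrams("the quick brown fox jumps over the lazy dog")

print(overlap("quick", english))      # high: familiar English sequences
print(overlap("szczęście", english))  # low: un-English (Polish) sequences
```

Of course this ignores context entirely, which is why I'd like the model to handle it.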

I suppose my question is whether Prodigy has a way of processing text at the character level for NER?

Thank you again for your time.

Prodigy does support subword annotation, but I don't think that should be necessary for your problem. You should be fine doing the annotation at the token level, right?
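With ner.manual, for example, you'd highlight the code-switched spans by hand and assign the language as the label. Something like this (the dataset, model and label names are just placeholders):

```
prodigy ner.manual code_switch_spans en_core_web_sm texts.jsonl --label EN,FR,ES
```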

spaCy's models do have features that capture subword information (prefixes, suffixes and word shapes, for example), so I think you might be able to train an entity recognition model to do the code-switching task. The approach does make sense.
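Once you've collected span annotations, training follows the usual NER workflow, either with ner.batch-train or a small script. A minimal spaCy v2-style sketch, with made-up texts, offsets and language labels purely for illustration:

```python
import random
import spacy
from spacy.util import minibatch

# Illustrative training data: (text, {"entities": [(start, end, label)]})
TRAIN_DATA = [
    ("He greeted me with a cheerful bonjour mon ami before leaving.",
     {"entities": [(30, 45, "FR")]}),
    ("She wrote gracias amigo at the end of the email.",
     {"entities": [(10, 23, "ES")]}),
]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for epoch in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=2):
        texts, anns = zip(*batch)
        nlp.update(texts, anns, sgd=optimizer, losses=losses)
```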

That's right, annotating at the token level should be fine. It's good to hear that the models take subword information into account, and I'm glad you find the NER approach to my problem sensible.

I'll try this out in the near future and report my results back here, in case someone else is interested in this task or a similar one.

Thanks again for your time!

Martin

Just wanted to say that this worked out well for me. With relatively few training examples, my models were able to identify and discriminate between named entities in different languages, performing a combination of NER and language ID with a single model. Posting in case someone else is looking to solve a similar problem.

Martin


Hey Martin,

That's great to hear, thanks for letting us (and future readers) know! 🙂