I'm happy to announce that we've got yet another Prodigy tutorial on our YouTube channel! This one is about bulk labelling: a technique that re-uses language models to let you interactively search for interesting clusters that you can label right away. You can watch the video here:
Great and insightful video (as always)! Given that I'm currently working hard on a problem that fits well into this context, I'd be quite interested in your opinion on the most promising strategy for it, and in particular whether bulk labeling could be a reasonable approach here or whether I'm already too deep in the weeds.
Problem Description: In short, my task involves texts from the business domain (essentially Q&A-like conversations). For each doc, I am interested in the share of the conversation that is devoted to topic X (I have a very specific topic in mind, so this is quite different from topic modeling). In the end, I would have a measure indicating the importance of topic X in a given conversation/doc. You could essentially reduce it to a binary classification task. Currently, I approach this problem in two ways, and I am still not 100% sure I picked the best route.
Strategy 1: I start with a set of “seed words” for which I have a strong prior that they are reflective of topic X. I then trained a FastText embedding model on the entire corpus and used it as a kind of semantic search tool. That is, I expanded the seed word list by repeatedly querying the embedding model for terms that occur in similar contexts as the seeds. In the end, I obtained a more extensive word list with which I could count the number of terms reflective of topic X per document (normalized by document length).
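To make this concrete, here is roughly what that expansion and counting step looks like with gensim's FastText. This is just a minimal sketch: the corpus, seed words, and hyperparameters below are placeholders, not my actual setup.

```python
# Sketch of Strategy 1: expand a seed list with a FastText model trained on the corpus,
# then score documents by the share of tokens that hit the expanded list.
from collections import Counter
from gensim.models import FastText

# Placeholder corpus: an iterable of tokenized sentences from your own data.
corpus = [
    ["quarterly", "revenue", "guidance", "was", "raised"],
    ["we", "discussed", "the", "new", "pricing", "model"],
    # ... your tokenized sentences ...
]
seeds = ["pricing", "revenue"]  # strong-prior terms for topic X (placeholders)

# Train FastText on the in-domain corpus.
model = FastText(sentences=corpus, vector_size=100, window=5, min_count=1, epochs=10)

# Expand the seed list by querying for nearest neighbours of each seed.
expanded = set(seeds)
for seed in seeds:
    if seed in model.wv:
        expanded.update(term for term, _ in model.wv.most_similar(seed, topn=10))

def topic_share(tokens):
    """Share of tokens in a document that belong to the expanded word list."""
    counts = Counter(tokens)
    hits = sum(counts[t] for t in expanded)
    return hits / max(len(tokens), 1)

print(topic_share(["the", "pricing", "discussion", "covered", "revenue", "targets"]))
```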
Strategy 2: I had 3 annotators label 10,000 randomly sampled sentences using Prodigy (i.e., assigning a label of 1 or 0 depending on whether the sentence was about topic X). I found that it is quite hard to clearly identify topic X given its rather ambiguous nature. I used the sentences and labels to train a DistilRoBERTa model while oversampling the sentences with high agreement among the annotators (topic X appears rather rarely, i.e., in 2-10% of the sentences). I could then apply the trained classifier to all 5 million sentences/paragraphs. (I was also thinking about using a service like DeepL to translate and back-translate the training data to obtain more variation, but haven't implemented this yet.)
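Roughly, the fine-tuning step looks like the sketch below with Hugging Face transformers/datasets. The data layout (text, label, annotator agreement), the oversampling factor, and the hyperparameters are simplified placeholders rather than my exact configuration.

```python
# Sketch of Strategy 2: fine-tune distilroberta-base on the labelled sentences,
# oversampling high-agreement positive examples to counter class imbalance.
import random

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Labelled data: text, majority label, and number of annotators who agreed (out of 3).
rows = [
    {"text": "We should discuss the upcoming regulation.", "label": 1, "agreement": 3},
    {"text": "Can you resend the invoice?", "label": 0, "agreement": 3},
    # ... 10,000 annotated sentences ...
]

# Oversample high-agreement positives (factor 3 is an arbitrary placeholder).
oversampled = rows + [r for r in rows if r["label"] == 1 and r["agreement"] == 3] * 3
random.shuffle(oversampled)

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained("distilroberta-base", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = Dataset.from_list(oversampled).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="topic-x-clf", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
)
trainer.train()
```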
I also thought about using the 10,000 hand-labelled sentences in conjunction with something like sentence transformers (akin to bulk labelling) to retrieve more sentences with similar content. In the end, my feeling is that the trained classifier would be more precise at identifying topic X than a pre-trained SentenceTransformer trained on some generic corpus (because the weights in my classifier are optimized for my topic of interest).
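Concretely, the retrieval idea would look something like this with the sentence-transformers library. The model name and the data are placeholders; in practice the 5 million sentences would be encoded and searched in batches.

```python
# Sketch: use positively labelled sentences as queries and pull similar
# sentences from the unlabelled pool with a pretrained SentenceTransformer.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice

labelled_positives = [
    "We spent most of the call on the new data-privacy regulation.",
    # ... sentences annotated as topic X ...
]
unlabelled_pool = [
    "Could you share the slide deck afterwards?",
    "The regulator asked for additional compliance reports.",
    # ... the unlabelled sentences/paragraphs ...
]

query_emb = model.encode(labelled_positives, convert_to_tensor=True)
pool_emb = model.encode(unlabelled_pool, convert_to_tensor=True)

# For each positive query, retrieve the top-k most similar unlabelled sentences.
hits = util.semantic_search(query_emb, pool_emb, top_k=5)
for query, results in zip(labelled_positives, hits):
    print(query)
    for hit in results:
        print(f"  {hit['score']:.2f}  {unlabelled_pool[hit['corpus_id']]}")
```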
I'd be super grateful for any opinion/perspective on my strategy.
Simon
There are a few things that popped into my mind as I read this.
- Pretrained language models might be trained on a dataset that doesn't resemble your domain. So it might make sense to just try out a few (see the sketch after this list)!
- Having a clear definition is also likely going to be the main concern at this point. Bulk labelling won't help you come up with a good definition of a topic if it is ambiguous. I might be able to help more if you're able to share more details about the topic.
- Bulk labelling is meant more as a starting point to help get annotations, not as a forever practice. Once you have a few annotations that you're content with, it might be more beneficial to use active learning instead.
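On the first point, a cheap way to try out a few pretrained models is to embed your already-labelled sentences with each candidate and cross-validate a simple linear classifier on top of the embeddings. The model names below are just examples, and the texts/labels stand in for your 10,000 annotations.

```python
# Sketch: compare a few pretrained embedding models on your own labelled data
# by checking how well a simple linear probe separates the 0/1 labels.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-ins for your annotated sentences and their 0/1 topic X labels.
texts = [
    "We spent most of the call on the new data-privacy regulation.",
    "The regulator asked for additional compliance reports.",
    "Legal flagged the contract clause about data retention.",
    "Could you resend the invoice from last month?",
    "Let's schedule the follow-up meeting for Tuesday.",
    "The slide deck still needs the updated revenue figures.",
]
labels = np.array([1, 1, 1, 0, 0, 0])

# Example candidates; swap in whatever pretrained models you want to compare.
candidates = ["all-MiniLM-L6-v2", "paraphrase-multilingual-MiniLM-L12-v2"]

for name in candidates:
    model = SentenceTransformer(name)
    X = model.encode(texts, batch_size=64)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels,
                             cv=3, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```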
Just to check, are you trying to collect the first batch of training data and seeking advice for that? Or are you seeking advice on how to model your problem?
Thanks for your perspective! There are a few hints here that I'll pursue further (e.g., language models with different training data distributions for robustness).
To provide some more context:
- The topic that I am concerned with is ultimately ambiguous (i.e., not as easy to identify as, say, cat vs. dog). Let's say the topic is "politics": there is no single clear definition of the concept which I could teach annotators. It's more that I can describe the topic broadly, provide diverse examples in a coding book, and provide various definitions with some overlap. Therefore, I am thinking about a technique that captures the concept broadly while accepting, but mitigating, the obvious risk of false positives and false negatives given the ambiguity of the concept.
- As you pointed out, I am more concerned with your second question, i.e., I am seeking advice on how to model the problem most effectively.
However, while writing this, I realise that it is probably not straightforward to give a clear recommendation for a problem such as mine...
My best advice is to have regular meetings with the folks who annotate, and to make sure the group of annotators is diverse. Once in a while, say once per week, you can sync to discuss hard-to-label examples.
You'll need to iterate, but this is a path towards annotation consensus.
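As a starting point for those syncs, you could compute pairwise agreement between your three annotators and surface the sentences they disagree on. The data layout below is an assumption: one 0/1 label per annotator per sentence.

```python
# Sketch: pairwise Cohen's kappa between annotators, plus a list of
# disputed sentences to bring to the weekly discussion.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Placeholder annotations: one 0/1 label per annotator per sentence.
annotations = {
    "anna":  [1, 0, 0, 1, 0],
    "ben":   [1, 0, 1, 1, 0],
    "carla": [0, 0, 1, 1, 0],
}
sentences = ["s1", "s2", "s3", "s4", "s5"]  # the corresponding texts

# Pairwise Cohen's kappa as a rough measure of agreement.
for a, b in combinations(annotations, 2):
    kappa = cohen_kappa_score(annotations[a], annotations[b])
    print(f"{a} vs {b}: kappa = {kappa:.2f}")

# Sentences without full agreement are good candidates for the weekly sync.
labels_per_sentence = zip(*annotations.values())
disputed = [s for s, labs in zip(sentences, labels_per_sentence) if len(set(labs)) > 1]
print("Discuss next sync:", disputed)
```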