I have fine-tuned a Norwegian BERT model (NorBERT) on a large set of emails (more than 17,000) that I would like to sort into mutually exclusive categories. The model performs fairly well, but I know that some of the email labels in my training and test data are incorrect, since they were generated by regexes and rules in Python. So my questions are:
- Can I use Prodigy to improve my labeling without having to go through all the wrongly predicted emails manually?
- Can I somehow combine the BERT model with spaCy to try to improve the model's performance further?
I think there is potential to use Prodigy to improve my workflow, so I would really appreciate some input on this.
Hi! If you already know that some of your training/test data is incorrect, that's probably the place to start, beginning with the evaluation data if you haven't done so already. It's worth going through the evaluation data manually to make sure it's 100% correct: otherwise you can't evaluate reliably, and it'll be difficult to tell whether any changes you make actually improve performance, because the evaluation itself is skewed.
You can do this in Prodigy by using `textcat.manual` with pre-labelled examples in the expected JSON format and correcting the mistakes, which should hopefully be fairly quick.
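For example, a small script along these lines could convert your regex-labelled emails into pre-labelled Prodigy tasks. The label names, texts and file names are all made up for illustration, so adjust them to your data (and double-check the expected task fields against the Prodigy docs):

```python
import srsly

def to_task(text, regex_label):
    # Pre-select the regex-assigned label so annotation is mostly
    # just accepting or correcting what's already there
    return {"text": text, "accept": [regex_label]}

# (text, label) pairs produced by your existing regexes/rules
examples = [
    ("Jeg vil melde en skade på bilen min.", "CLAIM_REPORT"),       # "I want to report damage to my car."
    ("Dekker forsikringen vannskader i kjelleren?", "CLAIM_QUESTION"),  # "Does the insurance cover water damage in the basement?"
]
srsly.write_jsonl("emails_prelabelled.jsonl", (to_task(t, l) for t, l in examples))
```

You could then annotate with something like `prodigy textcat.manual email_eval emails_prelabelled.jsonl --label CLAIM_REPORT,CLAIM_QUESTION,OTHER --exclusive`, where the `--exclusive` flag makes the categories mutually exclusive (the dataset and label names here are placeholders).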
Once you have a stable evaluation, you could either start by running the model over unseen examples and doing example selection to find the best texts to annotate, or you can focus on improving the original dataset you have.
Maybe you can even replace your auto-generated regex-based training data with a smaller and more "curated" dataset that gives you better performance overall – that should definitely be doable, especially if you're already starting out with transformer embeddings. (For instance, maybe you only need a couple of hundred really good examples to get the same accuracy as before, which is more maintainable as well!)
If you're annotating previously unseen examples with your model in the loop, you could focus on the most "problematic" examples first, e.g. those with the most uncertain scores. If you can get an accuracy breakdown per label, you could also see if there are any labels the model performs less well on and focus on correcting those first.
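As a rough sketch of that first idea (assuming you've trained and saved a spaCy textcat pipeline; the path, texts and cutoff are placeholders):

```python
import spacy

nlp = spacy.load("./norbert_textcat_model")  # placeholder path to your pipeline

def margin(doc):
    # With mutually exclusive categories, a small gap between the top
    # two scores means the model is torn between those two labels
    top_two = sorted(doc.cats.values(), reverse=True)[:2]
    return top_two[0] - top_two[1]

texts = ["..."]  # your stream of unseen emails
most_uncertain = sorted(nlp.pipe(texts), key=margin)[:500]  # annotate these first
```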
The first thing to try out (if you haven't done that already) would be to generate a training config for a transformer-based pipeline with a `textcat` component, initialised with the NorBERT embeddings, then train that on your existing data and see how it goes. This model can also be loaded into the built-in Prodigy workflows like `textcat.teach`, so it's easy to collect more annotations and improve it further.
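For instance, running `python -m spacy init config config.cfg --lang nb --pipeline textcat --gpu` gives you a transformer-based starting point, and you can then point the transformer at the pretrained weights in the generated config. The model name below is a guess at how NorBERT is published on the Hugging Face hub, so check the actual model ID, and the architecture version depends on your spacy-transformers version:

```
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "ltg/norbert"
```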
Another idea, since you mentioned that you already have Python rules and regex: even if these rules aren't perfect, it'd be interesting to perform some error analysis and see which rules are reliable and which aren't. You can do this in Prodigy by pre-labelling your examples with your rules and clicking accept/reject to record how often you agree with the rules. If some of the rules yield very high accuracy, you could incorporate them into your spaCy pipeline using a custom component to boost the model's performance for cases where you already know the answer very reliably.
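As a minimal sketch of that last idea (the pattern, label name and model path are all made up for illustration), a custom component placed after `textcat` could override the model's scores whenever a trusted rule fires:

```python
import re
import spacy
from spacy.language import Language

# Toy high-precision rule: "skademelding" is roughly "claim report".
# Swap in whichever of your regexes turned out to be reliable.
CLAIM_REPORT_RE = re.compile(r"\bskademelding\b", re.IGNORECASE)

@Language.component("rule_override")
def rule_override(doc):
    if CLAIM_REPORT_RE.search(doc.text):
        # Zero out the model's scores and trust the rule instead
        doc.cats = {label: 0.0 for label in doc.cats}
        doc.cats["CLAIM_REPORT"] = 1.0
    return doc

nlp = spacy.load("./norbert_textcat_model")  # placeholder trained pipeline
nlp.add_pipe("rule_override", after="textcat")
```

Because the component runs after `textcat`, it only touches the cases your rule matches and leaves everything else to the model.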
Thanks a lot, Ines!
I will test out `textcat` with the NorBERT embeddings. It will be exciting to see how it performs compared to the fine-tuned BERT model. I will then use `textcat.manual` to review around 1,000 wrong predictions, which is not my favourite task.
I think I need a large volume of emails, since the variation within a single category can be very large. I work for an insurance company, and you wouldn't believe how many ways there are to report a claim.
And questions about claim conditions should be classified differently from plain claim reports, so it is sometimes a very close call which category an email belongs in.
This is my first "real" project in Prodigy, so it would be great to make it work!