Hi! If you already know that some of your training/test data is incorrect, this might be a good place to start – at least with the evaluation data, if you haven't done so already. It'd probably be good to go over the evaluation data manually and make sure it's 100% correct, because otherwise you can't evaluate reliably, and it'll be difficult to tell whether any changes you make actually improve performance if the evaluation itself is skewed.
You can do this in Prodigy by using `textcat.manual` with pre-labelled examples in the expected JSON format and correcting the mistakes, which should hopefully be fairly quick.
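For example, assuming your categories are mutually exclusive (the dataset name, file name and labels below are just placeholders for your own scheme), the pre-labelled input could look something like this, with `"accept"` pre-selecting the label in the choice interface:

```
{"text": "Some example text", "accept": ["LABEL_A"]}
{"text": "Another example text", "accept": ["LABEL_B"]}
```

```
prodigy textcat.manual eval_set ./eval_data.jsonl --label LABEL_A,LABEL_B --exclusive
```

You'd then click through the examples, fix the pre-selected label wherever it's wrong and keep the result as your dedicated evaluation set.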
Once you have a stable evaluation set, you could either start by running the model over unseen examples and doing example selection to find the most useful texts to annotate, or focus on improving the original dataset you have.
Maybe you can even replace your auto-generated regex-based training data with a smaller and more "curated" dataset that gives you better performance overall – that should definitely be doable, especially if you're already starting out with transformer embeddings. (For instance, maybe you only need a couple of hundred really good examples to get the same accuracy as before, which is more maintainable as well!)
If you're annotating previously unseen examples with your model in the loop, you could focus on the most "problematic" examples first, e.g. those with the most uncertain scores. If you can get an accuracy breakdown per label, you could also see if there are any labels the model performs less well on and focus on correcting those first.
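Here's a rough sketch of how you could get that per-label breakdown directly in spaCy, assuming you've exported your evaluation data as a `.spacy` file – the paths are placeholders, and I think the per-label scores end up under `cats_f_per_type`:

```python
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

# placeholder paths – point these at your trained pipeline and held-out eval data
nlp = spacy.load("./output/model-best")
gold_docs = DocBin().from_disk("./corpus/eval.spacy").get_docs(nlp.vocab)

# re-predict over the raw texts and compare the predictions to the gold labels
examples = [Example(nlp.make_doc(doc.text), doc) for doc in gold_docs]
scores = nlp.evaluate(examples)

# per-label precision/recall/F-score for the text classifier
for label, metrics in scores["cats_f_per_type"].items():
    print(label, metrics)
```

The `spacy evaluate` command on the CLI prints a similar per-label table, so that might be even quicker.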
The first thing to try out (if you haven't done that already) would be to generate a training config for a transformer-based pipeline with a `textcat` component initialised with the `norbert` embeddings, train that on your existing data and see how you go. This model can also be loaded into the built-in Prodigy workflows like `ner.correct` and `ner.teach`, so it's easy to collect more annotations and improve it further.
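Roughly, that workflow could look like this – `spacy init config` gives you a transformer-based starting point, and you then swap in the NorBERT weights. (The language code assumes Bokmål, and the model name is just my guess at the Hugging Face identifier, so double-check which NorBERT version you're actually using.)

```
python -m spacy init config config.cfg --lang nb --pipeline textcat --gpu
```

Then in the generated `config.cfg`:

```
[components.transformer.model]
# assumption: swap in the NorBERT model name you're using from the Hugging Face hub
name = "ltg/norbert"
```

```
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./eval.spacy --gpu-id 0
```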
Another idea, since you mentioned that you already have Python rules and regex: even if these rules aren't perfect, it'd be interesting to perform some error analysis and see which rules are reliable and which aren't. You can do this in Prodigy by pre-labelling your examples with your rules and clicking accept/reject to record how often you agree with the rules. If some of the rules yield very high accuracy, you could incorporate them into your spaCy pipeline using a custom component to boost the model's performance for cases where you already know the answer very reliably.
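As a very rough sketch of that last part – the rule, label and path below are made up, so plug in your own – a tiny custom component that runs after the trained textcat component and overrides its score whenever one of your trusted rules fires could look like this:

```python
import re
import spacy
from spacy.language import Language

# hypothetical rules – replace with the regexes you've found to be highly reliable
RELIABLE_RULES = {
    "SOME_LABEL": re.compile(r"\bsome very reliable pattern\b", re.IGNORECASE),
}

@Language.component("rule_override")
def rule_override(doc):
    # if a trusted rule matches, overwrite the model's score for that label
    for label, pattern in RELIABLE_RULES.items():
        if pattern.search(doc.text):
            doc.cats[label] = 1.0
    return doc

# placeholder path to your trained pipeline
nlp = spacy.load("./output/model-best")
nlp.add_pipe("rule_override", after="textcat")
```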