I am pretty happy with the performance of my classifier in identifying clear positives/negatives. I am now looking at the "messy middle" of my classifications and trying to figure out the best next step.
What I did:
Generated a data set of 17k sentences with scores between 0.4 and 0.6
Annotated 1000 records
Re-trained using 6000 accepts/rejects (0.2 eval split)
Using this model, I did a spot check and feel like there wasn't much movement in some of these edge cases (even the ones that made it into the training set).
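For reference, pulling that 0.4 - 0.6 slice boils down to something like the sketch below (the model path is a placeholder; my_label is the label used in the train commands further down):

import spacy

# Load the current trained pipeline (placeholder path)
nlp = spacy.load("./my_trained_model")

def messy_middle(texts, label="my_label", low=0.4, high=0.6):
    # Keep only the sentences whose predicted score falls in the uncertain band
    for doc in nlp.pipe(texts):
        score = doc.cats.get(label, 0.0)
        if low <= score <= high:
            yield {"text": doc.text, "score": score}

uncertain = list(messy_middle(["A biomedical sentence ...", "Another sentence ..."]))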
Trying to figure out the next best move... my considerations:
Annotate more of the messy middle (0.4 - 0.6) data set (16k more)
Annotate soft-REJECTs (0.2 - 0.4) and try to get more REJECTs into the data set
Look at my ACCEPTs and REJECTs and IGNOREs to see if I made some errors
Try a different model (en_core_sci_scibert isn't working on my M1)
Think about the problem differently. (thus, this post)
I watched a presentation by Vincent on his doubtlab library and was curious if that could help here.
Since you already have a fair number of examples annotated, have you tried changing your config to use a heavier model? When you go to the spaCy docs, you can generate a config that's more aimed towards accuracy. The model that's trained with such a config could also help improve your metrics.
When you look at your "messy middle", would you say that these are easy to annotate? If they are hard to annotate they are likely hard to model too.
Doubtlab could help you find examples that have been mis-annotated, but it won't help you design a better annotation process. What label are you annotating towards?
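To make that concrete, a doubtlab pass over your existing annotations could look something like this sketch. It assumes you've exported the annotated texts with binary labels and wraps them in a simple scikit-learn pipeline as a proxy model; the variable names and data are placeholders.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from doubtlab.ensemble import DoubtEnsemble
from doubtlab.reason import ProbaReason, WrongPredictionReason

# Placeholder data: your exported texts with 1 = accept, 0 = reject
texts = ["first annotated sentence ...", "second annotated sentence ..."]
labels = np.array([1, 0])

# Any model with predict/predict_proba will do as a proxy
model = make_pipeline(CountVectorizer(), LogisticRegression()).fit(texts, labels)

ensemble = DoubtEnsemble(
    proba=ProbaReason(model=model),            # flags low-confidence predictions
    wrong=WrongPredictionReason(model=model),  # flags predictions that disagree with the label
)

# DataFrame with a column per reason; the examples that trigger reasons are worth re-checking
predicates = ensemble.get_predicates(texts, labels)
print(predicates.head())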
Looks like adding more training data will be useful given this curve:
% prodigy train-curve --textcat-multilabel my_label --base-model en_core_sci_lg --config ~/.prodigy/config.cfg
========================= Generating Prodigy config =========================
ℹ Using config from base model
✔ Generated training config
=========================== Train curve diagnostic ===========================
Training 4 times with 25%, 50%, 75%, 100% of the data
%        Score      textcat_multilabel
----     ------     ------
  0%     0.46       0.46
 25%     0.76 ▲     0.76 ▲
 50%     0.83 ▲     0.83 ▲
 75%     0.84 ▲     0.84 ▲
100%     0.86 ▲     0.86 ▲
✔ Accuracy improved in the last sample
As a rule of thumb, if accuracy increases in the last segment, this could
indicate that collecting more annotations of the same type will improve the
model further.
I ran this twice (once with and once without the "vectors" config value set) and got the same result.
The model I am using (en_core_sci_lg) is the largest I have been able to get to work on my M1. I will try tweaking the config to see if I can get the BERT-based SciSpacy model (en_core_sci_scibert) to work.
The "messy middle" are definitely the more challenging cases. Guess all signs point to more annotations is the right next step and 90% might be the best this can do, although I was hoping for closer to 95% (But maybe if my dataset is loaded with the "messy middle" my accuracy is actually higher than what it says above)
Another angle to explore perhaps is to consider that "more accuracy doesn't always imply more utility". It can be completely fine if a model fails sometimes, as long as the failure scenarios are well understood.
Here are some more good questions to try and answer:
When you have a look at the instances where the model fails, is there anything you can learn from them?
Are there more false positives or false negatives?
How often do the examples from the "messy middle" occur in real life?
Are certain mistakes more costly than others?
Might it make sense to tune the threshold of the classifier?
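On those last two questions, here is a small sketch of what the bookkeeping could look like, assuming you can export gold labels and model scores for your held-out set (the arrays below are dummy values):

import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_curve

# Placeholders: gold labels and predicted scores for the evaluation set
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.92, 0.55, 0.48, 0.81, 0.42, 0.13, 0.67, 0.58])

# False positives vs. false negatives at the default 0.5 cutoff
print(confusion_matrix(y_true, y_score >= 0.5))

# Sweep the threshold and pick the trade-off that matches the cost of each mistake
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")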
I'm mentioning this because I used to overfit my career on accuracy numbers, and I've noticed that a lot of meaningful progress can come from taking a step back and considering how the model will be used in the bigger picture. An ML model, in the end, is more like a cog in a system. And it's the system that we usually want to improve, not the cog.
This is just a "cog" in our augmented intelligence approach to NLP. We have a team of really smart folks curating biomedical text by hand and we use "NLP" tactics to improve and speed up this process.
Up until now, we have used a lot of rule-based approaches and basic NER models to get decent results. One pattern that I am excited about is using Seq2Seq models (specifically T5) to improve my data extraction techniques.
I tried feeding everything into T5 and letting it suss out what was what. I am now using Prodigy/spaCy to categorize sentences before I feed them into different, targeted T5 processors.
Basically, I am building a natural language (NL) to domain-specific language (DSL) engine. It has been super powerful thus far and is better suited to my use cases than other tools.
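The routing pattern is roughly the sketch below; the model path, category names, and T5 checkpoint names are placeholders, and in practice each category points at its own fine-tuned checkpoint.

import spacy
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

nlp = spacy.load("./my_trained_model")  # the spaCy sentence classifier (placeholder path)

# One targeted seq2seq model per sentence category (placeholder checkpoint names)
PROCESSORS = {
    "dosage": "my-org/t5-dosage-to-dsl",
    "outcome": "my-org/t5-outcome-to-dsl",
}

def to_dsl(sentence: str) -> str:
    # 1. Categorize the sentence with the spaCy classifier
    doc = nlp(sentence)
    category = max(doc.cats, key=doc.cats.get)
    # 2. Route it to the T5 processor trained for that category
    checkpoint = PROCESSORS[category]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    inputs = tokenizer(sentence, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)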
Feel free to contact me ian@ (my company's domain) if you'd like more info.
BTW - Great news: I figured out a way to pump my data up with 200K "rejects" that didn't require manual annotation. I loaded them and got really great results from my spot checks. I was worried that overloading with rejects would "drag down" my positives, but it seems to have done the opposite (it spread out the messy middle effectively).
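For anyone curious, loading those programmatic rejects came down to something like this sketch (the dataset name and the source of the texts are placeholders):

from prodigy import set_hashes
from prodigy.components.db import connect

# Placeholder for the 200K sentences gathered without manual review
reject_texts = ["an out-of-scope sentence ...", "another one ..."]

# Binary textcat annotations: the label plus answer="reject"
examples = [
    set_hashes({"text": t, "label": "my_label", "answer": "reject"})
    for t in reject_texts
]

db = connect()                          # uses the settings from prodigy.json
db.add_dataset("auto_rejects")          # create the target dataset
db.add_examples(examples, datasets=["auto_rejects"])

After that, prodigy train can pull from both the hand-annotated dataset and auto_rejects.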
Thanks Vincent for all your help and all your great YT and calmcode content!