When I used the Prodigy train textcat recipe, I got a very good model with an F1 of 0.937! My dataset has 700+ examples. That's great. However, when I trained a spaCy text classification model on a similar dataset, I got a much worse F1 score of 0.41. Why is the performance so different?
For my training, I copied the textcat training sample code from the spaCy website: https://spacy.io/usage/training. When I trained the model, I included a data preprocessing step. It removes stop words, replaces tokens with their lemmas, removes punctuation, removes numbers, and converts all letters to lowercase. My question is: why is my model's performance so much worse?
Is it because my preprocessing doesn't work with the spaCy textcat algorithm?
Is it because Prodigy train has some optimizations? If so, what are they?
Are you applying the preprocessing both when you train with spaCy and when you train with Prodigy, or just with spaCy? If you're only preprocessing when you're training with spaCy, it's definitely possible that this has a big impact. Removing stop words and punctuation is a pretty significant change, and those words may still hold relevant information. So you typically don't want to do that type of preprocessing, and should just train your classifier on the original text.
However, augmenting your data with preprocessed examples is different. I probably still wouldn't recommend removing stop words, but lowercasing all texts and then adding them to the existing training data could be a good strategy to make the model less case sensitive.
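To illustrate the difference between replacing and augmenting, here is a minimal sketch in plain Python. The helper name augment_with_lowercase, the (text, label) tuple format and the example labels are assumptions made for illustration and are not part of spaCy or Prodigy. The original examples are kept as they are, and lowercased copies are only added on top.

```python
def augment_with_lowercase(examples):
    """Return the original examples plus lowercased copies.

    `examples` is assumed to be a list of (text, label) tuples.
    Copies that would duplicate an existing text are skipped.
    """
    augmented = list(examples)
    seen = {text for text, _ in examples}
    for text, label in examples:
        lowered = text.lower()
        if lowered not in seen:
            augmented.append((lowered, label))
            seen.add(lowered)
    return augmented


# Hypothetical toy data, only to show the shape of the result.
train = [("Refund NOT received", "COMPLAINT"), ("thanks, all good", "OTHER")]
augmented = augment_with_lowercase(train)
for text, label in augmented:
    print(label, "|", text)
```

The second example is already lowercase, so only one copy is added and the original casing stays available to the model.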
Prodigy's train recipe is a pretty thin wrapper around spaCy's training API and it doesn't do anything special; the main thing it adds is logic to merge and convert datasets created in Prodigy. So the results should be consistent. Just make sure you're using the same settings for things like mutually exclusive categories. (If your categories are exclusive and you don't set that in Prodigy, but do set that when you train with spaCy, that can potentially make a big difference, too.)
Yes, my spaCy training used preprocessed data, but the Prodigy training had no data preprocessing.
Then I tried spaCy training without preprocessing. I got precision 1.00, recall 0.33 and an F1 score of 0.5.
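As a quick sanity check on those numbers, F1 is the harmonic mean of precision and recall, so a low recall pulls the score down even when precision is perfect. A minimal sketch (the function name f1_score is just an illustrative helper here, not a spaCy or Prodigy call):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(1.00, 0.33)
print(round(f1, 2))  # 0.5
```

In other words, the model made no false positive predictions on the evaluation set but found only about a third of the positive cases.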
I didn't mention that my data is imbalanced. I have about 39 positive cases (5%) and 721 negative cases. When I train in spaCy, I use a stratified split to ensure that the evaluation set contains the same distribution of positive cases.
Does Prodigy keep the data distribution in the eval set during training? If not, having only negative cases in the eval set could explain why Prodigy train performs so much better.
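For reference, a stratified split along these lines can be sketched in plain Python. This is not the code from the thread: the function name stratified_split, the 20% evaluation fraction and the synthetic data are assumptions chosen to match the 39 / 721 class counts mentioned above.

```python
import random
from collections import defaultdict


def stratified_split(examples, eval_fraction=0.2, seed=0):
    """Split (text, label) tuples so each label keeps roughly the same
    share in the training and evaluation sets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    train, dev = [], []
    for items in by_label.values():
        rng.shuffle(items)
        # Hold back at least one example per label for evaluation.
        n_dev = max(1, round(len(items) * eval_fraction))
        dev.extend(items[:n_dev])
        train.extend(items[n_dev:])

    rng.shuffle(train)
    rng.shuffle(dev)
    return train, dev


# Synthetic stand-in data: 39 positive and 721 negative examples.
data = [(f"pos {i}", 1) for i in range(39)] + [(f"neg {i}", 0) for i in range(721)]
train, dev = stratified_split(data)
n_pos_dev = sum(1 for _, label in dev if label == 1)
print(len(train), len(dev), n_pos_dev)  # 608 152 8
```

With these assumptions the evaluation set always contains 8 of the 39 positive examples, regardless of the random seed.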
Yes, I think your analysis makes a lot of sense. If you use both preprocessing and a different method for selecting the evaluation set, the results can end up very different.
The default eval split in Prodigy will just hold back a certain percentage of the examples after shuffling. So if your dataset is very imbalanced, you could end up with only 1 positive example in the evaluation set, and if your model learns to never predict that category, that can easily give you a very high accuracy of over 90%.
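To make that effect concrete, here is a small simulation in plain Python. It does not call Prodigy; the 20% hold-back, the seed and the helper name always_negative_accuracy are assumptions for illustration. It uses the class counts from the thread (39 positive, 721 negative) and scores a "model" that never predicts the positive category.

```python
import random


def always_negative_accuracy(labels):
    """Accuracy of a classifier that predicts the negative class (0) for everything."""
    return sum(1 for label in labels if label == 0) / len(labels)


labels = [1] * 39 + [0] * 721

# Shuffle, then hold back a fixed share of the examples (no stratification).
rng = random.Random(0)
rng.shuffle(labels)
n_eval = int(len(labels) * 0.2)
eval_labels = labels[:n_eval]

print("positives in eval set:", sum(eval_labels))
print("always-negative accuracy on eval set:", round(always_negative_accuracy(eval_labels), 3))

# On the full dataset the same trivial classifier scores 721 / 760.
overall = always_negative_accuracy(labels)
print("always-negative accuracy overall:", round(overall, 3))  # 0.949
```

The number of positives that land in the evaluation set changes with the shuffle, which is one reason scores from a random split on data like this can be hard to compare across runs.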
So once you get serious about evaluation, you probably want to use a dedicated evaluation set. And maybe try and skip the preprocessing when training with spaCy, or at least reduce it and don't remove words (or only add to the existing data).