It's quite possible that some of our default settings are suboptimal for large-class problems, because the architectures in Prodigy were mostly optimised for datasets with fewer categories. That said, I'd usually expect to match the performance of the FastText textcat models, because our model architecture should be able to extract the same information they're extracting.
I think the most likely problem here is that the model isn't being set up to predict mutually exclusive classes, which is why you've had to generate those negative examples. If you're training the model with FastText, I'm assuming your data is such that only one label is correct per example. So the important thing is to make sure spaCy is set up with that knowledge.
I would go ahead and use spaCy directly, rather than using Prodigy's `textcat.batch-train`, simply so that you have one less layer of software. It also means you'll be training the model with open-source tooling, which is always going to be preferable to having your automation depend on a proprietary tool (even when the proprietary tool is ours -- I couldn't give the opposite advice with a straight face).
You should be able to use the example script here: https://github.com/explosion/spaCy/blob/master/examples/training/train_textcat.py
Two things are important:

- Make sure you're passing the `"exclusive_classes": True` setting.
- Make sure you're setting up your `cats` dict so that one label is 1.0, and all of the other labels are provided as 0.0. In other words, you need a "dense" format, with no missing values.
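As a concrete illustration of the second point (the helper name and labels here are made up for the example, not from your data), building the dense `cats` dict for a single gold label might look like:

```python
def make_cats(gold_label, all_labels):
    """Build a dense "cats" dict: the gold label gets 1.0, and every
    other label is explicitly present with 0.0 -- no missing values."""
    return {label: 1.0 if label == gold_label else 0.0
            for label in all_labels}

labels = ["BILLING", "SHIPPING", "RETURNS"]
print(make_cats("SHIPPING", labels))
# {'BILLING': 0.0, 'SHIPPING': 1.0, 'RETURNS': 0.0}
```

Since only one label is correct per example in your data, every `cats` dict should have exactly one 1.0 entry.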
@adriane has been working on the usability of the textcat component to make these things easier, and to make sure the process is less error-prone. But you should already be able to see useful results quite quickly.
One outcome of the experiments Adriane has been running is that the `"architecture": "bow"` setting often performs very well. I would be sure to try that out first, especially in your early experiments while you're trying to get things set up correctly. It will run far faster than `"simple_cnn"`, which should speed up your process of getting everything correct.
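To put the two settings side by side, here's a sketch of the config dicts you'd pass when creating the textcat pipe (as in the spaCy v2 API the example script above uses, e.g. `nlp.create_pipe("textcat", config=...)`; adjust if you're on a different version):

```python
# Fast bag-of-words model: a good first choice while you're still
# debugging the data setup, since it trains far faster.
bow_config = {"exclusive_classes": True, "architecture": "bow"}

# The slower CNN model, worth trying once everything trains correctly.
cnn_config = {"exclusive_classes": True, "architecture": "simple_cnn"}
```

The key point is that `"exclusive_classes": True` stays set in both cases; only the architecture changes between experiments.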