So, you'd first need to create an initial model. Start by making sure your annotations are in the right format for spaCy or Prodigy.
Here's a post where we go through the data format:
You'll need to decide whether you want your classes to be mutually exclusive (`textcat`) or not (`textcat_multilabel`), which can affect how the data is formatted. See that post for more details.
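As a quick sketch (the dataset name, file name, and labels below are just placeholders), this choice also shows up at annotation time: `textcat.manual` treats labels as non-exclusive by default, and you can pass `--exclusive` if only one label can apply per example:

```bash
# Non-exclusive (multilabel) annotation: several labels can be accepted per text
prodigy textcat.manual my_dataset ./news.jsonl --label SPORTS,POLITICS,TECH

# Mutually exclusive annotation: exactly one label per text
prodigy textcat.manual my_dataset ./news.jsonl --label SPORTS,POLITICS,TECH --exclusive
```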
Then, once your annotations are in the Prodigy DB, use `data-to-spacy` to export the annotations and train with `spacy train`. You can train with `prodigy train` instead -- it has the advantage of being quicker to start, but it's harder to reconfigure down the road since it hides the config file, so I tend to recommend learning `data-to-spacy`/`spacy train` early on.
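As a rough sketch (dataset name, output paths, and model directory are placeholders), the export-then-train workflow looks something like this:

```bash
# Export annotations from the Prodigy DB into a spaCy corpus + config
prodigy data-to-spacy ./corpus --textcat my_textcat_dataset

# Train with spaCy directly, using the exported config and .spacy files
python -m spacy train ./corpus/config.cfg \
  --paths.train ./corpus/train.spacy \
  --paths.dev ./corpus/dev.spacy \
  --output ./model

# Or the quicker-to-start alternative (Prodigy manages the config for you)
prodigy train ./model --textcat my_textcat_dataset
```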
Also, you may want to create a dedicated holdout (evaluation) dataset very early on. This will make your experiments down the road much easier to read, since your evaluation dataset stays the same. If you don't specify a dedicated holdout dataset, Prodigy will create a random partition for evaluation. However, this partition can change each time, so if you rerun, you may get different results simply due to a new holdout (evaluation) dataset. Be sure to use the `eval:` prefix with either `data-to-spacy` or `prodigy train`, e.g., `--textcat dataset,eval:eval_dataset`.
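Concretely (again with placeholder dataset names), that looks like:

```bash
# "my_eval_dataset" is used only for evaluation; "my_textcat_dataset" for training
prodigy data-to-spacy ./corpus --textcat my_textcat_dataset,eval:my_eval_dataset

# Same idea with prodigy train
prodigy train ./model --textcat my_textcat_dataset,eval:my_eval_dataset
```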
Sort of. Here's a post (see the slides, which cover NER, but the same idea applies to `textcat`) that provides some detail. Essentially, updating is done for the known (binary) labels and the other labels are treated as missing.
What's worth mentioning is that this approach was designed for a reasonable number of labels; when working with 100+ labels, I'm a little more skeptical about how well it would work (especially if you don't already have a well-trained model before running `textcat.teach`). The problem is that `textcat.teach` assumes you have a model that can measure uncertainty well, that is, one that "knows" what it doesn't know. If you only have a small amount of data across all labels, and perhaps an imbalance for many of them, it's hard for `textcat.teach` to work well, because the model doesn't know what it doesn't know (i.e., it can't measure uncertainty well).
I would recommend a "bottom-up" approach, where you start with good/well-balanced labels, and then only add imbalanced/rare/poor-performing labels slowly:
- Before applying this to your entire label set (100+ labels), start with a small subset of labels (6-8) that you know have a good number of annotations. Train an initial model on only those labels. This will give you a solid benchmark model. If one or two of those labels aren't performing as well as you'd like, you can add more annotations for them and retrain a new model from scratch.
- Then, slowly expand by adding more labels in small groups. You'll first need to add them into your training, then use prior knowledge to focus on the labels that need the most help -- for example, ones that are severely imbalanced or where model performance is poor.
- Consider using `textcat.correct` instead of `textcat.teach` early on. `textcat.correct` will still show the model's predictions in the UI (which makes things a little easier, as your job is just to correct them). You can even pass a `--threshold` parameter so that only labels scoring above that threshold are pre-selected.
- Only use `textcat.teach` once you have a sufficient number of examples for that label. Also consider using patterns (see the docs) in combination. There's a sketch of both recipes after this list.
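Here's a rough sketch of what those recipes look like on the command line (the model path, dataset, source file, labels, and patterns file are all placeholders):

```bash
# Early on: correct the model's predictions; labels scoring above 0.5 are pre-selected
prodigy textcat.correct my_dataset ./model ./unlabeled.jsonl \
  --label SPORTS,POLITICS,TECH --threshold 0.5

# Later, for a label with enough examples: active learning, optionally seeded with patterns
prodigy textcat.teach my_dataset ./model ./unlabeled.jsonl \
  --label RARE_LABEL --patterns ./rare_label_patterns.jsonl
```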
Hope this helps!