I think I’m not understanding something basic about the API. If I need to categorize text into 20 classes, do I need to make 20 different datasets? Or do I need to pretrain a spaCy model to randomly output those classes first?
Prodigy supports annotating multiple classes or labels at once, so you can do something like:
prodigy textcat.teach my_dataset en_core_web_sm my_source.jsonl --label POLITICS,ECONOMY
You can always keep adding more examples of different labels to the same dataset. When you use the textcat.batch-train command, Prodigy will read all the classes present in your dataset and train on them.
When using Prodigy for text classification, there’s no explicit need for the spaCy model to know the classes beforehand. Depending on the data you’re working with and the classes you want to annotate, it might make sense to start off with a terminology list, which you can bootstrap using the terms.teach recipe. The list could either cover all classes, or you could create one for each class (depending on the data and how fine-grained the categories are). If you haven’t seen it yet, check out the end-to-end example of training an insults classifier with Prodigy. The example only covers two classes (“insult” and “not insult”), but the same approach should work for a multi-class task as well.
Ultimately, it all comes down to experimenting with what works best on your data – and Prodigy can hopefully help with that.
Btw, a quick note on the annotation strategy: To make the most of the binary annotation UI, we generally recommend not annotating too many classes at once, especially if they’re very different content-wise. Moving through the examples quickly works best if you (or the annotator) can focus on one objective at a time and don’t have to spend much time reading and analysing the annotation task. For example, if you’re annotating whether a text is about food or about cars, switching between those objectives on each decision can make annotation less effective, so it might be better to annotate both classes separately. (This is mostly a UX psychology consideration, though.)
Oh ok. Say when I give 3 classes and reject the annotation, what does it do? Is the underlying model forced to be a one-vs-rest configuration so that it can use that as training data or is that data just ignored?
On the annotation strategy I get the idea, but are there any studies backing it up? Is it faster to do 1000 30-way annotations vs 30 * 1000 binary annotations? And how do you manage the active learning part if you decompose it as one-vs-rest? One class may need only 100 examples to plateau but another may need 5000.
Partly I ask because I tried doing one of my classes as binary, but the true vs. false case was very skewed and the classifier just always predicted no, which defeated the active learning.
Hi Keith,
Thanks for the questions. In order:
1. How the multi-class classification works
The model supports potential “multi-tag” classification — so each class is a neuron in the final layer, with the output scores compressed using a logistic transform. You can see the network definition here:
Note that there are really two models defined here: a small model for learning quickly, and then a larger model for when you have more examples. In each model, the last weight layer is an Affine layer initialized to zero, with no dropout. The number of output neurons here matches the number of classes being predicted. Normally a softmax transform would apply across all of the classes, so that the scores sum to 1. We instead perform an elementwise logistic transform, and interpret each score >= t as a prediction of True. I suggest t=0.5 is usually sensible.
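To make the difference between the two output transforms concrete, here is a minimal NumPy sketch. It is not the actual network code; the label names, the logit values and the helper `predict_labels` are made up for illustration. It only shows that softmax scores sum to 1 across classes, while elementwise logistic scores are independent, so several classes can pass the threshold t at once.

```python
import numpy as np

def logistic(x):
    """Elementwise logistic (sigmoid) transform: each class is scored independently."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Softmax over the last axis: the class scores compete and sum to 1."""
    shifted = x - x.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

def predict_labels(logits, labels, t=0.5):
    """Return every label whose logistic score is >= t (hypothetical helper)."""
    scores = logistic(np.asarray(logits, dtype=float))
    return [label for label, score in zip(labels, scores) if score >= t]

# Illustrative values only, not taken from a real model.
labels = ["POLITICS", "ECONOMY", "SPORTS"]
logits = np.array([2.0, 1.0, -3.0])

print("softmax: ", softmax(logits))    # sums to 1, so the classes are mutually exclusive
print("logistic:", logistic(logits))   # each score lies in (0, 1) on its own
print("predicted:", predict_labels(logits, labels))
```

With these logits, both POLITICS and ECONOMY score above 0.5 under the logistic transform, which is what allows "multi-tag" predictions.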
2. Experimental evaluations
We plan to organise some experiments once the system is more stable — we don’t want to run the evaluation now and then have it invalidated by the next round of changes.
I think it’s important to make the experiment very directly evaluate the system being discussed. I’m always frustrated when tools or products claim “scientific” support from studies that address very different experimental setups from the tool itself. If we’re discussing usability, I don’t expect to see many linear relationships, which really limits the generality of any finding.
3. Imbalanced classes
The active learning should work really well for imbalanced classes. However, it’s important that it sees some positive examples at the start. If you have a look at Ines’s tutorial, you’ll see how to encourage that by first building a terminology list, and using that to help bootstrap the initial classifier.
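As a rough illustration of why a terminology list helps with a skewed class, here is a plain-Python sketch. It is not how Prodigy applies seed terms internally; the function `find_seed_matches`, the seed terms and the texts are all hypothetical. The idea is simply to surface texts that mention a seed term first, so that the annotator sees some likely positives early on.

```python
import re

def find_seed_matches(texts, seed_terms):
    """Split texts into those that mention a seed term and those that don't.

    Hypothetical helper for illustration only; matching is whole-word
    and case-insensitive.
    """
    if not seed_terms:
        return [], list(texts)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in seed_terms) + r")\b",
        re.IGNORECASE,
    )
    matched, unmatched = [], []
    for text in texts:
        (matched if pattern.search(text) else unmatched).append(text)
    return matched, unmatched

# Made-up seed terms and texts for a rare "vehicle" class.
seed_terms = ["car", "truck", "bike"]
texts = [
    "The weather was lovely today.",
    "My truck broke down on the highway.",
    "We had pasta for dinner.",
    "She rides her bike to work.",
]

matched, unmatched = find_seed_matches(texts, seed_terms)
queue = matched + unmatched  # likely positives are shown first
print(queue)
```

Putting the matches at the front of the queue is only one possible design; mixing matched and unmatched texts would avoid showing the annotator nothing but seed-term hits.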
So there’s a single model with logistic output? When only one class is annotated, do you only backprop the error from that one unit?
So say you annotate car=no but truck=? and bike=? would the target be like
[0, NaN, NaN]
or
[0, 0.5, 0.5]
or
[0, 0, 0]
Now that I’m looking at it, the model might benefit from using the range [-1., 1.] instead of [0., 1.]. Not sure whether I’ve tried that.
Let’s say there’s only one class, is_vehicle. If you have:
{'text': 'car', 'label': 'is_vehicle', 'answer': 'reject'},
{'text': 'truck', 'label': 'is_vehicle', 'answer': 'ignore'},
{'text': 'bike', 'label': 'is_vehicle', 'answer': 'ignore'}
The target will be [0.0], because the ignores are filtered out before the update is performed on the batch. If you have multiple classes:
{'text': 'car', 'label': 'is_vehicle', 'answer': 'reject'},
{'text': 'truck', 'label': 'is_vehicle', 'answer': 'ignore'},
{'text': 'bike', 'label': 'uses_road', 'answer': 'accept'}
The gradients will be zeroed for classes for which no feedback is provided. So you’ll get:
# Output scores:
'car': {'is_vehicle': 0.7, 'uses_road': 0.9},
'truck': {'is_vehicle': 0.6, 'uses_road': 0.87},
'bike': {'is_vehicle': 0.93, 'uses_road': 0.2}
# Gradient (score minus target, zeroed where no feedback was given):
'car': {'is_vehicle': 0.7, 'uses_road': 0.0},
'bike': {'is_vehicle': 0.0, 'uses_road': -0.8}
…But when I went to the implementation to link it:
This doesn’t look correct. The gradient looks wrong for the missing values.
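For reference, the intended behaviour described above (drop the ignores, then zero the gradient for any class that received no feedback) could be sketched in NumPy like this. It is not the spaCy implementation. The helper names are hypothetical, and taking the error signal as score minus target is an assumption made for the sketch.

```python
import numpy as np

LABELS = ["is_vehicle", "uses_road"]

def build_targets_and_mask(examples, labels):
    """Drop 'ignore' answers, then build a target matrix and a feedback mask.

    mask[i, j] is 1.0 only where example i carries an accept/reject
    answer for label j; everything else counts as missing.
    """
    kept = [eg for eg in examples if eg["answer"] != "ignore"]
    targets = np.zeros((len(kept), len(labels)))
    mask = np.zeros((len(kept), len(labels)))
    for i, eg in enumerate(kept):
        j = labels.index(eg["label"])
        targets[i, j] = 1.0 if eg["answer"] == "accept" else 0.0
        mask[i, j] = 1.0
    return kept, targets, mask

def masked_gradient(scores, targets, mask):
    """Error signal (score - target), zeroed wherever no feedback was given."""
    return (scores - targets) * mask

examples = [
    {"text": "car", "label": "is_vehicle", "answer": "reject"},
    {"text": "truck", "label": "is_vehicle", "answer": "ignore"},
    {"text": "bike", "label": "uses_road", "answer": "accept"},
]

kept, targets, mask = build_targets_and_mask(examples, LABELS)

# Scores for the kept examples ('car' and 'bike'), as in the example above.
scores = np.array([[0.7, 0.9],
                   [0.93, 0.2]])

grad = masked_gradient(scores, targets, mask)
print([eg["text"] for eg in kept])
print(grad)
```

Multiplying by the mask has the same effect as zeroing the gradient entries by hand, and it keeps the missing values from ever being treated as rejects.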
Hi Matthew,
Are the gradient issues sorted out for the missing values? (ner, tagger, and textcat)
If I were to do this in PyTorch, the strategy would be to add a zero mask where the values are missing? (instead of futzing with zero-ing gradients)
I did fix that bug in the textcat, yes: https://github.com/explosion/spaCy/blob/develop/spacy/pipeline.pyx#L936 . It should be working in the tagger too.
I recently made improvements to the way missing values are handled in the parser and NER as well, but I’m not sure they’ll be relevant to Prodigy. They’re on the develop branch and will be released in the forthcoming v2.1.0a0, which will be published on spacy-nightly.
With the spaCy update, would it be possible to pass in REJECT samples in the nlp.update API for the NER model?