I am trying to use Prodigy to annotate text documents with multiple classification labels (each observation can have more than 1 label). I used a custom recipe with the Choice interface to generate the first annotations with all labels.
Now, I would like to use the Teach recipe with active learning to annotate each of the labels individually. Let's say I have 4 labels: A, B, C, D. I will use Teach with one label at a time, and I want Prodigy to suggest the documents that are most likely to belong to the respective label for me to accept or reject. I want to do this because it will be a faster way to annotate and a better use of Prodigy.
How can I go from my initial dataset of annotations created with the Choice interface to using Teach with a single label (e.g. label A) to generate more annotations? (and then repeating the process for label B, label C, label D)
Ideally, you probably want to use your existing annotations to pre-train a model and then use that as the base model for textcat.teach and improve it further. For this step, you might actually find it easier to train the text classifier directly with spaCy.
Once you've collected annotations using your custom recipe, you can export the data to a file:
prodigy db-out your_dataset > data.jsonl
Each entry will probably look something like this:
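{"text": "This is a document", "options": [{"id": "A", "text": "A"}, {"id": "B", "text": "B"}, {"id": "C", "text": "C"}, {"id": "D", "text": "D"}], "accept": ["A", "C"], "answer": "accept"}
(This is a simplified sketch – the exact fields depend on your custom recipe, but choice annotations typically store the selected option IDs in "accept" and the overall decision in "answer".)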
To train the text classifier (and pretty much all other components) in spaCy, you need a text and a dictionary of the annotations – for the textcat component, that's the "cats" and the label values mapped to booleans (whether the label applies):
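For example, with the four labels from your setup, one converted training example could look like this:
("This is a text about A and C", {"cats": {"A": True, "B": False, "C": True, "D": False}})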
Getting this data from your choice annotations should hopefully be a pretty straightforward data transformation. You can find more details and examples of spaCy's training loop on this page. You can start off with a blank model or with a pre-trained model (if you also want to keep the other pre-trained components like the tagger and parser).
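Here's a minimal conversion sketch. It assumes your option IDs are the label names themselves, and that options the annotator didn't select can be treated as False (which is reasonable for choice annotations, since the annotator saw all options). It builds the training_data list used in the training loop below:

import json

labels = ("A", "B", "C", "D")
training_data = []
with open("data.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        if eg.get("answer") != "accept":
            continue  # skip rejected and ignored examples
        selected = set(eg.get("accept", []))
        cats = {label: label in selected for label in labels}
        training_data.append((eg["text"], {"cats": cats}))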
Here's a quick summary of the training process that shows the concept. The detailed example in the docs also shows how you can evaluate your model during training. Since you're only pre-training the model so you can improve it later, it's okay if it isn't perfect.
import random
import spacy

nlp = spacy.load('en_core_web_sm')  # or: spacy.blank('en') for a blank model
textcat = nlp.create_pipe('textcat')  # create new text classifier
for label in ('LABEL_ONE', 'LABEL_TWO'):  # add labels
    textcat.add_label(label)
nlp.add_pipe(textcat)  # add text classifier to the pipeline

# only train the text classifier (these pipes exist in en_core_web_sm –
# skip the disable_pipes block if you started from a blank model)
with nlp.disable_pipes('tagger', 'parser', 'ner'):
    optimizer = nlp.begin_training()  # initialize the weights
    for n in range(20):  # number of iterations
        losses = {}
        random.shuffle(training_data)  # shuffle converted data
        batches = spacy.util.minibatch(training_data)  # batch up data
        for batch in batches:
            texts, annotations = zip(*batch)
            # update the model
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
        print("Losses", losses)

nlp.to_disk('/path/to/model')  # save the model to a directory
The pre-trained model will be saved to /path/to/model and you can pass this directory in as the spacy_model argument when you run Prodigy. For example:
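prodigy textcat.teach your_dataset /path/to/model your_data.jsonl --label LABEL_ONE
(Here, your_dataset and your_data.jsonl are placeholders for the dataset you want to save the annotations to and the raw texts you want to annotate.)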
After you've collected some annotations, you can run Prodigy's textcat.batch-train with the custom pre-trained model as the base model and see how your binary annotations improve the model.
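That command could look something like this (dataset name and paths are placeholders):
prodigy textcat.batch-train your_dataset /path/to/model --output /path/to/improved-model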
Hi Ines! Thanks a lot for your help. I have managed to follow your suggestion step-by-step until the end.
I converted the data, trained a multilabel model with spacy, saved the model to disk, and used it as a pre-trained model to help me annotate with textcat.teach for one of the labels.
I have 2 follow-up questions:
If I generate binary annotations with textcat.teach and then run textcat.batch-train to train the multi-label model on those binary annotations, what exactly is the model being trained on? Does Prodigy assume that all other labels are False?
While doing binary annotation for LABEL_ONE, all the documents textcat.teach suggests are negative examples (none of them are LABEL_ONE). How can I make textcat.teach suggest the observations with the highest probability of belonging to LABEL_ONE, so that I can generate more positive annotations instead of getting the most uncertain scores? I know there must be a lot of value in choosing the uncertain ones, but it doesn't seem ideal when you have an unbalanced multi-label dataset and you're just getting started.
The default text classification model (via spaCy) assumes that categories are not mutually exclusive – so if you update the model with a text plus a category, the update is only performed for that label and all other labels are treated as unknown / missing values. Prodigy uses the same approach for binary NER annotations btw – my slides here show an example of this process.
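Conceptually, accepting a textcat.teach suggestion for LABEL_ONE corresponds to an update like this (a sketch of the idea, not Prodigy's literal internals):
nlp.update([text], [{"cats": {"LABEL_ONE": True}}])  # other labels are absent, i.e. treated as missing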
Yeah, this sounds reasonable. The uncertainty sampling is performed by the prefer_uncertain sorter, which takes a stream of (score, example) tuples and yields examples. Under the hood, it uses an exponential moving average to determine whether to send out an example or not. Instead of prefer_uncertain, you can also use the prefer_high_scores sorter, which has the same API, but prioritises high scores.
So in recipes/textcat.py, you could update the teach recipe like this:
from prodigy.components.sorters import prefer_high_scores
# in the recipe:
stream = prefer_high_scores(model(stream))
Our prodigy-recipes repo also has a simplified version of the textcat.teach recipe with a bunch of comments explaining what's going on, so you might find it useful as a starting point for writing your own custom version: