Merge annotations for multi-label classification tasks (labels not mutually exclusive)

I'm having trouble understanding how to merge the annotations of a multi-label classification task.
My approach is to annotate one label at a time using textcat.teach and then merge the datasets.
However, the resulting dataset might include annotations where I have not yet reviewed a specific label.
Text1 - label_a=accepted, label_b=rejected (seen 2x)
Text2 - label_b=accepted, label_a=not yet reviewed (seen only when labeling b)

Is there a way to ensure all texts have been reviewed for all labels?

Another approach would be to use the choice interface and label each text for all labels at once. But then, if a new label emerges (e.g. when a new label is added to a class hierarchy), there must be a way to review the data that has already been annotated.


The main goal of textcat.teach is to help you select the most relevant examples for annotation, by focusing on the examples the model is most uncertain about. This also means that less relevant examples will be skipped in favour of others – so you won't end up seeing every example. If your goal is to have an annotation for every label for every example, there's not really a point in asking the model to select examples for you, because you want to see every example for every label anyways.

It can still be a good strategy to focus on one label at a time, though, e.g. using textcat.manual with only one label. This makes it easier to work incrementally, add and remove labels, and test different combinations of labels. For complex or abstract label schemes, it can also help you focus and collect better data, because you only need to think about one label at a time, not all of them at once.

Btw, if you have an existing binary dataset (like the one you created with textcat.teach) and want to check which labels are still missing for which examples, you can use the _input_hash value of the example to identify annotations on the same texts, and then check the "label" and "answer" value. You can then queue up the unannotated text + label combinations for manual annotation again.
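As a rough illustration of the check described above, here is a minimal sketch in plain Python. It assumes the annotations have already been loaded as a list of dicts (for example from a JSONL export via prodigy db-out) and that each one carries "text", "_input_hash", "label" and "answer". The function name find_missing and the sample data are made up for the example.

```python
from collections import defaultdict


def find_missing(examples, labels):
    """Return (text, label) pairs that have no accept/reject decision yet.

    Hypothetical helper: `examples` are binary annotations as plain dicts.
    """
    reviewed = defaultdict(set)  # _input_hash -> labels with a decision
    texts = {}                   # _input_hash -> text, in first-seen order
    for eg in examples:
        texts.setdefault(eg["_input_hash"], eg["text"])
        # "ignore" answers carry no decision, so they count as not reviewed
        if eg.get("answer") in ("accept", "reject"):
            reviewed[eg["_input_hash"]].add(eg["label"])
    return [
        (text, label)
        for input_hash, text in texts.items()
        for label in labels
        if label not in reviewed[input_hash]
    ]


# Sample data mirroring the situation in the question
examples = [
    {"text": "Text1", "_input_hash": 1, "label": "label_a", "answer": "accept"},
    {"text": "Text1", "_input_hash": 1, "label": "label_b", "answer": "reject"},
    {"text": "Text2", "_input_hash": 2, "label": "label_b", "answer": "accept"},
]
missing = find_missing(examples, ["label_a", "label_b"])
print(missing)  # [('Text2', 'label_a')]
```

Each returned pair could then be written out as a task like {"text": ..., "label": ...} and queued up for annotation again.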

That's also an option, yes. You can always re-annotate existing datasets and Prodigy will use the existing annotations to pre-populate the UI. For example, let's say you've created a dataset with textcat.manual and labels A, B and C. When you introduce label D, you can re-run textcat.manual, load in dataset:your_existing_dataset as the input source, set --label A,B,C,D and save the results to a new dataset. Any existing annotations for A, B, and C will be displayed in the UI, and you can add annotations for label D (or even correct the previous annotations, if the introduction of label D requires that). If you make a mistake, you can delete the dataset and start over.


Got it, thanks!

I've tried running textcat.manual new_dataset dataset:existing_dataset --label A,B,C,D but the UI is not pre-populated.

The existing dataset for example has this entry:


Ah, sorry, in my suggestion here I had assumed that the annotations would all come from textcat.manual with multiple labels (and as a result, view ID choice), so the task structure would be the same and the accepted options in the key "accept" would be passed through and pre-populated.

If the dataset has examples created with the classification UI and just a single "label" with an answer, that wouldn't work out-of-the-box. But you could use a bit of custom logic here to collect all examples with the same _input_hash (annotations on the same text), and then generate a list under the key "accept" that contains all "label" values that were accepted.
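A minimal sketch of that custom logic could look like the following. It assumes the binary annotations are available as a list of plain dicts with "text", "_input_hash", "label" and "answer", and that the choice-style tasks should carry "options" entries with an "id" and "text" plus the "accept" list. The function name binary_to_choice and the sample data are only for illustration.

```python
def binary_to_choice(examples, labels):
    """Merge binary annotations on the same text into choice-style tasks.

    Hypothetical helper: groups by _input_hash and collects accepted labels.
    """
    merged = {}  # _input_hash -> merged task, in first-seen order
    for eg in examples:
        task = merged.setdefault(
            eg["_input_hash"],
            {
                "text": eg["text"],
                "options": [{"id": label, "text": label} for label in labels],
                "accept": [],
            },
        )
        # Only accepted labels end up pre-selected
        if eg.get("answer") == "accept" and eg["label"] not in task["accept"]:
            task["accept"].append(eg["label"])
    return list(merged.values())


# Sample binary annotations on two texts
examples = [
    {"text": "Text1", "_input_hash": 1, "label": "A", "answer": "accept"},
    {"text": "Text1", "_input_hash": 1, "label": "B", "answer": "reject"},
    {"text": "Text2", "_input_hash": 2, "label": "B", "answer": "accept"},
]
tasks = binary_to_choice(examples, ["A", "B", "C", "D"])
for task in tasks:
    print(task["text"], task["accept"])
```

One thing to keep in mind with this conversion: a rejected label and a label that was never reviewed both end up as "not selected", so the merged data no longer tells you which of the two it was.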