Custom multilabel categorization recipe

Hi,

I want to write my own recipe for multi-label text classification.
I have 4 categories and a single text can belong to multiple of them.

I was wondering if it’s possible to take advantage of active learning in this case as well and, if so, how I should store the annotations (multiple labels) for a given text as a single entry in the database, since Prodigy works with binary annotations for each label by default. What’s important for me is that the text is displayed only once and all the corresponding categories are assigned at that time.

If that’s not possible, I could probably still use it as an annotation tool that would export the data and combine the single labels into a format that spaCy can handle.
From what I’ve read, spaCy can work with multi-label data, such as cats = {"classA": 1.0, "classB": 0.0, "classC": 0.0, "classD": 1.0}, right?

Yes, spaCy’s text classifier supports multiple, non-exclusive categories. We usually recommend making several passes over the data, one for each category. This is often still faster, because the annotator only has to think about one label.

But it all really depends on the use case, so if you want to annotate all labels at once, the "choice" interface could be a good option. The custom recipes workflow has an end-to-end example of this.

To allow multiple selections, you also want to add 'config': {'choice_style': 'multiple'} to the components returned by your custom recipe. The data you collect from the recipe will have an added "accept" key with a list of all selected labels. For example:

{
    "text": "This is a text",
    "options": [
        {"id": "CLASS_A", "text": "Class A"},
        {"id": "CLASS_B", "text": "Class B"},
        {"id": "CLASS_C", "text": "Class C"}
    ],
    "answer": "accept",
    "accept": ["CLASS_B", "CLASS_C"]
}

If you want to train directly with spaCy, you can then convert that data to the cats format:

labels = ['CLASS_A', 'CLASS_B', 'CLASS_C']
training_data = []

for eg in collected_annotations:
    accepted = eg['accept']
    text = eg['text']
    # dictionary of all labels – if label is in accepted list, value is
    # set to True, otherwise it's set to False
    cats = {label: label in accepted for label in labels}
    training_data.append((text, {'cats': cats}))

Alternatively, you can also generate examples in Prodigy’s binary annotation style and use the built-in textcat.teach recipe. Here, you could just duplicate each example for each label, add the "label" value and set it to "answer": "accept" or "answer": "reject", depending on whether the label was selected or not.
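A minimal sketch of that duplication step, in plain Python. The function name to_binary_examples is only illustrative and not part of Prodigy, and the sketch assumes each collected example has the "text", "answer" and "accept" keys shown above:

```python
def to_binary_examples(annotations, labels):
    """Create one binary-style example per label for every accepted task."""
    binary = []
    for eg in annotations:
        # Only tasks submitted with "accept" carry a meaningful selection;
        # ignored or rejected tasks are skipped in this sketch.
        if eg.get("answer") != "accept":
            continue
        accepted = set(eg.get("accept", []))
        for label in labels:
            binary.append({
                "text": eg["text"],
                "label": label,
                "answer": "accept" if label in accepted else "reject",
            })
    return binary


labels = ["CLASS_A", "CLASS_B", "CLASS_C"]
collected_annotations = [
    {"text": "This is a text", "answer": "accept",
     "accept": ["CLASS_B", "CLASS_C"]},
]
binary_examples = to_binary_examples(collected_annotations, labels)
for eg in binary_examples:
    print(eg["label"], eg["answer"])
```

Each text ends up with one entry per label, so a text with three labels produces three binary examples.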


Thank you, this is really helpful!
Can I also modify textcat.batch-train to work with such data as it doesn’t seem to handle it out-of-the-box?
One option would be to create a new datapoint for each label that follows binary annotation style and then use that with batch-train function, but it seems like a suboptimal solution.
I was also wondering how the model updates are done (the active learning part) in a multi-choice setup?

Yes, that’s what I would suggest. The model allows the categories to be non-exclusive, so it expects one text plus label for each example. So using a script similar to the one above, you could create a set in the binary style, and then train the classifier from that. (It’s a bit verbose, but the reason the batch training works like that is that it also allows you to easily train from sparse annotations.)

If you want to use the choice interface with a model in the loop, you’d have to write your own recipe based on textcat.teach. The update part isn’t actually that different – when the answers come back to the server, you’d convert each example to the binary format for each label, and then call the model’s update method. The updating that’s done in the loop follows the same mechanism as the updating when you run textcat.batch-train. The only difference is that the batch training makes several passes over the data and uses some other training tricks to achieve better results.
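As a rough sketch of what such an update callback could look like: DummyModel below is a hypothetical stand-in that only records what it receives, and make_update is an illustrative helper, not a Prodigy function. In a real recipe you would pass in the model object used by your recipe and return the callback under the 'update' key of the recipe components; the exact input the real model expects is something to check against textcat.teach.

```python
class DummyModel:
    """Hypothetical stand-in for a model; it only records the updates."""

    def __init__(self):
        self.seen = []

    def update(self, examples):
        self.seen.extend(examples)


def make_update(model, labels):
    """Build a callback that converts choice answers to binary examples."""
    def update(answers):
        binary = []
        for eg in answers:
            if eg.get("answer") != "accept":
                continue  # skip ignored or rejected tasks
            accepted = set(eg.get("accept", []))
            for label in labels:
                binary.append({
                    "text": eg["text"],
                    "label": label,
                    "answer": "accept" if label in accepted else "reject",
                })
        if binary:
            model.update(binary)
        return len(binary)
    return update


model = DummyModel()
update = make_update(model, ["CLASS_A", "CLASS_B"])
n_updated = update([
    {"text": "This is a text", "answer": "accept", "accept": ["CLASS_B"]},
])
```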

The part that’s a bit trickier is the sorting of the stream. One of the reasons the built-in textcat.teach focuses on one label at a time is that it also makes it easy to define a clear objective: we can get the score for label X, and sort by uncertain predictions. If you want to look at all labels at once, defining the objective becomes more difficult. (The best solution I can currently think of is to keep focusing on one label and use its score to determine whether to show the example or not – but also output all other scores, so you can pre-select the other labels in the choice interface. But I haven’t tried this yet).
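To make that idea a bit more concrete, here is an untested sketch in plain Python. The names prefer_uncertain and score_fn, and the margin and threshold values, are all assumptions for illustration: score_fn stands for whatever returns a dict of label scores for a text. The sketch also assumes that a pre-filled "accept" list on the task is what pre-selects options in the choice interface, which is worth verifying before relying on it.

```python
def prefer_uncertain(stream, score_fn, focus_label, margin=0.25, threshold=0.5):
    """Yield tasks whose focus-label score is close to 0.5.

    The other labels are pre-selected if their score reaches the threshold.
    """
    for task in stream:
        scores = score_fn(task["text"])
        # Skip examples where the model is already confident about the
        # focus label, in either direction.
        if abs(scores[focus_label] - 0.5) > margin:
            continue
        task = dict(task)  # copy, so the incoming task is not modified
        task["accept"] = [
            label for label, score in scores.items()
            if label != focus_label and score >= threshold
        ]
        task["meta"] = {"scores": scores}
        yield task


def fake_scores(text):
    """Hypothetical scorer used only to demonstrate the filter."""
    if "unsure" in text:
        return {"CLASS_A": 0.55, "CLASS_B": 0.9, "CLASS_C": 0.1}
    return {"CLASS_A": 0.95, "CLASS_B": 0.9, "CLASS_C": 0.1}


stream = [{"text": "an unsure text"}, {"text": "a confident text"}]
selected = list(prefer_uncertain(stream, fake_scores, "CLASS_A"))
```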

TL;DR: If you want to annotate all labels manually and at once in the "choice" interface, a “static” approach is probably best. At least, there are many considerations and problems to solve to make it work efficiently with a model in the loop.

import prodigy
from prodigy.components.loaders import JSONL


@prodigy.recipe('sentiment',
    dataset=prodigy.recipe_args['dataset'],
    file_path=("Path to texts", "positional", None, str))
def sentiment(dataset, file_path):
    """Annotate the sentiment of texts using different mood options."""
    stream = JSONL(file_path)     # load in the JSONL file
    stream = add_options(stream)  # add options to each task

    return {
        'dataset': dataset,   # save annotations in this dataset
        'view_id': 'choice',  # use the choice interface
        'stream': stream,
        'config': {'choice_style': 'multiple'}  # allow several selections
    }


def add_options(stream):
    """Helper function to add options to every task in a stream."""
    options = [{'id': 'skill', 'text': '😀skill'},
               {'id': 'experience', 'text': 'experience'}]
    for task in stream:
        task['options'] = options
        yield task

prodigy sentiment sentences.jsonl -F recipe.py

I am still not getting multiple-choice options.
Could you please help me with that?

@vajja Hmm, that’s strange – I just ran your exact recipe and here’s how the options look for me:

One thing that might be happening: Check your .prodigy/prodigy.json file and see if it has a "choice_style": "single" setting in it. If so, remove it and try again. The settings in your prodigy.json are the global settings, so those currently override the defaults set in the recipes.
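If you’d rather check this from a script, something like the following would do. The helper name find_choice_style is just for illustration, and the sketch assumes the file lives in a .prodigy directory inside your home directory; adjust the path if yours is somewhere else.

```python
import json
from pathlib import Path


def find_choice_style(config_path):
    """Return the global "choice_style" setting, or None if it isn't set."""
    path = Path(config_path)
    if not path.exists():
        return None
    with path.open(encoding="utf8") as f:
        settings = json.load(f)
    return settings.get("choice_style")


if __name__ == "__main__":
    # Assumed default location of the global settings file
    config_path = Path.home() / ".prodigy" / "prodigy.json"
    print("Global choice_style:", find_choice_style(config_path))
```

If this prints "single", the global setting is what overrides the "multiple" style in the recipe.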

Thanks, I updated prodigy.json and it’s working now.


Hi Ines, thank you for your work. In this case "choice_style": "multiple", it is really nice to have the keyboard shortcuts!! I use those or click on the radio buttons to select. However, what do the standard Prodi.gy GREEN, RED, IGNORE, UNDO mean in this case?? (see https://prodi.gy/demo?view_id=textchoice)

They seem undefined especially if I have 2+ classes and GREEN/RED is binary. Two questions:

  • What should these buttons mean to me/my annotators? They’re too distracting to ignore.
  • I’m not a UI programmer (which is why I love Prodi.gy). Is there an easy way to get rid of them, so that my UI is nice and clean (single-minded on the multi-choice task)?


Thanks!

By default, the buttons are consistent and the same across all interfaces. For the choice interface, “accept” is of course there to submit the answer, and “undo” can be used to go back to the previous example, e.g. if you’ve made a mistake and want to correct your answer. “ignore” is typically used to skip an example for whatever reason, for instance if the annotator finds the question confusing. This makes it easy to go back over those examples separately later on, clear up confusion, see what the problems are, reannotate them etc.

The “reject” button is a bit less clearly defined in this case. In interfaces like ner.manual, users often use it to specifically mark examples that are wrong and/or broken. For instance, if the tokenization is messy and the desired span can’t be selected. It can also be used to create negative examples. For instance, in the new textcat.manual workflow, we also use the multiple choice interface. If you select a label and reject the task, you can basically say: I know that this label is incorrect.

That said, I do see your point about hiding buttons that would otherwise really confuse the annotators. However, I’d suggest only hiding the “reject” button in that case; all other buttons do have their place, and I think you want to keep the actions for “ignore” and “undo”.

Yes, that should hopefully be very straightforward with a bit of CSS in the "global_css" setting in the "config" returned by your recipe (where you also put the "choice_style" setting). Whenever you start a recipe, the main page will receive data attributes with the current recipe name and interface ID. This lets you apply styles only for specific recipes or interfaces. The button row has the class .prodigy-buttons, so you can target that as well.

Visually, there are two options here: 1) Hide the button completely. 2) Make it grey and unclickable. The first option is easier, the second a bit more consistent, because the annotators do not have to get used to the buttons being in different positions. So if you’re clicking, the buttons aren’t suddenly in a different spot.

Here’s the CSS for hiding the button – it’s only applied if the Prodigy view ID is "choice" and will hide the second button in the button row:

[data-prodigy-view-id="choice"] .prodigy-buttons button:nth-child(2) {
    display: none;
}

Here’s the code for making it grey and unclickable:

[data-prodigy-view-id="choice"] .prodigy-buttons button:nth-child(2) {
    background: #b9b9b9 !important;  /* make it grey */
    opacity: 0.5;  /* make it half transparent */
    cursor: not-allowed;  /* show a "not allowed" cursor on hover */
    pointer-events: none;  /* disable clicking */
}
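
For reference, here is roughly how the CSS could be wired into the recipe’s config via "global_css". This is a sketch only: build_components is a hypothetical helper that assembles the dictionary your recipe function would return, and the constant name is arbitrary.

```python
# CSS that hides the second button in the button row for the choice interface
HIDE_REJECT_CSS = """
[data-prodigy-view-id="choice"] .prodigy-buttons button:nth-child(2) {
    display: none;
}
"""


def build_components(dataset, stream):
    """Assemble the components a recipe would return (illustrative helper)."""
    return {
        "dataset": dataset,
        "view_id": "choice",
        "stream": stream,
        "config": {
            "choice_style": "multiple",
            "global_css": HIDE_REJECT_CSS,
        },
    }


components = build_components("my_dataset", [])
```

Swapping in the grey, unclickable variant only means replacing the CSS string.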

Thank you for the explanation and options to try out! I will report when complete!
