Recipe for labeling pairs of annotation tasks or entire annotation task

Hello Prodigy community,

I have an annotation task that should be fairly simple, but doesn’t appear to be supported out of the box. I’m hoping somebody who has dealt with something similar can help out.

For this task, I want to compare two strings of text and assign one of five labels to the pair. Let’s say these are 1, 2, 3, 4, or 5. Only one label will ever be assigned to the pair and every pair will have a label. Ideally, I would like it if Prodigy could randomly draw two samples from a dataset and avoid pairs that have already been selected (similar to the --memorize flag that the mark recipe can take). Since I will just be labeling a relationship between the pair (and not tokens within them), I would like to be able to just hit the number button corresponding to the correct label and automatically move to the next pair.

If that’s the best-case scenario, here’s the workaround I’ve been able to come up with. I could randomly pair and concatenate strings outside of Prodigy and prepend each pair with some sort of flag token (like “label”). I could then use ner.manual with my five labels, and for each concatenated string select the appropriate label and highlight just the word “label”. I could then export my annotations and programmatically label each pair based on that consistent information. This is obviously not as good a solution, since it requires selecting a label, actually highlighting a token, and a considerable amount of work with text files outside of Prodigy.

Please let me know if you’ve solved a similar problem or if you have ideas on how to approach this.

Thank you,

Tyler

From what you describe, this sounds like it could work well as a choice task? You can loop over your texts, generate all possible pairs and create an annotation task for each of them. Then you add your labels as "options", so they become selectable multiple-choice options. To make sure you’re only sending out each pair once, you can keep a record of pairs that were already seen. Ideally, you don’t want to store the whole texts there – using a hash instead keeps the record small.

Here’s an example of how you could put together the tasks:

from spacy.strings import hash_string  # or any other hash function

labels = ["1", "2", "3", "4", "5"]
texts = [ ... ]  # all texts you want to pair up 
# Keep a record of pairs we've already seen
seen = set()
# Create a multiple choice option in Prodigy's format for each label
options = [{"id": label, "text": label} for label in labels]

def get_stream():
    for text1 in texts:
        hash1 = hash_string(text1)
        for text2 in texts:
            hash2 = hash_string(text2)
            # Skip pairs of a text with itself and pairs we've already seen
            if hash1 != hash2 and (hash1, hash2) not in seen:
                # Add both orderings of the pair to the seen examples
                seen.add((hash1, hash2))
                seen.add((hash2, hash1))
                # Create a HTML string of the two texts. This is what the
                # annotator will see. In addition, we'll also keep
                # the plain texts 1 and 2 in the data
                html = "{}<br /><br />{}".format(text1, text2)
                task = {"html": html, "text1": text1, "text2": text2, "options": options}
                yield task

I used the nested loops here to make the logic easier to follow – it might not be the most efficient solution for large lists of texts – but if you search for ways to create pairs from lists in Python, you’ll find various implementations, since this is a solved algorithmic problem. For example, itertools.combinations from the standard library generates each unordered pair exactly once (see the sketch below).
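
Here’s a minimal sketch of the same stream using itertools.combinations, assuming the texts and options defined above. Because each unordered pair is only generated once and a text is never paired with itself, the seen set isn’t needed:

from itertools import combinations

def get_stream():
    # combinations() yields each unordered pair of texts exactly once
    # and never pairs a text with itself
    for text1, text2 in combinations(texts, 2):
        html = "{}<br /><br />{}".format(text1, text2)
        yield {"html": html, "text1": text1, "text2": text2, "options": options}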

The above logic will create data in the following format. When you annotate the examples with the choice interface, Prodigy will add an "accept" key to the task, containing a list of the accepted option ID(s). So when you export the annotated data, it should be very straightforward to extract the text pairs and their selected label(s). The nice thing about the custom "text1" and "text2" keys we’ve added is that they’ll just be passed through when the data is saved to the database – this lets you attach custom meta or plain-text versions of the data that make it easier to resolve the annotations later on.

{
    "html": "Text one<br /><br />Text two",
    "text1": "Text one",
    "text2": "Text two",
    "options": [{"id": "1", "text": "1"}, {"id": "2", "text": "2"}]  # etc.
}
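
For example, once you’ve exported the annotated dataset with db-out, resolving the pairs and labels could look like this – the file name and dataset name here are just placeholders:

import json

# Exported annotations, e.g. via: prodigy db-out your_dataset > annotations.jsonl
with open("annotations.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        # "accept" contains the selected option ID(s), e.g. ["3"]
        if eg.get("answer") == "accept" and eg.get("accept"):
            label = eg["accept"][0]  # only one label per pair in this workflow
            print(label, eg["text1"], eg["text2"])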

In your custom recipe, you can set "view_id": "choice" and define more settings in the "config" returned by the recipe. For example, "choice_style": "multiple" will allow multiple selections, and "choice_auto_accept": True will automatically accept and submit a task once a label is selected.
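
Putting it all together, a minimal custom recipe could look something like this – the recipe name is just a placeholder, and get_stream is the generator from above:

import prodigy

@prodigy.recipe("compare-pairs")
def compare_pairs(dataset):
    return {
        "dataset": dataset,        # dataset the annotations are saved to
        "stream": get_stream(),    # the generator defined above
        "view_id": "choice",       # use the choice interface
        "config": {
            "choice_style": "single",    # only one label per pair
            "choice_auto_accept": True,  # submit as soon as a label is selected
        },
    }

You can then run it with prodigy compare-pairs your_dataset -F recipe.py. With auto-accept enabled, hitting the number key for an option will select it and move on to the next pair, which matches the workflow you described.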

Some example recipes from the prodigy-recipes repo for more inspiration:

  • question_answering.py – Annotate question/answer pairs with a custom HTML interface.
  • mark.py – Simplified version of the mark recipe with explanations.