Managing annotations/datasets

For a large annotation project, we’ve occasionally wound up with duplicate inputs in a dataset (that is, the same text being classified on the same label). Usually both annotations carry the same decision for that task (accept or reject), but for tough annotations we can end up with two different decisions (one accept and one reject).

The only solution I’ve come up with so far is to write a set of SQL views and triggers that expose the contents of the JSON field and let me edit the data through that view, but doing that is going to be a bit of work.

Are there any somewhat painless ways to deal with a dataset that might’ve gathered these duplicate (and potentially conflicting) annotations?

Handling conflicting annotations can be tricky, because a lot of it also comes down to finding the best general strategy to resolve those conflicts. If you’re dealing with a large volume of annotations where the value of the individual example is lower, it can make sense to use a general policy that only includes annotations in your training set if at least X% of your annotators agree.
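
If you go that route, a minimal sketch of such a policy could look like the following (assuming each annotation carries "_input_hash", "label" and "answer" fields, and treating the 75% threshold as just an example):

from collections import Counter, defaultdict

def filter_by_agreement(examples, threshold=0.75):
    # Group all decisions made on the same (text, label) task.
    by_task = defaultdict(list)
    for eg in examples:
        by_task[(eg["_input_hash"], eg["label"])].append(eg)
    for task_examples in by_task.values():
        decisions = Counter(eg["answer"] for eg in task_examples)
        answer, count = decisions.most_common(1)[0]
        # Keep one copy of the task if enough annotators agree on the decision.
        if count / len(task_examples) >= threshold:
            winner = dict(task_examples[0])
            winner["answer"] = answer
            yield winner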

If you want to review the conflicting annotations, automating as much as possible is always good. You should actually be able to come up with a pretty straightforward review process using Prodigy and a script that programmatically creates your final training set. Prodigy’s JSONL format is usually pretty convenient here, because it’s easy to analyse and manipulate in Python.
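
For instance, to collect the conflicts in the first place, you could load the dataset via the database API and group the examples by their input hash. A rough sketch (the dataset name is a placeholder, and it assumes binary accept/reject text classification annotations):

from collections import defaultdict
from prodigy.components.db import connect

db = connect()  # uses the database settings from your prodigy.json
examples = db.get_dataset("my_textcat_dataset")  # placeholder dataset name

# Group all annotations made on the same input text.
by_input = defaultdict(list)
for eg in examples:
    by_input[eg["_input_hash"]].append(eg)

# A group is conflicting if the same label was both accepted and rejected,
# i.e. there are more distinct (label, answer) pairs than distinct labels.
conflicting_annotations = []
for egs in by_input.values():
    decisions = {(eg["label"], eg["answer"]) for eg in egs}
    labels = {eg["label"] for eg in egs}
    if len(decisions) > len(labels):
        conflicting_annotations.append(egs)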

For example, if you’re doing text classification, you could extract the conflicting annotations from your dataset and write a custom recipe that presents those tasks in the "choice" interface, similar to the example here. This lets you view the different label options and select the one that you think should be the correct answer.

def get_stream():
    # conflicting_annotations: groups of examples that share the same input text
    for examples in conflicting_annotations:
        text = examples[0]['text']
        orig_input_hash = examples[0]['_input_hash']
        # one multiple-choice option per label seen in the group (deduplicated)
        labels = sorted({eg['label'] for eg in examples})
        options = [{'text': label, 'id': label} for label in labels]
        task = {'text': text, 'orig_input_hash': orig_input_hash, 'options': options}
        yield task
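
To serve that stream in the "choice" interface, a small custom recipe is enough. A minimal sketch (the recipe name and dataset argument are placeholders):

import prodigy

@prodigy.recipe(
    "review-conflicts",
    dataset=("Dataset to save the review decisions to", "positional", None, str),
)
def review_conflicts(dataset):
    return {
        "dataset": dataset,                    # save the answers to this dataset
        "stream": get_stream(),                # the generator defined above
        "view_id": "choice",                   # render the options as multiple choice
        "config": {"choice_style": "single"},  # only one label can be selected
    }

You could then run it with something like prodigy review-conflicts conflicts_reviewed -F recipe.py.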

This will produce annotation data in the following format:

{
    "text": "Some text here",
    "options": [
        {"text": "FOOTBALL", "id": "FOOTBALL"}, 
        {"text": "SOCCER", "id": "SOCCER"}
    ],
    "orig_input_hash": 1234567,
    "answer": "accept",
    "accept": ["SOCCER"]
}

The "accept" list holds the IDs of the accepted options (only one in single-choice mode). Using the orig_input_hash, you’ll be able to relate the annotations back to the examples in your dataset. You can also use your own ID system here. How you do it is really not that important – you only have to be able to find the examples again later on.

You can then filter the conflicting annotations from your dataset, iterate over your review annotations, find the example, apply the correct label, add it to your final training set and export the data.
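
As a rough sketch of that last step (the dataset names are placeholders, and conflicting_annotations is the grouping from the earlier snippet):

import srsly
from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("my_textcat_dataset")   # original annotations
review = db.get_dataset("conflicts_reviewed")     # answers from the review recipe

conflicting_hashes = {egs[0]["_input_hash"] for egs in conflicting_annotations}

# Map each reviewed input back to the label that was selected.
resolved = {
    eg["orig_input_hash"]: eg["accept"][0]
    for eg in review
    if eg.get("answer") == "accept" and eg.get("accept")
}

# Drop all conflicting annotations from the original data ...
final = [eg for eg in examples if eg["_input_hash"] not in conflicting_hashes]

# ... and add back one resolved example per reviewed input.
for egs in conflicting_annotations:
    input_hash = egs[0]["_input_hash"]
    if input_hash in resolved:
        eg = dict(egs[0])
        eg["label"] = resolved[input_hash]
        eg["answer"] = "accept"
        final.append(eg)

srsly.write_jsonl("final_training_set.jsonl", final)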
