Use previous annotations for new dataset


I created a recipe with Prodigy. Three annotators labeled the sentences (tweets) with 3 (of 12) possible labels (multi-label).

I used the db-out command to export their annotations (one file per session), so I have three JSONL files with the annotations. Now I need to annotate a new dataset (a CSV file): the same task, but with new unlabeled tweets. However, for each user session I want to exclude the data already annotated in the first round (the new dataset may contain some duplicates). Is this possible in a simple way (a prodigistic way :wink:)?

By the way, how can I use Prodigy to calculate the inter-annotator agreement from the exported session datasets with labeled data?

Thank you very much!

Hi! I think the easiest solution would be to use the --exclude argument with a comma-separated list of dataset or session dataset names (assuming the data you've exported also lives in your database).

If you're using a custom recipe, make sure you're setting the "exclude_by" config so it reflects how you want examples to be excluded: by_task will compare the task hashes (so you may be asked a different question about the same text, but never the same question) and by_input will compare the input hashes (so you're never asked about the same text twice).

If you want to do something more custom, you could also set up your own exclude logic in your custom recipe. You don't even necessarily need the complete examples that you previously annotated – just the hashes. In your stream, you can then use Prodigy's set_hashes to add hashes to the incoming examples, and check if they're already in your existing hashes. If so, you can skip them. This approach can also be helpful if you want to compile some extra stats – for example, you could keep a count of the duplicates and print the counts of new / skipped examples at the end of the annotation process.
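As a rough illustration of the custom exclude logic described above, here is a minimal sketch in plain Python. It deliberately does not call Prodigy: the `input_hash` helper is a hypothetical stand-in for the input hash that `set_hashes` would add to each example, and the function names, the `"text"` key and the `stats` dict are all assumptions made for this example.

```python
import hashlib
import json


def input_hash(example):
    # Hypothetical stand-in for the input hash Prodigy's set_hashes would add.
    # Here we simply hash the raw text of the example.
    return hashlib.sha1(example["text"].encode("utf-8")).hexdigest()


def load_seen_hashes(jsonl_lines):
    # Collect the hashes of previously annotated examples (one JSON object per line).
    seen = set()
    for line in jsonl_lines:
        line = line.strip()
        if line:
            seen.add(input_hash(json.loads(line)))
    return seen


def filter_stream(stream, seen, stats):
    # Yield only examples whose hash has not been seen, and count new / skipped.
    for eg in stream:
        h = input_hash(eg)
        if h in seen:
            stats["skipped"] += 1
            continue
        seen.add(h)  # also drops duplicates inside the new dataset itself
        stats["new"] += 1
        yield eg


previous = ['{"text": "already annotated tweet"}']
incoming = [{"text": "already annotated tweet"}, {"text": "a new tweet"}]
stats = {"new": 0, "skipped": 0}
fresh = list(filter_stream(incoming, load_seen_hashes(previous), stats))
print(fresh, stats)
```

In a real recipe you would wrap your loaded stream with something like `filter_stream` before returning it, using one set of hashes per annotator so each session only skips what that annotator has already seen.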

As for the inter-annotator agreement, the underlying mechanism is actually pretty similar: using the input hashes, you can find all annotations on the same text. Each example contains the _session_id so you know which session it belongs to / who annotated it. So all you have to do is compare two annotations on the same input from different sessions. The only thing you have to decide is how you want to calculate your agreement and what you consider agreement: for example, for a text classification task, you could do this by label. If I annotate A and B, and you annotate B and C, you could consider this a 50% agreement (or 0%).
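To make the "50% or 0%" choice concrete, here is a small sketch of one way to score agreement per input. It assumes each exported example carries `_input_hash` and `_session_id`, and that the selected labels live in an `accept` list; that last key and all function names are assumptions for illustration, so adjust them to whatever your recipe actually stores. The score is raw pairwise agreement, not a chance-corrected statistic.

```python
from collections import defaultdict
from itertools import combinations


def group_by_input(examples):
    # Map each input hash to {session id: set of labels chosen in that session}.
    grouped = defaultdict(dict)
    for eg in examples:
        grouped[eg["_input_hash"]][eg["_session_id"]] = set(eg.get("accept", []))
    return grouped


def pair_agreement(labels_a, labels_b, exact=False):
    # exact=True: agreement only if both label sets are identical.
    if exact:
        return 1.0 if labels_a == labels_b else 0.0
    # Otherwise: shared labels divided by the size of the larger label set.
    if not labels_a and not labels_b:
        return 1.0
    return len(labels_a & labels_b) / max(len(labels_a), len(labels_b))


def mean_agreement(examples, exact=False):
    # Average the score over every pair of sessions that annotated the same input.
    scores = []
    for by_session in group_by_input(examples).values():
        for a, b in combinations(sorted(by_session), 2):
            scores.append(pair_agreement(by_session[a], by_session[b], exact))
    return sum(scores) / len(scores) if scores else None


examples = [
    {"_input_hash": 1, "_session_id": "me", "accept": ["A", "B"]},
    {"_input_hash": 1, "_session_id": "you", "accept": ["B", "C"]},
]
print(mean_agreement(examples))              # partial overlap
print(mean_agreement(examples, exact=True))  # strict matching
```

With the A/B versus B/C example from above, the overlap-based score treats the shared label B as half agreement, while the exact-match variant counts the pair as a disagreement.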

The review workflow might be relevant as well: it uses the hashes to merge all annotations on the same input, presents them to you with the session information and asks you for the final correct decision. Based on that data, you can then calculate whether the sessions agreed with each other, and ultimately, whether they agreed with you. You don't even necessarily have to do this for all the data – even a random sample every once in a while can give you some useful insights.


Hi @ines, my apologies for the delay... Thank you very much for the pointers. With your answer, and inspired by this:

..and this:

...I started separate, isolated instances for each annotator (session). For every instance, I filtered on a unique identifier stored in the meta field of each example, which let me compare the incoming examples against the ones already annotated.
I love the Prodigy support page; it was very useful for me :slight_smile:

For the agreement, I’ll definitely try that... and report back!