Use previous annotations for new dataset

mmaguero · February 1, 2021, 12:17am

Hello,

I created a recipe with prodigy. Three annotators labeled the sentences (tweets) with 3 (of 12) possible labels (multi-label).

I used the db-out command to export their annotations (one file per session), so I have three JSONL files with the annotations. Now I need to annotate a new dataset (a CSV file): the same task, but with new unlabeled tweets. However, for each user session I want to exclude the data already annotated in the first round, since the new dataset may contain some duplicates. Is this possible in a simple way (a "prodigistic" way)?

By the way, how can I calculate inter-annotator agreement with Prodigy from the exported session datasets with labeled data?

Thank you very much!

ines · February 2, 2021, 3:34am

Hi! I think the easiest solution would be to use the --exclude argument with a comma-separated list of dataset or session dataset names (assuming the data you've exported also lives in your database).

If you're using a custom recipe, make sure you're setting the "exclude_by" config so it reflects how you want examples to be excluded: by_task will compare the task hashes (so you may be asked a different question about the same text, but never the same question twice) and by_input will compare the input hashes (so you're never asked about the same text twice).

If you want to do something more custom, you could also set up your own exclude logic in your custom recipe. You don't even necessarily need the complete examples that you previously annotated – just the hashes. In your stream, you can then use Prodigy's set_hashes to add hashes to the incoming examples, and check if they're already in your existing hashes. If so, you can skip them. This approach can also be helpful if you want to compile some extra stats – for example, you could keep a count of the duplicates and print the counts of new / skipped examples at the end of the annotation process.
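The custom exclude logic described above can be sketched roughly as follows. This is a minimal illustration and not Prodigy's own implementation: `input_hash` is a hypothetical stand-in for the `_input_hash` that `set_hashes` would assign, and it assumes each example has a "text" field. In a real recipe you would call `set_hashes` on the incoming examples and compare their `_input_hash` values against the ones in the exported JSONL.

```python
import hashlib
import json
from collections import Counter


def input_hash(example):
    # Hypothetical stand-in for the "_input_hash" that Prodigy's set_hashes
    # would add. Assumption: the input is identified by its "text" only.
    return hashlib.md5(example["text"].encode("utf-8")).hexdigest()


def load_seen_hashes(jsonl_lines):
    # Collect the hashes of all previously annotated examples.
    # Only the hashes are kept, not the full examples.
    seen = set()
    for line in jsonl_lines:
        line = line.strip()
        if line:
            seen.add(input_hash(json.loads(line)))
    return seen


def filter_stream(stream, seen, counts):
    # Yield only examples whose hash has not been seen before,
    # and keep a count of new / skipped examples.
    for eg in stream:
        h = input_hash(eg)
        if h in seen:
            counts["skipped"] += 1
            continue
        seen.add(h)  # also drops duplicates within the new data itself
        counts["new"] += 1
        yield eg


# Toy data standing in for one annotator's db-out export and the new CSV rows.
exported = [
    '{"text": "tweet one", "accept": ["A"]}',
    '{"text": "tweet two", "accept": ["B"]}',
]
new_data = [{"text": "tweet two"}, {"text": "tweet three"}, {"text": "tweet three"}]

counts = Counter()
seen = load_seen_hashes(exported)
fresh = list(filter_stream(new_data, seen, counts))
print(fresh, dict(counts))
```

Running one such filter per annotator, each with that annotator's own exported file, would give per-session exclusion.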

For the inter-annotator agreement, the underlying mechanism is actually pretty similar: using the input hashes, you can find all annotations on the same text. Each example contains the _session_id so you know which session it belongs to / who annotated it. So all you have to do is compare two annotations on the same input from different sessions. The only thing you have to decide is how you want to calculate your agreement and what you consider agreement: for example, for a text classification task, you could do this by label. If I annotate A and B, and you annotate B and C, you could consider this a 50% agreement (or 0%, if you only count identical label sets as agreement).
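As a rough sketch of that comparison, assuming the exported examples carry `_input_hash` and `_session_id`, and that the selected labels are stored as a list under "accept" (an assumption that depends on the interface your recipe uses): the function names below are made up for illustration, and the "partial" score is just one possible definition, chosen so that A+B versus B+C gives 50%.

```python
from collections import defaultdict
from itertools import combinations


def group_by_input(examples):
    # Map each input hash to {session id: set of selected labels}.
    by_input = defaultdict(dict)
    for eg in examples:
        labels = set(eg.get("accept", []))
        by_input[eg["_input_hash"]][eg["_session_id"]] = labels
    return by_input


def pair_agreement(a, b):
    # exact: 1.0 only if both label sets are identical.
    # partial: shared labels divided by the size of the larger set.
    exact = 1.0 if a == b else 0.0
    largest = max(len(a), len(b))
    partial = len(a & b) / largest if largest else 1.0
    return exact, partial


def mean_agreement(examples):
    # Average both scores over every pair of sessions that
    # annotated the same input.
    exact_scores, partial_scores = [], []
    for sessions in group_by_input(examples).values():
        for s1, s2 in combinations(sorted(sessions), 2):
            exact, partial = pair_agreement(sessions[s1], sessions[s2])
            exact_scores.append(exact)
            partial_scores.append(partial)
    n = len(exact_scores)
    if n == 0:
        return None  # no input was annotated by more than one session
    return sum(exact_scores) / n, sum(partial_scores) / n


# Toy annotations: two sessions, two inputs.
examples = [
    {"_input_hash": 1, "_session_id": "ds-ann1", "accept": ["A", "B"]},
    {"_input_hash": 1, "_session_id": "ds-ann2", "accept": ["B", "C"]},
    {"_input_hash": 2, "_session_id": "ds-ann1", "accept": ["A"]},
    {"_input_hash": 2, "_session_id": "ds-ann2", "accept": ["A"]},
]
result = mean_agreement(examples)
print(result)
```

A chance-corrected measure could be swapped in for `pair_agreement` if raw percentages turn out to be too optimistic for your label distribution.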

The review workflow might be relevant as well (see the Built-in Recipes page of the Prodigy documentation). It uses the hashes to merge all annotations on the same input, presents them to you with the session information and asks you for the final correct decision. Based on that data, you can then calculate whether the sessions agreed with each other, and ultimately, whether they agreed with you. You don't even necessarily have to do this for all the data – even a random sample every once in a while can give you some useful insights.

mmaguero · February 14, 2021, 7:20pm

Hi @ines, my apologies for the delay... Thank you very much for the pointers. With your answer, and inspired by this:

..and this:

...I started separate, isolated instances for each annotator (session). For every instance, I filtered the incoming examples against a unique identifier that I have in the meta field of each example.
I love the Prodigy support page, it was very useful for me.

For the agreement, I'll definitely try that and report back with my feedback!
