Hi,
I was creating annotations manually for my dataset which is in jsonl format.
I have a question here. Let's say I close my session and start again in a few hours. Does Prodigy make sure (in the new session) that it selects records which have not been annotated already? Thanks.
Hi! Prodigy will skip incoming examples that are already saved in the current dataset – so if you're starting a new session with the same dataset name, you should only see examples that haven't been annotated yet.
Under the hood, Prodigy uses hashes to determine whether an incoming example is the same question or a different question about the same data, and will filter accordingly, depending on the recipe. You can read more about the mechanism here: Loaders and Input Data · Prodigy
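For example, here's a minimal sketch of what that looks like if you call set_hashes on an example yourself (using the default key settings):

from prodigy import set_hashes

eg = {"text": "Berlin is a city", "label": "GPE"}
eg = set_hashes(eg)  # adds "_input_hash" and "_task_hash" to the example
print(eg["_input_hash"], eg["_task_hash"])  # these are used to detect duplicates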
Thanks Ines!
I used following command to create annotation:
prodigy ner.teach news_de_v1.0 de_core_news_lg ./news_de.json
When I restarted the same work the next day, the same questions appeared.
What did I do wrong?
Thanks, Diego
@ines Is there a way to make Prodigy not exclude duplicates? Let's say I have intentionally included two identical examples to later assess intra-coder agreement. In that case I would want to prevent Prodigy from excluding examples based on the _input_hash.
While it's a bit of a hack, you could include the annotator information in the task_hash. This way, each annotator would be part of the task definition. This comes with some downsides, because you cannot identify each task on its own anymore without the annotator info, but theoretically it'd do what you're asking for.
The task_hash is documented on our docs here. In case it's of interest, the difference between the input and task hash is explained in detail here:
If you're working on a custom recipe, you should be able to use the set_hashes function to get this behavior. In your case, I imagine it would look something like:
from prodigy import set_hashes
stream = (add_annotator_info(eg, annotator_name) for eg in stream)
stream = (set_hashes(eg, input_keys=("text",), task_keys=("label", "options", "annotator")) for eg in stream)
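Here, add_annotator_info would just be a small helper you define yourself, for example something like:

def add_annotator_info(eg, annotator_name):
    # attach the annotator's name to the task so it becomes part of the task hash
    eg["annotator"] = annotator_name
    return eg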
That said, I do want to stress that this is a bit of a hack. If you're going to be working with multiple annotators, you'd also want to have a system that can regulate the annotator overlap, and this sort of functionality is planned for the Prodigy Teams product. A thread on this product, which is still in development, can be found here:
@koaning thanks for the considerate reply Vincent! Currently, I have a single prodigy server running per annotator, so mediating between different annotators during labeling is not an issue.
I now rely on the following workaround (similar to your proposal):
- Add an additional field (called DUPL) to the source .jsonl file that indicates whether it is the first or second appearance of a given sentence. (I could maybe also have done this within the recipe, using a function similar to add_annotator_info; a quick sketch of this step is included after the code below.)
- Use this additional meta field to compute a new, unique hash per example:
import prodigy
from prodigy.components.loaders import JSONL

stream = JSONL(source)
stream = (prodigy.set_hashes(eg, input_keys=('text', 'DUPL')) for eg in stream)
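For reference, adding that DUPL field while preparing the source file can be a tiny preprocessing script, roughly like this (file names are placeholders):

import json

with open("news_source.jsonl") as f_in, open("news_with_dupl.jsonl", "w") as f_out:
    for line in f_in:
        eg = json.loads(line)
        for i in (1, 2):  # write each sentence twice, marking which appearance it is
            f_out.write(json.dumps({**eg, "DUPL": i}) + "\n")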
The way I see it, this solves the issue. Maybe two quick follow-ups:
- Is there any argument for integrating the additional meta field in the input_keys vs. the task_keys hash?
- Isn't the assessment of intra-coder reliability a common use case? Just wondering, because it seems like this is a somewhat hacky workaround that could be more easily integrated into Prodigy.
Thanks for taking the time and helping out!
Two short answers!
Question 1
One idea behind having input_keys and task_keys separately is that you can always add a new label later. You can, for example, start with two classes in a classification problem and easily add a third one later. If the input_keys were merged with the task_keys, you wouldn't be able to do that.
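A tiny example of that separation, assuming the default key settings:

from prodigy import set_hashes

eg1 = set_hashes({"text": "Great movie!", "label": "POSITIVE"})
eg2 = set_hashes({"text": "Great movie!", "label": "NEGATIVE"})
assert eg1["_input_hash"] == eg2["_input_hash"]  # same underlying input
assert eg1["_task_hash"] != eg2["_task_hash"]    # but a different question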
Hopefully, this argument also serves as a word of warning. While nothing is stopping you from adding whatever info you like, you should try to think about future changes that might be impacted. If the metadata really makes it a new task, or a new training example, then you can consider adding it. If not, you risk losing the ability to make each example unique later.
Question 2
Annotator agreement is indeed a common theme/problem in our space. If you're comfortable with Python you can always implement your own solution but features surrounding agreement are planned for Prodigy Teams.
If you'd like to write your own Python solution, you might get away with a groupby(input_key, task_key) to find instances of annotator disagreement. This assumes, however, that the labels/task never change.
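As a rough sketch, assuming you've loaded the db-out export into a list of dicts called annotations, the grouping could look like this:

from collections import defaultdict

groups = defaultdict(list)
for eg in annotations:  # annotations = list of dicts loaded from the db-out JSONL
    groups[(eg["_input_hash"], eg["_task_hash"])].append(eg)

# any group where the answers differ is a case of disagreement
for (input_hash, task_hash), egs in groups.items():
    answers = {eg["answer"] for eg in egs}
    if len(answers) > 1:
        print("disagreement on", input_hash, task_hash, answers)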
Actually! Just to check: are you aware of the session mechanic? The one explained here:
Here's what db-out would look like for a simple text classification use case.
{"text":"this is a single example yo","_input_hash":-465404500,"_task_hash":221834242,"label":"demo","_view_id":"classification","answer":"accept","_timestamp":1666776571,"_annotator_id":"issue-6042-foobar","_session_id":"issue-6042-foobar"}
{"text":"this is a single example yo","_input_hash":-465404500,"_task_hash":221834242,"label":"demo","_view_id":"classification","answer":"accept","_timestamp":1666776577,"_annotator_id":"issue-6042-vincent","_session_id":"issue-6042-vincent"}
That should give you access to _annotator_id as well. Isn't that what you'd want?
Currently, we are running individual Prodigy servers for each annotator, which is why the _annotator_id never occurs. Actually, I think I'll try reimplementing the workflow using multi-user sessions, which appears way more elegant.
However, my initial concern/issue was simply about intra-coder (instead of inter-coder) agreement, that is, how consistent a single annotator is over time; that's why I was trying to mix in duplicate inputs (for the same annotator) to assess that consistency. And I believe set_hashes offered me a nice and clean solution for that, by simply mixing in metadata about whether it is the first, second, or third appearance of a given sentence. That way, the same sentence wasn't filtered out by Prodigy.
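With that in place, checking the consistency afterwards is mostly a matter of grouping the exported annotations by text and comparing the answers across the DUPL appearances, something along these lines:

from collections import defaultdict

by_text = defaultdict(dict)
for eg in annotations:  # annotations = list of dicts loaded from the db-out export
    by_text[eg["text"]][eg["DUPL"]] = eg["answer"]

pairs = [answers for answers in by_text.values() if len(answers) > 1]
consistent = sum(1 for answers in pairs if len(set(answers.values())) == 1)
print(f"{consistent}/{len(pairs)} duplicated sentences were answered consistently")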
Long story short: The problem is solved! Thanks for providing context and helping out along the way!