Hi,
I was creating annotations manually for my dataset which is in jsonl format.
I have a question here. Let's say I close my session and start again in a few hours. Does Prodigy make sure (in the new session) that it selects records which have not been annotated already? Thanks.
Hi! Prodigy will skip incoming examples that are already saved in the current dataset – so if you're starting a new session with the same dataset name, you should only see examples that haven't been annotated yet.
Under the hood, Prodigy uses hashes to determine whether an incoming example is the same question or a different question about the same data, and will filter accordingly, depending on the recipe. You can read more about the mechanism here: Loaders and Input Data · Prodigy
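For example, here's a minimal sketch of what that looks like if you call set_hashes on an example yourself (using the default key settings):

from prodigy import set_hashes

eg = {"text": "Berlin is a city", "label": "GPE"}
eg = set_hashes(eg)  # adds "_input_hash" and "_task_hash" to the example
print(eg["_input_hash"], eg["_task_hash"])  # these are used to detect duplicates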
Thanks Ines!
I used following command to create annotation:
prodigy ner.teach news_de_v1.0 de_core_news_lg ./news_de.json
When I restarted the same work the next day, the same questions appeared.
What did I do wrong?
Thanks, Diego
@ines Is there a way to make Prodigy not exclude duplicates? Let's say I have intentionally included two identical examples to later assess intra-coder agreement. In that case I would want to prevent Prodigy from excluding examples based on the _input_hash.
While it's a bit of a hack, you could include the annotator information in the task_hash. This way, each annotator would be part of the task definition. This comes with some downsides, because you cannot identify each task on its own anymore without the annotator info, but theoretically it'd do what you're asking for.
The task_hash is documented on our docs here. In case it's of interest, the difference between the input and task hash is explained in detail here:
If you're working on a custom recipe, you should be able to use the set_hashes function to get this behavior. In your case, I imagine it would look something like:
from prodigy import set_hashes
stream = (add_annotator_info(eg, annotator_name) for eg in stream)
stream = (set_hashes(eg, input_keys=("text",), task_keys=("label", "options", "annotator")) for eg in stream)
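Here, add_annotator_info would just be a small helper you define yourself, for example something like:

def add_annotator_info(eg, annotator_name):
    # attach the annotator's name to the task so it becomes part of the task hash
    eg["annotator"] = annotator_name
    return eg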
That said, I do want to stress that this is a bit of a hack. If you're going to be working with multiple annotators, you'd also want to have a system that can regulate the annotator overlap, and this sort of functionality is planned for the Prodigy Teams product. A thread on this product, which is still in development, can be found here:
@koaning thanks for the considerate reply Vincent! Currently, I have a single prodigy server running per annotator, so mediating between different annotators during labeling is not an issue.
I now rely on the following workaround (similar to your proposal):
- Add an additional field (called DUPL) to the source .jsonl file that indicates whether it is the first or second appearance of a given sentence. (I could maybe also have done this within the recipe, using a function similar to add_annotator_info; a quick sketch of this step is included after the code below.)
- Use this additional meta field to compute a new, unique hash per example:
import prodigy
from prodigy.components.loaders import JSONL

stream = JSONL(source)
stream = (prodigy.set_hashes(eg, input_keys=('text', 'DUPL')) for eg in stream)
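For reference, adding that DUPL field while preparing the source file can be a tiny preprocessing script, roughly like this (file names are placeholders):

import json

with open("news_source.jsonl") as f_in, open("news_with_dupl.jsonl", "w") as f_out:
    for line in f_in:
        eg = json.loads(line)
        for i in (1, 2):  # write each sentence twice, marking which appearance it is
            f_out.write(json.dumps({**eg, "DUPL": i}) + "\n")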
The way I see it, this solves the issue. Maybe two quick follow-ups:
- Is there any argument for integrating the additional meta field in the input_keys vs. the task_keys hash?
- Isn't the assessment of intra-coder reliability a common use case? Just wondering, because it seems like this is a somewhat hacky workaround that could be more easily integrated into Prodigy.
Thanks for taking the time and helping out!
Two short answers!
Question 1
One idea behind having input_keys and task_keys separately is that you can always add a new label later. You can, for example, start with two classes in a classification problem and easily add a third one later. If the input_keys were merged with the task_keys, you wouldn't be able to do that.
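A tiny example of that separation, assuming the default key settings:

from prodigy import set_hashes

eg1 = set_hashes({"text": "Great movie!", "label": "POSITIVE"})
eg2 = set_hashes({"text": "Great movie!", "label": "NEGATIVE"})
assert eg1["_input_hash"] == eg2["_input_hash"]  # same underlying input
assert eg1["_task_hash"] != eg2["_task_hash"]    # but a different question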
Hopefully, this argument also serves as a word of warning. While nothing is stopping you from adding whatever info you like, you should try to think about future changes that might be impacted. If the metadata really makes it a new task, or a new training example, then you can consider adding it. If not, you risk losing the ability to make each example unique later.
Question 2
Annotator agreement is indeed a common theme/problem in our space. If you're comfortable with Python you can always implement your own solution but features surrounding agreement are planned for Prodigy Teams.
If you'd like to write your own Python solution, you might get away with a groupby(input_key, task_key) to find instances of annotator disagreement. This assumes, however, that the labels/task never change.
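As a rough sketch, assuming you've loaded the db-out export into a list of dicts called annotations, the grouping could look like this:

from collections import defaultdict

groups = defaultdict(list)
for eg in annotations:  # annotations = list of dicts loaded from the db-out JSONL
    groups[(eg["_input_hash"], eg["_task_hash"])].append(eg)

# any group where the answers differ is a case of disagreement
for (input_hash, task_hash), egs in groups.items():
    answers = {eg["answer"] for eg in egs}
    if len(answers) > 1:
        print("disagreement on", input_hash, task_hash, answers)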
Actually! Just to check: are you aware of the session mechanic? The one explained here:
Here's what db-out would look like for a simple text classification use case.
{"text":"this is a single example yo","_input_hash":-465404500,"_task_hash":221834242,"label":"demo","_view_id":"classification","answer":"accept","_timestamp":1666776571,"_annotator_id":"issue-6042-foobar","_session_id":"issue-6042-foobar"}
{"text":"this is a single example yo","_input_hash":-465404500,"_task_hash":221834242,"label":"demo","_view_id":"classification","answer":"accept","_timestamp":1666776577,"_annotator_id":"issue-6042-vincent","_session_id":"issue-6042-vincent"}
That should give you access to _annotator_id as well. Isn't that what you'd want?
Currently, we are running individual Prodigy servers for each annotator, which is why the _annotator_id never occurs. Actually, I think I'll try reimplementing the workflow using multi-user sessions, which appears way more elegant.
However, my initial concern/issue was simply about intra-coder (instead of inter-coder) agreement, that is, how consistent a single annotator is over time; that's why I was trying to mix in duplicate inputs (for the same annotator) to assess that consistency. And I believe set_hashes offered me a nice and clean solution for that, by simply mixing in metadata about whether it is the first, second, or third appearance of a given sentence. That way, the same sentence wasn't filtered out by Prodigy.
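With that in place, checking the consistency afterwards is mostly a matter of grouping the exported annotations by text and comparing the answers across the DUPL appearances, something along these lines:

from collections import defaultdict

by_text = defaultdict(dict)
for eg in annotations:  # annotations = list of dicts loaded from the db-out export
    by_text[eg["text"]][eg["DUPL"]] = eg["answer"]

pairs = [answers for answers in by_text.values() if len(answers) > 1]
consistent = sum(1 for answers in pairs if len(set(answers.values())) == 1)
print(f"{consistent}/{len(pairs)} duplicated sentences were answered consistently")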
Long story short: The problem is solved! Thanks for providing context and helping out along the way!