By default, Prodigy doesn’t make any assumptions about this and will let you re-annotate the same task. But you can tell it to exclude annotations of existing datasets by setting --exclude dataset_name (or multiple comma-separated names). This is also very useful when creating evaluation sets.
The tasks are compared based on their hashes. When a new annotation task comes in, Prodigy assigns an "_input_hash" to the task, based on its content (by default, properties like "text"). When you run ner.teach, Prodigy will add "spans" to each task containing the entity you’re annotating. The input hash and the annotation features are then hashed again to create a "_task_hash", which is used to determine whether two annotation tasks are the same.
This means that Prodigy will exclude tasks asking the same question, but still allow different questions about the same text that you haven’t answered before. You can find more details on the hashing in the PRODIGY_README.html, for example in the API docs of the set_hashes helper function.
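
To make the two levels of hashing more concrete, here is a minimal sketch in plain Python. It is not Prodigy's actual implementation: the add_hashes and stable_hash helpers, the use of SHA-1 over JSON, and the choice of "spans" and "label" as annotation keys are all assumptions made for illustration. For real projects, use the set_hashes helper described in the README.

```python
import hashlib
import json


def stable_hash(value):
    """Hash any JSON-serializable value deterministically (illustrative only)."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def add_hashes(task, input_keys=("text",), task_keys=("spans", "label")):
    """Hypothetical helper: return a copy of the task with both hashes set."""
    hashed = dict(task)
    # The input hash only depends on the raw content, e.g. the text.
    input_part = {key: task[key] for key in input_keys if key in task}
    hashed["_input_hash"] = stable_hash(input_part)
    # The task hash combines the input hash with the annotation features,
    # so it identifies one specific question about that input.
    task_part = {key: task[key] for key in task_keys if key in task}
    hashed["_task_hash"] = stable_hash([hashed["_input_hash"], task_part])
    return hashed


text = "Apple hires Tim Cook"
org_question = add_hashes(
    {"text": text, "spans": [{"start": 0, "end": 5, "label": "ORG"}]}
)
person_question = add_hashes(
    {"text": text, "spans": [{"start": 12, "end": 20, "label": "PERSON"}]}
)
repeat_question = add_hashes(
    {"text": text, "spans": [{"start": 0, "end": 5, "label": "ORG"}]}
)

# Pretend the ORG question was answered in an excluded dataset.
already_annotated = {org_question["_task_hash"]}

for question in (person_question, repeat_question):
    skip = question["_task_hash"] in already_annotated
    print(question["spans"][0]["label"], "skip" if skip else "ask")
```

In this sketch, both questions about the same sentence share an input hash, but only the repeated ORG question matches a task hash that was already seen, so the PERSON question would still be asked.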