Is the my_set dataset you're using with ner.make-gold the same one you created previously with ner.manual? And if I understand it correctly, you only want to annotate examples from sentences.csv that aren't already in the dataset and weren't previously annotated with ner.manual?
I think what's happening here is that the exclude logic compares examples based on their task hashes – essentially an identifier derived from the input text plus any pre-set spans, labels etc. Because ner.manual starts with no entities and ner.make-gold might suggest some, the hashes will be different.
Sorry if this sounds a little abstract – here's an example: let's say you run ner.manual and the first example is the sentence "I use Prodigy" with no entity spans. When the example comes in, it will receive the hash 123. When you then start a new session with ner.make-gold, the model might predict something and add a span to the text. So the sentence "I use Prodigy" comes in and the model predicts "Prodigy" as a product entity. Based on the text and the entity span, that task will receive the hash 456. As far as Prodigy is concerned, this example is different from the one with hash 123 that already exists in the set, so it's not skipped.
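To make the two kinds of hashes concrete, here's a minimal sketch in plain Python. The get_hash function below is a hypothetical stand-in for illustration only – Prodigy's real hashing (set_hashes) may differ in detail – but it shows why the input hash matches while the task hash doesn't:

```python
import hashlib
import json

def get_hash(task: dict, keys: tuple) -> int:
    # Hash only the selected keys of the task, if present.
    # Hypothetical sketch – not Prodigy's actual implementation.
    values = {key: task[key] for key in keys if key in task}
    digest = hashlib.md5(json.dumps(values, sort_keys=True).encode("utf8")).hexdigest()
    return int(digest[:8], 16)

manual_task = {"text": "I use Prodigy"}  # from ner.manual: no spans
gold_task = {"text": "I use Prodigy",    # from ner.make-gold: model suggested a span
             "spans": [{"start": 6, "end": 13, "label": "PRODUCT"}]}

# The input hash only considers the raw input text, so it's identical ...
assert get_hash(manual_task, ("text",)) == get_hash(gold_task, ("text",))
# ... while the task hash also covers the pre-set spans, so it differs.
assert get_hash(manual_task, ("text", "spans")) != get_hash(gold_task, ("text", "spans"))
```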
If you're running a recipe like ner.teach that asks you for feedback on suggestions, this makes a lot of sense: you don't want to see the same text + entity suggestion again if you've already annotated it, but you do want to see a different entity suggestion on the same text, because that's a completely different question.
Solution idea (partly a note to self): if you're creating gold-standard data, the situation is different and you usually only want to annotate the same input once. To solve this, we could consider adding an exclude_type config option that's either "task" (default) or "input" and specifies whether tasks should be excluded based on the task hash or the input hash. Recipes that allow manual editing could default to "input".
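If that option were added, the setting might look something like this in your prodigy.json – note that exclude_type is a proposal in this post, not an existing config key:

```json
{
  "exclude_type": "input"
}
```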
In the meantime, you could use the filter_inputs helper to add your own filter that removes all examples from the stream that have the same input hash as examples already in your dataset – i.e. examples with the same text:
from prodigy.components.db import connect
from prodigy.components.filters import filter_inputs

db = connect()  # uses settings from prodigy.json
input_hashes = db.get_input_hashes('my_set')  # input hashes of all examples in the set

# at the end of the recipe function, wrap the stream
stream = filter_inputs(stream, input_hashes)
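Under the hood, the filtering logic amounts to something like the generator below. This is a simplified stand-in for filter_inputs, assuming each incoming task already carries an "_input_hash" key (as Prodigy tasks do once they've been hashed):

```python
def filter_inputs_sketch(stream, input_hashes):
    # Skip any task whose input hash is already in the dataset.
    # Simplified sketch of what filter_inputs does, for illustration.
    seen = set(input_hashes)
    for eg in stream:
        if eg["_input_hash"] not in seen:
            yield eg

stream = [{"text": "I use Prodigy", "_input_hash": 123},
          {"text": "Hello world", "_input_hash": 789}]
# Pretend hash 123 is already in the dataset:
filtered = list(filter_inputs_sketch(stream, {123}))
# Only the example with a new input hash ("Hello world") survives.
```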