--exclude is not working for ner.make-gold on same dataset

Is the my_set dataset you're using with ner.make-gold the same one you created previously with ner.manual? And if I understand it correctly, you only want to annotate examples from sentences.csv that aren't already in the dataset and previously annotated with ner.manual?

I think what's happening here is that the exclude logic compares examples based on their task hashes – basically, an identifier computed from the input text plus the pre-set spans, labels etc., if available. Because ner.manual starts with no entities and ner.make-gold might suggest some, the hashes will be different.

Sorry if this sounds a little abstract – here's an example: Let's say you run ner.manual and the first example is the sentence "I use Prodigy" with no entity spans. When the example comes in, it will receive the hash 123. When you start a new session with ner.make-gold, the model might predict something and add a span to the text. So the sentence "I use Prodigy" comes in and the model predicts "Prodigy" as a product entity. Based on the text and the entity span, that task will receive the hash 456. So as far as Prodigy is concerned, this example is different from the one with the hash 123 that already exists in the set, so it's not skipped.
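
If it helps to see this in code, here's a minimal sketch using Prodigy's set_hashes helper – the span offsets and the PRODUCT label are just made up for illustration:

from prodigy import set_hashes

# what ner.manual sends out: just the text, no spans
manual_task = set_hashes({"text": "I use Prodigy"})

# what ner.make-gold might send out: the same text plus a suggested span
gold_task = set_hashes({
    "text": "I use Prodigy",
    "spans": [{"start": 6, "end": 13, "label": "PRODUCT"}],
})

assert manual_task["_input_hash"] == gold_task["_input_hash"]  # same input text
assert manual_task["_task_hash"] != gold_task["_task_hash"]    # different tasks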

If you're running a recipe like ner.teach that asks you for feedback on suggestions, this makes a lot of sense, because you don't want to see the same text + entity suggestion again if you've already annotated it. But you do want to see a different entity suggestion on the same text, because that's a completely different question.

Solution idea (partly a note to self): If you're creating gold-standard data, this is different and you usually only want to annotate the same input once. To solve this, we could consider adding an exclude_type config option that's either "task" (default) or "input" and specifies whether tasks should be excluded based on the task hash or the input hash. Recipes that allow manual editing could then default to "input".

In the meantime, you could use the filter_inputs helper to add your own filter that removes all examples from the stream that have the same input hash as examples already in your dataset – so basically, examples with the same text:

from prodigy.components.db import connect
from prodigy.components.filters import filter_inputs

db = connect()  # uses settings from prodigy.json
input_hashes = db.get_input_hashes('my_set')  # input hashes of all examples already in the dataset

# at the end of the recipe function, before the stream is returned
stream = filter_inputs(stream, input_hashes)  # skip examples whose text is already annotated
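
And just for context, here's a very stripped-down sketch of how this could look as part of a custom recipe. The recipe name and the CSV loader are placeholders – in practice you'd copy the rest of the logic (adding tokens, labels, model suggestions etc.) from the built-in recipe:

import prodigy
from prodigy.components.db import connect
from prodigy.components.loaders import CSV
from prodigy.components.filters import filter_inputs

@prodigy.recipe("ner.make-gold-filtered")  # hypothetical recipe name
def make_gold_filtered(dataset, source):
    db = connect()
    input_hashes = db.get_input_hashes(dataset)  # texts already in the dataset
    stream = CSV(source)  # e.g. sentences.csv
    # ... tokenization, model suggestions etc. would go here ...
    stream = filter_inputs(stream, input_hashes)  # drop examples with a known input hash
    return {"dataset": dataset, "stream": stream, "view_id": "ner_manual"}

You'd then run it with something like prodigy ner.make-gold-filtered my_set sentences.csv -F recipe.py.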