--exclude is not working for ner.make-gold on same dataset

Hi - it appears --exclude is not working for ner.make-gold when I pass the same dataset to --exclude that I am annotating. It does not exclude examples that are already in the passed dataset. Normally, I use this method with ner.manual to continue annotating where I left off without re-annotating the same examples.

Works:

prodigy ner.manual my_set my_model sentences.csv --exclude my_set

Does not work:
prodigy ner.make-gold my_set my_model sentences.csv --exclude my_set

Is the my_set dataset you’re using with ner.make-gold the same one you created previously with ner.manual? And if I understand it correctly, you only want to annotate examples from sentences.csv that aren’t already in the dataset and previously annotated with ner.manual?

I think what’s happening here is that the exclude logic compares examples based on the task hashes – so basically, the identifier of the input text plus the pre-set spans, labels etc. if available. Because ner.manual starts with no entities and ner.make-gold might suggest some, the hashes will be different.

Sorry if this sounds a little abstract – here’s an example: Let’s say you run ner.manual and the first example is the sentence “I use Prodigy” with no entity spans. When the example comes in, it will receive the hash 123. When you start a new session with ner.make-gold, the model might predict something and add a span to the text. So the sentence “I use Prodigy” comes in and the model predicts “Prodigy” as a product entity. Based on the text and the entity span, that task will receive the hash 456. So as far as Prodigy is concerned, this example is different from the one with the hash 123 that already exists in the set, so it’s not skipped.
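To make the hash example a bit more concrete, here is a minimal sketch of the idea in plain Python. It does not use Prodigy's actual hashing: the helpers `input_hash` and `task_hash` are hypothetical stand-ins, and the real hash values and algorithm will differ.

```python
import hashlib
import json


def _digest(payload):
    # Serialize deterministically so equal content always gives an equal hash.
    data = json.dumps(payload, sort_keys=True).encode("utf8")
    return hashlib.md5(data).hexdigest()


def input_hash(task):
    # Hypothetical stand-in: depends only on the raw input text.
    return _digest({"text": task["text"]})


def task_hash(task):
    # Hypothetical stand-in: depends on the input text plus any pre-set spans.
    return _digest({"text": task["text"], "spans": task.get("spans", [])})


# The same sentence as seen by ner.manual (no spans) and by ner.make-gold
# (the model suggested "Prodigy" as a PRODUCT entity).
manual_task = {"text": "I use Prodigy"}
gold_task = {
    "text": "I use Prodigy",
    "spans": [{"start": 6, "end": 13, "label": "PRODUCT"}],
}

same_input = input_hash(manual_task) == input_hash(gold_task)
same_task = task_hash(manual_task) == task_hash(gold_task)
print("same input hash:", same_input)
print("same task hash:", same_task)
```

Under these assumptions the two tasks share an input hash but get different task hashes, which is why an exclude check based on task hashes does not skip the second one.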

If you’re running a recipe like ner.teach that asks you for feedback on suggestions, this makes a lot of sense because you don’t want to see text + entity suggestion again if you’ve already annotated it. But you do want to see a different entity suggestion on the same text, because that’s a completely different question.

Solution idea: (partly note to self) If you create gold-standard data, this is different and you usually want to only annotate the same input once. To solve this, we could consider adding an exclude_type config option that’s either "task" (default) or "input" and specifies whether tasks should be excluded based on the task hash or input hash. Recipes that allow manual editing could default to "input".
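As a rough illustration of how such an option could behave, here is a sketch in plain Python. The function `filter_seen` and its `exclude_type` argument are hypothetical and not part of Prodigy; the sketch only assumes that each task carries `_input_hash` and `_task_hash` keys, and the hash values are made up.

```python
def filter_seen(stream, seen_hashes, exclude_type="task"):
    # Hypothetical helper: skip examples whose hash was already annotated.
    # "task" compares the task hash, "input" compares the input hash.
    if exclude_type not in ("task", "input"):
        raise ValueError("exclude_type must be 'task' or 'input'")
    key = "_task_hash" if exclude_type == "task" else "_input_hash"
    seen = set(seen_hashes)
    for eg in stream:
        if eg[key] not in seen:
            yield eg


# Already annotated with no spans (made-up hash values).
annotated = [{"text": "I use Prodigy", "_input_hash": 1, "_task_hash": 123}]

# Incoming stream: same text with a suggested span, plus a new text.
incoming = [
    {"text": "I use Prodigy", "_input_hash": 1, "_task_hash": 456},
    {"text": "I use spaCy", "_input_hash": 2, "_task_hash": 789},
]

by_task = list(
    filter_seen(incoming, [eg["_task_hash"] for eg in annotated], "task")
)
by_input = list(
    filter_seen(incoming, [eg["_input_hash"] for eg in annotated], "input")
)
print([eg["text"] for eg in by_task])
print([eg["text"] for eg in by_input])
```

With "task", both incoming examples pass through because neither task hash has been seen; with "input", the sentence that was already annotated is dropped and only the new text remains.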

In the meantime, you could use the filter_inputs helper to add your own filter that removes all examples from the stream that have the same input hash as examples already in your dataset – so basically, examples with the same text:

from prodigy.components.db import connect
from prodigy.components.filters import filter_inputs

db = connect()  # uses settings from prodigy.json
input_hashes = db.get_input_hashes('my_set')  # get hashes

# at the end of the recipe function
stream = filter_inputs(stream, input_hashes)

Thank you! This worked like a charm. For anyone else who needs it, here is the complete recipe, which I called extend-gold. It allows you to continue annotating a gold corpus where you left off, but to use a model to help speed up the annotation process.

import prodigy
from prodigy import recipe, recipe_args
from prodigy.util import log
from prodigy.recipes.ner import make_gold
from prodigy.components.db import connect
from prodigy.components.filters import filter_inputs

def extend_gold(dataset, spacy_model, source=None, api=None, loader=None,
                label=None, exclude=None, unsegmented=False):
    """Create gold data for NER by correcting a model's suggestions."""
    log("RECIPE: Starting recipe ner.extend-gold", locals())

    # Reuse the built-in ner.make-gold recipe to build the components
    result = make_gold(dataset, spacy_model, source=source, api=api,
                       loader=loader, label=label, exclude=exclude,
                       unsegmented=unsegmented)

    log("RECIPE: filtering inputs", locals())
    db = connect()  # uses settings from prodigy.json
    input_hashes = db.get_input_hashes(dataset)  # get hashes
    # remove any inputs which have already been annotated
    result['stream'] = filter_inputs(result['stream'], input_hashes)
    return result

Hi! Is there any update on the exclude_type flag? That would be very useful :smile:
