Continue bert.ner.manual annotation where I left off

I am trying to annotate some text with the custom recipe bert.ner.manual. After I do a couple of annotations, save them, and stop and restart the Prodigy server, it does not resume the annotations from where I left off, but starts from scratch again.

I looked at the ner.manual recipe and can't find anything special about how it does that, so how can I make the custom recipe bert.ner.manual behave the same way?

I tried the solution from this thread: --exclude is not working for ner.make-gold on same dataset, but it did not work, and it seems outdated anyway.

Hi @thondeboer!

Thanks for your message and welcome to the Prodigy community :wave:

It's a bit hard to know without seeing the recipe, but it sounds like hashing isn't working correctly.

One simple way to test would be to set "exclude_by": "input" in your configuration (either in prodigy.json, as an override, or returned from your recipe). This will exclude examples by their input hash instead of their task hash.
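For example, here's a minimal sketch of setting it in the components dictionary returned by a custom recipe (the dataset, stream, and labels variables are assumed to already exist in your recipe):

def my_recipe(dataset, source, labels):
    ...
    return {
        "dataset": dataset,        # dataset to save annotations to
        "stream": stream,          # the (hashed) stream of examples
        "view_id": "ner_manual",
        "config": {
            "labels": labels,
            "exclude_by": "input",  # exclude seen examples by input hash, not task hash
        },
    }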

But another possibility is that you're not even hashing to begin with.

Like bert.ner.manual does, are you loading your input source like this:

stream = get_stream(source, loader=loader, input_key="text")

Can you instead try adding rehash=True and dedup=True (dedup isn't strictly needed, but it's the default behavior):

stream = get_stream(source, loader=loader, input_key="text", rehash = True, dedup = True)

This may be an oversight in that recipe. Typically, built-in recipes call get_stream with the arguments rehash=True and dedup=True.

Alternatively, you could add hashes yourself using set_hashes, which is what the rehash=True flag does under the hood.
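For example, a minimal sketch (assuming your stream yields dictionaries with a "text" key):

from prodigy import set_hashes

stream = get_stream(source, loader=loader, input_key="text")
# Assign _input_hash and _task_hash to every example so Prodigy can
# skip tasks that are already answered in the dataset
stream = (set_hashes(eg) for eg in stream)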


Hi, I am using the recipe as provided in the prodigy-recipes repo on GitHub (prodigy-recipes/transformers_tokenizers.py at master · explosion/prodigy-recipes · GitHub).

It does not use the rehash=True and dedup=True options, and adding them indeed solved the issue. It now correctly starts where I left off.

I did not see those options in the default ner.manual recipe on GitHub either, since it uses stream = JSONL(source), which I am guessing already does the filtering itself.