Exclude not functioning / duplicate tasks

I'm trying to use the ner.teach recipe with --exclude to annotate a very large dataset over several sessions without duplicating annotation tasks. For some reason, whenever I start a new session, --exclude fails to exclude the examples I've already annotated in a previous dataset.

It could be that I'm misunderstanding how --exclude or ner.teach should work, so I'll explain what I expect.

Here's some test data that reproduces the problem:

{"text":" Did you get a gnarly headache?","meta":{"id":"em6li1v"}}
{"text":" I still get miserable insomnia (unavoidable due to the spiked Dopamine levels) and the occasional aches and lethargy, but other than that I have virtually no withdrawals.","meta":{"id":"em6kz4v"}}

I have patterns matching symptoms as follows:

{"label": "SYMPTOM", "pattern": [{"LOWER": "headache"}]}
{"label": "SYMPTOM", "pattern": [{"LOWER": "insomina"}]}
{"label": "SYMPTOM", "pattern": [{"LOWER": "aches"}]}
{"label": "SYMPTOM", "pattern": [{"LOWER": "lethargy"}]}

So I start up with the following command:

prodigy ner.teach test_1 en_core_web_lg prodigy_data/test.jsonl --label SYMPTOM --patterns prodigy_data/patterns.jsonl

It gives me 4 tasks and then says there's nothing else in the queue.

I save the annotations, then run the following command, which I would expect to tell me that there's nothing in the queue, since I've annotated all the examples and am now excluding the test_1 dataset:

prodigy ner.teach test_2 en_core_web_lg prodigy_data/test.jsonl --label SYMPTOM --patterns prodigy_data/patterns.jsonl --exclude test_1

But I'm back at the start, with the first example from the source.

I would expect this to say "no tasks available" since I told it to exclude what's already been annotated from the test_1 dataset. Anyway, for further experimentation I completed this session and saved the results in dataset test_2.

Running db-out on both datasets tells me that the examples have the same _input_hash and _task_hash as well.

Running prodigy db-merge test_1,test_2 test_3 doesn't combine any of the duplicate examples, and then prodigy db-out test_3 test_3_out tells me that there are 8 examples in the merged dataset (I expected 4).

Some other posts I've looked at to try and figure this out:

I'm a little stumped here. Is this the intended behavior of --exclude and I'm missing part of the wider annotation process? Are the patterns overriding examples that would be excluded? Is it an issue with the hashing logic?

For more context, my actual use case is that I have a very large source dataset I want to annotate over several sessions, starting with ner.teach, and I don't want to repeat an annotation for an example I've already seen.

Edit: I'm on Prodigy 1.10.1 on macOS.

Hey, thanks a lot for the super detailed report! :pray:

Before looking into this in more detail, I can definitely confirm what the intended behaviour is: ner.teach excludes by task hash, so two examples with the same task hash are considered duplicates and you should never be asked about them twice. The task hash is based on the text and the span/label, so you may be asked about different suggestions on the same text, but never about the same text + span + label combination. If an incoming example has the same task hash as an example in one of the excluded sets (via --exclude or the current dataset) and it's still presented to you, that's a bug.

Other recipes, mostly the manual ones like ner.manual, exclude by input (via the "exclude_by": "input" config setting) because the assumption here is that you want to create one gold-standard annotation for each text and don't want to see the same text again, even if it comes in later with different pre-highlighted suggestions.
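To make the two exclusion modes concrete, here is a minimal sketch in plain Python. It does not use Prodigy's own hashing: input_hash, task_hash, and filter_stream are hypothetical helpers that only mimic the idea described above, where the input hash depends on the text alone and the task hash on the text plus the suggested spans and labels.

```python
import hashlib
import json


def input_hash(task):
    """Hash of the raw text only (illustrative, not Prodigy's real hash)."""
    return hashlib.md5(task["text"].encode("utf8")).hexdigest()


def task_hash(task):
    """Hash of the text plus suggested spans/labels (illustrative only)."""
    spans = [(s["start"], s["end"], s["label"]) for s in task.get("spans", [])]
    payload = json.dumps([task["text"], sorted(spans)])
    return hashlib.md5(payload.encode("utf8")).hexdigest()


def filter_stream(stream, annotated, exclude_by="task"):
    """Yield only tasks whose hash is not among the already annotated ones."""
    hash_fn = task_hash if exclude_by == "task" else input_hash
    seen = {hash_fn(t) for t in annotated}
    for task in stream:
        if hash_fn(task) not in seen:
            yield task


text = " Did you get a gnarly headache?"
done = [{"text": text, "spans": [{"start": 22, "end": 30, "label": "SYMPTOM"}]}]
incoming = [
    # Same text + span + label as the annotated task: a duplicate in both modes.
    {"text": text, "spans": [{"start": 22, "end": 30, "label": "SYMPTOM"}]},
    # Same text, different suggestion: a duplicate only when excluding by input.
    {"text": text, "spans": [{"start": 15, "end": 21, "label": "SYMPTOM"}]},
]

by_task = list(filter_stream(incoming, done, exclude_by="task"))
by_input = list(filter_stream(incoming, done, exclude_by="input"))
print(len(by_task), len(by_input))
```

Excluding by task lets the second suggestion through because its span differs, while excluding by input drops both because the text has already been seen.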

This thread made me a little suspicious about the --exclude option with a separate dataset. Nothing really changed around this, though, so I'm not entirely sure where the problem would be :thinking: But it's probably the first thing we should double-check.

db-merge currently only appends and doesn't do any hashing/filtering/combining (that's only done during training and when you run data-to-spacy). So if you have 4 examples + 4 duplicates, you'll end up with one set of 8 examples.
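Since the merge only appends, an exported dataset can be deduplicated after the fact. The sketch below assumes that each line of the db-out export is a JSON object carrying a _task_hash key, as reported earlier in this thread. dedupe_by_hash is a hypothetical helper, not part of Prodigy, and the hash values in the toy data are made up.

```python
import json


def dedupe_by_hash(lines, key="_task_hash"):
    """Keep the first record for each hash value, preserving order."""
    seen = set()
    unique = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record[key] in seen:
            continue  # a record with this hash was already kept
        seen.add(record[key])
        unique.append(record)
    return unique


# Toy stand-in for a merged export: 4 tasks followed by the same 4 again.
session = [
    {"text": f"example {i}", "_task_hash": 1000 + i, "answer": "accept"}
    for i in range(4)
]
merged_lines = [json.dumps(record) for record in session + session]

unique = dedupe_by_hash(merged_lines)
print(len(merged_lines), "->", len(unique))
```

With a real export, the lines would come from iterating over the JSONL file that db-out wrote. Passing key="_input_hash" would instead keep one record per text.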

That makes sense, thanks for explaining.

One other question came up when I was thinking about why I was using --exclude in the first place: when should I expect to re-annotate an example that already exists in the dataset? For example, take the following set of commands:

prodigy ner.teach test_dataset en_core_web_lg prodigy_data/test.jsonl --label SYMPTOM --patterns prodigy_data/patterns.jsonl

Complete the first annotation example. Save annotations. Restart with the same dataset.

prodigy ner.teach test_dataset en_core_web_lg prodigy_data/test.jsonl --label SYMPTOM --patterns prodigy_data/patterns.jsonl

Should I expect to re-annotate the same first example? This is currently what happens. Based on other explanations of the use of --exclude, I think (though not very confidently) this is expected behavior because Prodigy isn't comparing the incoming example stream against the existing dataset. Is that right? Or should it be able to detect that an annotation in the dataset is the same task as an incoming example?

Hey @pmbaumgartner,

You shouldn't need to re-annotate the same first example. I think this might be related to another thread. If you're willing to try out a beta version of a fix to see if it covers your use-case, send me an email at justin@explosion.ai. Thanks!

Thanks @justindujardin. Email just sent with the subject "Prodigy Beta Test for Duplicate Annotation Bug".

@justindujardin I was able to replicate the experiment above with the beta version and get the expected results. That is, I ran ner.teach to create a dataset called test_a, annotated 4 examples (the whole dataset), saved, and then reran ner.teach with a dataset called test_b, while now passing --exclude test_a, and the already annotated examples weren't present.

Additionally, I tested running the command on test_a again, without excluding anything, and, as expected, it told me there were no tasks.

In summary, the beta version seems to have fixed --exclude as well as the issue where it wouldn't detect already annotated examples in the same source dataset.

Just released v1.10.2, which should fix this! :tada:
