I'm trying to use the ner.teach recipe with --exclude to annotate a very large dataset over several sessions without duplicating annotation tasks. For some reason, whenever I start a new session, --exclude fails to exclude annotations I've done in a previously annotated dataset.
It could be I'm misunderstanding how exclude or teach should work, but I'll explain what I expect.
Here's some test data that can replicate this example:
{"text":" Did you get a gnarly headache?","meta":{"id":"em6li1v"}}
{"text":" I still get miserable insomnia (unavoidable due to the spiked Dopamine levels) and the occasional aches and lethargy, but other than that I have virtually no withdrawals.","meta":{"id":"em6kz4v"}}
I have patterns matching symptoms as follows:
{"label": "SYMPTOM", "pattern": [{"LOWER": "headache"}]}
{"label": "SYMPTOM", "pattern": [{"LOWER": "insomina"}]}
{"label": "SYMPTOM", "pattern": [{"LOWER": "aches"}]}
{"label": "SYMPTOM", "pattern": [{"LOWER": "lethargy"}]}
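As a sanity check on the patterns above, here is a minimal sketch that lists which single-token LOWER terms never occur in the test texts. It uses a plain regex split rather than spaCy's tokenizer, so it only approximates what the matcher sees; the helper names (simple_tokens, single_token_terms, unmatched_terms) are hypothetical and not part of Prodigy.

```python
import json
import re

# The two test texts and four patterns from this post.
TEXTS = [
    " Did you get a gnarly headache?",
    " I still get miserable insomnia (unavoidable due to the spiked Dopamine "
    "levels) and the occasional aches and lethargy, but other than that I have "
    "virtually no withdrawals.",
]
PATTERN_LINES = [
    '{"label": "SYMPTOM", "pattern": [{"LOWER": "headache"}]}',
    '{"label": "SYMPTOM", "pattern": [{"LOWER": "insomina"}]}',
    '{"label": "SYMPTOM", "pattern": [{"LOWER": "aches"}]}',
    '{"label": "SYMPTOM", "pattern": [{"LOWER": "lethargy"}]}',
]


def simple_tokens(text):
    """Lowercased word tokens; an approximation, not spaCy's tokenization."""
    return [tok.lower() for tok in re.findall(r"\w+", text)]


def single_token_terms(pattern_lines):
    """Collect the LOWER value of every one-token pattern."""
    terms = []
    for line in pattern_lines:
        tokens = json.loads(line)["pattern"]
        if len(tokens) == 1 and "LOWER" in tokens[0]:
            terms.append(tokens[0]["LOWER"])
    return terms


def unmatched_terms(terms, texts):
    """Return the terms that appear in none of the texts."""
    seen = set()
    for text in texts:
        seen.update(simple_tokens(text))
    return [term for term in terms if term not in seen]


print(unmatched_terms(single_token_terms(PATTERN_LINES), TEXTS))
```

Any term this prints is a pattern that cannot fire on the test data, which is worth ruling out before looking at the exclude logic.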
So I start up with the following command:
prodigy ner.teach test_1 en_core_web_lg prodigy_data/test.jsonl --label SYMPTOM --patterns prodigy_data/patterns.jsonl
It gives me 4 tasks and then says there's nothing else in the queue.
I save the annotations, then run the following command, which I would expect to tell me that there's nothing in the queue, since I've annotated all the examples and am now excluding the test_1 dataset:
prodigy ner.teach test_2 en_core_web_lg prodigy_data/test.jsonl --label SYMPTOM --patterns prodigy_data/patterns.jsonl --exclude test_1
But, I'm back at the start with the first example from the source:
I would expect this to say "no tasks available" since I told it to exclude what's already been annotated from the test_1 dataset. Anyway, for further experimentation I completed this session and saved the results in dataset test_2.
Running db-out on both datasets tells me that the examples have the same _input_hash and _task_hash as well.
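For anyone who wants to repeat the hash comparison, this is roughly how it can be done on the exported files. It is a sketch that assumes each export is a JSONL file whose records carry the _input_hash and _task_hash keys; the file paths in the usage comment are hypothetical.

```python
import json


def load_jsonl(path):
    """Read one JSON record per non-empty line."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


def hash_pairs(examples):
    """Set of (_input_hash, _task_hash) pairs found in a list of records."""
    return {(eg["_input_hash"], eg["_task_hash"]) for eg in examples}


def shared_hashes(examples_a, examples_b):
    """Hash pairs that occur in both datasets."""
    return hash_pairs(examples_a) & hash_pairs(examples_b)


# Usage (hypothetical paths, adjust to wherever db-out wrote the files):
# a = load_jsonl("test_1_out/test_1.jsonl")
# b = load_jsonl("test_2_out/test_2.jsonl")
# print(len(shared_hashes(a, b)), "hash pairs appear in both datasets")
```

If every pair from test_2 shows up in the shared set, the two sessions produced identical tasks as far as the hashes are concerned.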
Running prodigy db-merge test_1,test_2 test_3 doesn't merge any of the examples, and then prodigy db-out test_3 test_3_out tells me that there are 8 examples in the merged dataset (I was expecting 4).
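To confirm that the extra examples in the merged export really are repeats, the records can be counted per _task_hash. This is a minimal sketch that assumes the export is JSONL with a _task_hash key on each record; the function name and the path in the usage comment are hypothetical.

```python
import json
from collections import Counter


def duplicate_task_hashes(lines):
    """Map each _task_hash that occurs more than once to its count.

    `lines` is any iterable of JSONL strings, e.g. an open file handle.
    """
    counts = Counter(
        json.loads(line)["_task_hash"] for line in lines if line.strip()
    )
    return {task_hash: n for task_hash, n in counts.items() if n > 1}


# Usage (hypothetical path, adjust to wherever db-out wrote the file):
# with open("test_3_out/test_3.jsonl", encoding="utf8") as f:
#     print(duplicate_task_hashes(f))
```

A non-empty result means the merged dataset holds several records for the same task rather than one record per task.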
Some other posts I've looked at to try and figure this out:
- Continue to annotate same data in new session
- --exclude is not working for ner.make-gold on same dataset
- Duplicated examples in NER.teach & large jsonl files
I'm a little stumped here. Is this the intended behavior of --exclude and I'm missing part of the wider annotation process? Are the patterns overriding examples that would be excluded? Is it an issue with the hashing logic?
For more context, my actual use case is that I have a very large source dataset I want to annotate over several sessions, starting with ner.teach, and I don't want to repeat an annotation for an example I've already seen.
Edit: I'm on Prodigy 1.10.1 on macOS.