Similar issue to Duplicates in ner.correct in 1.10.2, except I'm using spaCy 3.0.6 and Prodigy 1.11.0a8. I'm only getting roughly 20 examples to correct with each invocation of ner.correct.
I initially tagged the NER categories with ner.manual:
After about 1700 examples, with training along the way, I decided to switch to ner.correct so the model could start helping with predictions. I reviewed my dataset, data_ner2, for overlaps/contentions in a Jupyter notebook (same Prodigy environment) and added the corrected examples to a new dataset:
Now I'm concerned that I've used the TXT loader incorrectly... or maybe there's still an issue with ner.correct that limits the examples loaded from TXT files to ~20 per session.
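For what it's worth, the TXT loader simply treats each (non-empty) line of the input file as one task, so a quick stdlib-only sanity check can confirm how many examples the loader should be able to serve. This is just a sketch; the demo file stands in for the real TXT source:

```python
import os
import tempfile

def count_txt_tasks(path):
    """Count non-empty lines — the TXT loader yields one task per line."""
    with open(path, encoding="utf8") as f:
        return sum(1 for line in f if line.strip())

# Demo with a throwaway file standing in for the real source TXT
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("First example\n\nSecond example\nThird example\n")
    demo_path = f.name

n_tasks = count_txt_tasks(demo_path)
os.unlink(demo_path)
print(n_tasks)  # → 3 (the blank line is skipped)
```

If this count is far above ~20 for the real file, the cap on served examples isn't coming from the file itself.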
Hi! Could you try upgrading to the latest nightly? I think the latest release includes some fixes that might be relevant here.
And can you share some more details on the exact problem you're seeing? Are you asked about examples that you previously already annotated in your dataset?
Hi Ines! Yes, I intended to upgrade to the latest nightly to double-check, but I had a sprint wrapping up today, and I figured this might be a known issue that I just couldn't find.
I upgraded to 1.11.0a11 today, but the issue persists: each invocation of prodigy ner.correct ... only gives me 25 examples to correct (see above for the exact invocation, including the TXT loader).
The start of the training:
Is this the correct behavior? It doesn't feel like I'm getting very far training on just 25 examples at a time. However, I might be missing the reason for this behavior, as I only started using Prodigy in the past three weeks.
I played around with this some more today. In my last test I made 130 annotations (the stream looped after the first 26 examples), which were saved to the SQLite database. However, when I exported the session using db-out, the JSONL file contained only 30 examples.
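One stdlib-only way to check whether deduplication explains the gap: Prodigy tasks carry `_input_hash` and `_task_hash` fields, and counting the distinct values in the db-out export would show whether the 130 annotations collapsed onto a small set of repeated inputs. A rough sketch with made-up demo data:

```python
from collections import Counter

def hash_report(examples):
    """Count unique _input_hash / _task_hash values among exported tasks."""
    input_hashes = Counter(eg.get("_input_hash") for eg in examples)
    task_hashes = Counter(eg.get("_task_hash") for eg in examples)
    return len(input_hashes), len(task_hashes)

# Demo: three annotations, but two share the same input (a looped text)
demo = [
    {"_input_hash": 1, "_task_hash": 10},
    {"_input_hash": 1, "_task_hash": 11},
    {"_input_hash": 2, "_task_hash": 12},
]
print(hash_report(demo))  # → (2, 3)
```

For the real export, the list could be built with something like `[json.loads(line) for line in open("export.jsonl")]` (file name is a placeholder).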
I'm starting to suspect the TXT loader, but I can't examine the compiled loaders.pyd file. I'll set up a test using both the TXT and JSONL loaders to see if I can replicate the issue with different input files.
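In case it helps anyone replicating this, one way to feed identical data through both loaders is to convert the TXT source to Prodigy-style JSONL (one `{"text": ...}` object per line) and compare the streams. A minimal sketch, with throwaway files standing in for the real paths:

```python
import json
import os
import tempfile

def txt_to_jsonl(txt_path, jsonl_path):
    """Write one {"text": ...} JSON object per non-empty line of the TXT
    source, so the same data can run through both the TXT and JSONL loaders."""
    with open(txt_path, encoding="utf8") as src, \
         open(jsonl_path, "w", encoding="utf8") as out:
        for line in src:
            line = line.strip()
            if line:
                out.write(json.dumps({"text": line}) + "\n")

# Demo with temporary files in place of the real source/output paths
tmpdir = tempfile.mkdtemp()
txt_path = os.path.join(tmpdir, "source.txt")
jsonl_path = os.path.join(tmpdir, "source.jsonl")
with open(txt_path, "w", encoding="utf8") as f:
    f.write("First example\nSecond example\n")
txt_to_jsonl(txt_path, jsonl_path)
with open(jsonl_path, encoding="utf8") as f:
    converted = [json.loads(line) for line in f]
print(converted)  # → [{'text': 'First example'}, {'text': 'Second example'}]
```

If the two loaders then serve different numbers of examples from the same data, that points at the loader rather than the data or the datasets.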
Edit for Update:
Well, with the tests below, I confirmed that it wasn't a 'direct' issue with the TXT data loader in ner.correct:
These are new data files and new datasets, and I ran each test twice to check the --exclude logic against both an empty dataset and one containing some data. I could not replicate the issue with the new files and datasets.
However, the problem persists with the original TXT file and datasets. I tried running ner.correct with a different model path, with the same result. I'm starting to suspect that the existing datasets, or the comparison against them used to exclude examples, might be the cause.
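For anyone debugging something similar, the exclude step can be approximated in a few lines of plain Python: incoming tasks are skipped when their hash already appears in the excluded datasets (Prodigy's `exclude_by` setting controls whether `_task_hash` or `_input_hash` is compared). A rough sketch with made-up demo data:

```python
def filter_excluded(stream, seen_hashes, exclude_by="task"):
    """Skip incoming tasks whose hash already appears in the excluded
    datasets. `exclude_by` is "task" or "input", mirroring the config."""
    key = "_task_hash" if exclude_by == "task" else "_input_hash"
    for eg in stream:
        if eg.get(key) not in seen_hashes:
            yield eg

# Demo: hashes 10 and 12 were already annotated, so only hash 11 survives
stream = [{"_task_hash": 10}, {"_task_hash": 11}, {"_task_hash": 12}]
remaining = list(filter_excluded(stream, {10, 12}))
print(len(remaining))  # → 1
```

Running this kind of check against the real dataset hashes would show whether most of the source is being silently filtered out as "already annotated".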
I'm going to 'start fresh' so that I can keep moving with my tagging and get more than 25 examples at a time.