Similar issue to Duplicates in ner.correct in 1.10.2, except I'm using spaCy 3.0.6 and Prodigy 1.11.0a8. I'm getting roughly 20 examples to correct with each invocation of the recipe.
I originally tagged NER categories with:

```
prodigy ner.manual data_ner2 en_core_web_lg sample.txt --loader txt --label NER_labels.txt --patterns combined_patterns.jsonl
```
After about 1700 examples - with training along the way - I decided to switch to ner.correct so the model could start helping with predictions. I reviewed my dataset - data_ner2 - for overlaps/contentions in a Jupyter notebook (same Prodigy environment) and added the corrected examples to a new dataset.
This is how I used ner.correct:

```
prodigy ner.correct data_ner3 .\post-analysis-model\model-best\ sample.txt --loader txt --label NER_labels.txt --exclude data_ner2_reviewed
```
Now I'm concerned that I've used the TXT loader incorrectly ... or maybe there's still an issue with ner.correct that limits the examples loaded from TXT files to ~20 per session.
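For context on how `--exclude` filtering behaves: Prodigy skips incoming tasks whose hashes already appear in the excluded datasets, so a mismatch there could shrink the stream. Below is a minimal stand-alone sketch of that idea. It is not Prodigy's actual code - the real implementation computes input/task hashes with `prodigy.set_hashes` (murmurhash-based); here a plain SHA-1 of the text stands in for the hash:

```python
import hashlib


def input_hash(text: str) -> str:
    # Simplified stand-in for Prodigy's input hash (the real implementation
    # uses murmurhash via prodigy.set_hashes; this is only an approximation).
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def filter_seen(stream, seen_hashes):
    """Yield only tasks whose text hash is not in the excluded set."""
    for task in stream:
        h = input_hash(task["text"])
        if h not in seen_hashes:
            seen_hashes.add(h)
            yield task


# Example: two of the three incoming tasks were already annotated
annotated = ["First sentence.", "Second sentence."]
seen = {input_hash(t) for t in annotated}
incoming = [{"text": t} for t in
            ["First sentence.", "Third sentence.", "Second sentence."]]
remaining = list(filter_seen(incoming, seen))
print(remaining)  # only the task for "Third sentence." survives
```

If exclusion like this is over-matching (e.g. the same texts appear in both datasets), the stream can end much earlier than the input file would suggest.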
Hi! Could you try and upgrade to the latest nightly? I think the latest release includes some fixes that might be relevant here.
And can you share some more details on the exact problem you're seeing? Are you asked about examples that you previously already annotated in your dataset?
Hi Ines! Yes, I intended to upgrade to the latest nightly to double-check, but I had a sprint wrapping up today that I was working on, and figured maybe this was a known issue that I couldn't find.
I upgraded to 1.11.0a11 today, but the issue persists. Each invocation of prodigy ner.correct ... only gives me 25 examples to correct (see above for the exact invocation, including the TXT loader).
The start of the training:
After the first 25 examples, it loops back through the first 25 examples:
Is this correct behavior? It doesn't feel like I'm getting very far training on just 25 examples. However, I might be missing some reason for this behavior, as I only started using Prodigy in the past three weeks.
I played around with this some more today. The last test I ran saved 130 annotations (looping after the first 26 examples) to the SQLite database. However, when I exported the session using db-out, there were only 30 annotation examples in the JSONL file.
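One way to sanity-check a discrepancy like this is to count total rows vs. unique task hashes in the db-out export - if the 130 saved annotations collapse to ~30 unique hashes, the "extra" rows were duplicates of the same tasks. A small sketch (the `_task_hash` field is what Prodigy adds to saved annotations; the file path is hypothetical):

```python
import json


def count_exported(jsonl_path):
    """Count total examples and unique _task_hash values in a db-out export."""
    total, hashes = 0, set()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            task = json.loads(line)
            total += 1
            # Prodigy adds a _task_hash to every saved annotation
            hashes.add(task.get("_task_hash"))
    return total, len(hashes)


# Usage (hypothetical export file):
# total, unique = count_exported("data_ner3_export.jsonl")
# print(total, unique)
```

A large gap between the two numbers would point at duplicate tasks in the stream rather than at db-out dropping data.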
This issue seems similar to this other one from earlier this year: ner.correct: Only 31 annotations to database no matter how many actually annotated everytime
I'm starting to suspect the TXT loader, but I can't examine the loaders.pyd file. I'll try to set up a test using the TXT and JSONL loaders to see if I can replicate the issue with different input files.
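For a test like that, it helps that the txt loader is conceptually simple: as I understand it, each non-empty line of the file becomes one task with a "text" key. A rough stand-alone equivalent (a sketch of that behavior, not Prodigy's actual loader code) that also converts the same input to JSONL, so both loaders can be fed equivalent data:

```python
import json


def txt_to_tasks(path):
    """Rough equivalent of the txt loader: one task per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [{"text": line.strip()} for line in f if line.strip()]


def tasks_to_jsonl(tasks, out_path):
    # Write the same tasks as newline-delimited JSON, so the jsonl loader
    # sees input equivalent to the original TXT file
    with open(out_path, "w", encoding="utf-8") as f:
        for task in tasks:
            f.write(json.dumps(task) + "\n")
```

If both recipes behave the same on equivalent TXT and JSONL inputs, the loader itself is probably not the culprit.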
Edit for Update:
Well, with my tests below, I confirmed that it isn't a 'direct' issue with the TXT data loader in ner.correct:

```
prodigy ner.correct test_ner .\Aug2-Sess1-model\model-best\ Jan_2021_Data_random.jsonl --label 2021_07_16_NER_labels2.txt --exclude 'alit_ner3,test_ner'
```

```
prodigy ner.correct test_ner2 .\Aug2-Sess1-model\model-best\ Jan_2021_Data_random.txt --loader txt --label 2021_07_16_NER_labels2.txt --exclude 'alit_ner3,test_ner2'
```
These are new data files and new datasets, and I ran each command twice to check the --exclude logic against both an empty dataset and one with some data. I could not replicate the issue with the new files and datasets.
However, the problem persists with the original TXT file and datasets. I tried running ner.correct with a different model path, with the same result. I'm starting to suspect that the existing datasets, or the comparison against the excluded datasets, might be the cause.
I'm going to 'start fresh' so I can keep moving with my tagging, with more than 25 examples at a time.