Similar issue to Duplicates in ner.correct in 1.10.2, except I'm using spaCy 3.0.6 and Prodigy 1.11.0a8. I'm only getting roughly 20 examples to correct with each invocation of ner.correct.
I initially tagged the NER categories with ner.manual:
After about 1700 examples, with training along the way, I decided to switch to ner.correct so the model could start helping with predictions. I reviewed my dataset, data_ner2, for overlaps/contentions in a Jupyter notebook (same Prodigy environment) and added the corrected examples to a new dataset:
Now I'm concerned that I've used the TXT loader incorrectly... or maybe there's still an issue with ner.correct that limits the examples loaded from TXT files to ~20 per session.
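For what it's worth, the TXT loader simply treats each (non-empty) line of the input file as one task, so a quick stdlib-only sanity check can confirm how many examples the loader should be able to serve. This is just a sketch; the demo file stands in for the real TXT source:

```python
import os
import tempfile

def count_txt_tasks(path):
    """Count non-empty lines — the TXT loader yields one task per line."""
    with open(path, encoding="utf8") as f:
        return sum(1 for line in f if line.strip())

# Demo with a throwaway file standing in for the real source TXT
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("First example\n\nSecond example\nThird example\n")
    demo_path = f.name

n_tasks = count_txt_tasks(demo_path)
os.unlink(demo_path)
print(n_tasks)  # → 3 (the blank line is skipped)
```

If this count is far above ~20 for the real file, the cap on served examples isn't coming from the file itself.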
Hi! Could you try upgrading to the latest nightly? I think the latest release includes some fixes that might be relevant here.
And can you share some more details on the exact problem you're seeing? Are you asked about examples that you previously already annotated in your dataset?
Hi Ines! Yes, I intended to upgrade to the latest nightly to double-check, but I had a sprint wrapping up today, and I figured this might be a known issue that I just couldn't find.
I upgraded to 1.11.0a11 today, but the issue persists: each invocation of prodigy ner.correct ... only gives me 25 examples to correct (see above for the exact invocation, including the TXT loader).
The start of the training:
Is this the correct behavior? It doesn't feel like I'm getting very far training on just 25 examples at a time. However, I might be missing the reason for this behavior, as I only started using Prodigy in the past three weeks.
I played around with this some more today. In my last test I made 130 annotations (the stream looped after the first 26 examples), which were saved to the SQLite database. However, when I exported the session using db-out, the JSONL file contained only 30 examples.
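One stdlib-only way to check whether deduplication explains the gap: Prodigy tasks carry `_input_hash` and `_task_hash` fields, and counting the distinct values in the db-out export would show whether the 130 annotations collapsed onto a small set of repeated inputs. A rough sketch with made-up demo data:

```python
from collections import Counter

def hash_report(examples):
    """Count unique _input_hash / _task_hash values among exported tasks."""
    input_hashes = Counter(eg.get("_input_hash") for eg in examples)
    task_hashes = Counter(eg.get("_task_hash") for eg in examples)
    return len(input_hashes), len(task_hashes)

# Demo: three annotations, but two share the same input (a looped text)
demo = [
    {"_input_hash": 1, "_task_hash": 10},
    {"_input_hash": 1, "_task_hash": 11},
    {"_input_hash": 2, "_task_hash": 12},
]
print(hash_report(demo))  # → (2, 3)
```

For the real export, the list could be built with something like `[json.loads(line) for line in open("export.jsonl")]` (file name is a placeholder).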
I'm starting to suspect the TXT loader, but I can't examine the compiled loaders.pyd file. I'll set up a test using both the TXT and JSONL loaders to see if I can replicate the issue with different input files.
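In case it helps anyone replicating this, one way to feed identical data through both loaders is to convert the TXT source to Prodigy-style JSONL (one `{"text": ...}` object per line) and compare the streams. A minimal sketch, with throwaway files standing in for the real paths:

```python
import json
import os
import tempfile

def txt_to_jsonl(txt_path, jsonl_path):
    """Write one {"text": ...} JSON object per non-empty line of the TXT
    source, so the same data can run through both the TXT and JSONL loaders."""
    with open(txt_path, encoding="utf8") as src, \
         open(jsonl_path, "w", encoding="utf8") as out:
        for line in src:
            line = line.strip()
            if line:
                out.write(json.dumps({"text": line}) + "\n")

# Demo with temporary files in place of the real source/output paths
tmpdir = tempfile.mkdtemp()
txt_path = os.path.join(tmpdir, "source.txt")
jsonl_path = os.path.join(tmpdir, "source.jsonl")
with open(txt_path, "w", encoding="utf8") as f:
    f.write("First example\nSecond example\n")
txt_to_jsonl(txt_path, jsonl_path)
with open(jsonl_path, encoding="utf8") as f:
    converted = [json.loads(line) for line in f]
print(converted)  # → [{'text': 'First example'}, {'text': 'Second example'}]
```

If the two loaders then serve different numbers of examples from the same data, that points at the loader rather than the data or the datasets.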
Edit for Update:
Well, with the tests below, I confirmed that it wasn't a 'direct' issue with the TXT data loader in ner.correct:
These are new data files and new datasets, and I ran each test twice to check the --exclude logic against both an empty dataset and one containing some data. I could not replicate the issue with the new files and datasets.
However, the problem persists with the original TXT file and datasets. I tried running ner.correct with a different model path, with the same result. I'm starting to suspect that the existing datasets, or the comparison against them used to exclude examples, might be the cause.
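For anyone debugging something similar, the exclude step can be approximated in a few lines of plain Python: incoming tasks are skipped when their hash already appears in the excluded datasets (Prodigy's `exclude_by` setting controls whether `_task_hash` or `_input_hash` is compared). A rough sketch with made-up demo data:

```python
def filter_excluded(stream, seen_hashes, exclude_by="task"):
    """Skip incoming tasks whose hash already appears in the excluded
    datasets. `exclude_by` is "task" or "input", mirroring the config."""
    key = "_task_hash" if exclude_by == "task" else "_input_hash"
    for eg in stream:
        if eg.get(key) not in seen_hashes:
            yield eg

# Demo: hashes 10 and 12 were already annotated, so only hash 11 survives
stream = [{"_task_hash": 10}, {"_task_hash": 11}, {"_task_hash": 12}]
remaining = list(filter_excluded(stream, {10, 12}))
print(len(remaining))  # → 1
```

Running this kind of check against the real dataset hashes would show whether most of the source is being silently filtered out as "already annotated".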
I'm going to 'start fresh' so that I can keep moving with my tagging and get more than 25 examples at a time.