After removing a lot of inconsistent labels from the dataset, my model's F1 went down a fair bit, which surprised me. Could the decrease in F1 be because the wrong and corrected data are now mixed up in the same dataset? What's the recommended best practice when reviewing labels?
Thanks
PS Some other info that might be useful:
As I was doing the review, I noticed that the number of annotated examples in the sidebar went up, which didn't set off any alarm bells at the time.
My Prodigy version info:
Version: 1.11.0a4
Platform: macOS-10.15.7-x86_64-i386-64bit
Python Version: 3.8.0
Database Name: SQLite
Database Id: sqlite
Total Datasets: 6
Total Sessions: 85
This sounds like the most likely explanation, yes. If you accidentally ended up with duplicate and conflicting examples in your dataset and then train from that data or export it for use with spaCy, Prodigy will try to merge whatever it can and discard later annotations that conflict with what's already there.
So when you're using the review recipe, you typically want to save the result to a new dataset, so you end up with one clean and final set that reflects the decisions you/the reviewer(s) made.
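For example (the dataset names below are just placeholders), you could run the review workflow with a fresh output dataset, e.g. from Python via prodigy.serve:

```python
import prodigy

# Equivalent to running "prodigy review reviewed_final my_original_dataset"
# on the command line: review the annotations in "my_original_dataset" and
# save the reviewer's final decisions to the *new* dataset "reviewed_final",
# leaving the original dataset untouched. Both names are placeholders.
prodigy.serve("review reviewed_final my_original_dataset")
```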
If you want to extract the reviewed annotations from your other dataset again, the most straightforward way would be to export it or load it in Python and look for examples with "_view_id": "review". Those will be the examples created with the review workflow, and you can then add them to a fresh dataset and train from that.
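Something along these lines should work, using the database API (again, the dataset names are just placeholders):

```python
from prodigy.components.db import connect

db = connect()  # connect using the database settings from your prodigy.json

# Load all examples from the dataset that currently mixes old and reviewed
# annotations ("my_mixed_dataset" is a placeholder name).
examples = db.get_dataset("my_mixed_dataset")

# Keep only the examples created by the review workflow
reviewed = [eg for eg in examples if eg.get("_view_id") == "review"]

# Add them to a fresh dataset so you can train from a clean, final set
db.add_dataset("reviewed_only")
db.add_examples(reviewed, datasets=["reviewed_only"])
print(f"Copied {len(reviewed)} reviewed examples to 'reviewed_only'")
```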
(I've also been thinking about ways to make this easier or add more checks to prevent mistakes like accidentally mixing up datasets. It's just a bit tricky, because Prodigy tries to make very few assumptions about the data and what it "means".)