Having removed a lot of inconsistent labels from the dataset, my model's F1 went down a fair bit, which surprised me. Could the decrease in F1 be because the wrong and corrected data are now mixed up in the same dataset? What's the recommended best practice when reviewing labels?
PS Some other info that might be useful:
As I was doing the review, I noticed the number of annotated examples in the sidebar went up, which didn't set off any alarm bells at the time.
My Prodigy version:
Python version: 3.8.0
Database name: SQLite
Database ID: sqlite
Total datasets: 6
Total sessions: 85
This sounds like the most likely explanation, yes. If you accidentally ended up with duplicate and conflicting examples in your dataset and train from the data or export it for use with spaCy, Prodigy will try to merge whatever it can and discard later annotations that conflict with what's already there.
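To illustrate the merging behavior described above, here's a minimal sketch (not Prodigy's actual implementation) of what "keep the earliest annotation, discard later conflicting duplicates" looks like. It assumes each example carries an `"_input_hash"` identifying the underlying input, which is how Prodigy-style JSONL data is typically keyed:

```python
def merge_annotations(examples):
    """Keep the first annotation seen for each input; drop later
    conflicting copies of the same input. A rough sketch of the
    merge-and-discard behavior described above, not Prodigy's code.
    """
    merged = {}
    for eg in examples:
        key = eg["_input_hash"]  # identifies the input, not the answer
        if key not in merged:
            merged[key] = eg
        # a later example with the same input hash but a different
        # answer is silently discarded here
    return list(merged.values())
```

If your dataset contains both the original wrong label and the later corrected one for the same input, a scheme like this keeps the *original* wrong label, which would explain the F1 drop.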
So when you're using the review recipe, you typically want to save the result to a new dataset so you have one clean and final set that includes the final decisions you/the reviewer(s) made.
If you want to extract the reviewed annotations from your other dataset again, the most straightforward way is to export it or load it in Python and look for examples with "_view_id": "review". Those are the examples created with the review workflow, and you can then add them to a fresh dataset and train from that.
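As a concrete sketch of that filtering step: assuming you've exported the dataset to JSONL (e.g. with `db-out`), you can keep only the review-created examples with a few lines of plain Python. The file name here is hypothetical:

```python
import json

def extract_reviewed(jsonl_lines):
    """Return only the examples created by the review workflow,
    identified by '_view_id': 'review' as described above."""
    reviewed = []
    for line in jsonl_lines:
        eg = json.loads(line)
        if eg.get("_view_id") == "review":
            reviewed.append(eg)
    return reviewed

# Hypothetical usage with an exported file:
# with open("my_dataset.jsonl", encoding="utf8") as f:
#     reviewed = extract_reviewed(f)
```

You can then write `reviewed` back out as JSONL and import it into a fresh dataset to train from.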
(I've also been thinking about ways to make this easier or add more checks to prevent mistakes like accidentally mixing up datasets. It's just a bit tricky, because Prodigy tries to make very few assumptions about the data and what it "means".)