Annotated dataset lost, what now?

rahul1 · May 24, 2021, 11:26am

Hi, I am a policy officer at a funding agency.

I used prodi.gy to annotate around 500 documents with multiple custom labels (20!), and subsequently created a custom NER model. Unfortunately I had to format my laptop, which meant reinstalling prodi.gy. I then realised that I had lost the annotated dataset. I had saved the NER model somewhere else before formatting the laptop.

In summary: I have the NER model based on the annotations, but not the annotations.

Can you please suggest possible steps to increase the efficiency of the NER model?
(I know that I can use the NER model to annotate the original documents within prodi.gy again and then build a better model from that annotated dataset, but I would like to explore other options.)

regards
Rahul

ines · May 24, 2021, 11:48am

Aw, sorry to hear! The Prodigy database lives in your user home directory by default, so it sounds like that got lost in the formatting then. (In general, you can uninstall and re-install Prodigy without affecting your annotations and re-use the same existing database. But if there's a way you can auto-backup the .prodigy/prodigy.db file in the future, that'd probably be good.)
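As a rough illustration of the auto-backup idea, here is a minimal Python sketch that copies the SQLite file to a timestamped backup. It assumes the default location mentioned above (.prodigy/prodigy.db in your home directory). The function name and the backup folder are made up for the example and are not part of Prodigy.

```python
import sqlite3
from datetime import datetime
from pathlib import Path


def backup_prodigy_db(db_path, backup_dir):
    """Copy a SQLite database to a timestamped file and return the new path.

    Hypothetical helper, not a Prodigy API. It uses SQLite's own backup
    mechanism, which produces a consistent copy even if the file is in use.
    """
    db_path = Path(db_path)
    backup_dir = Path(backup_dir)
    if not db_path.is_file():
        raise FileNotFoundError(f"No database found at {db_path}")
    backup_dir.mkdir(parents=True, exist_ok=True)

    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"prodigy-{stamp}.db"

    source = sqlite3.connect(str(db_path))
    dest = sqlite3.connect(str(target))
    try:
        source.backup(dest)  # page-by-page copy of the whole database
    finally:
        dest.close()
        source.close()
    return target


# Example call, assuming the default database location and a backup
# folder of your choice (ideally on another drive or a synced folder):
# backup_prodigy_db(Path.home() / ".prodigy" / "prodigy.db",
#                   Path.home() / "prodigy-backups")
```

You could run something like this from a scheduled task, or simply before reinstalling your system.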

If you want to get your data back, this is a pretty good idea, actually! The NER model you trained on the data should get very high accuracy on the training examples. Maybe not 100%, but very high. So if you just run your model over the data, you should almost get your original annotations back. 500 documents is small enough that it shouldn't take you too long to review if you want to double-check the examples. So you could run your model over the original raw data with ner.correct, correct the mistakes (which shouldn't be many), and you'll have your original data back. It shouldn't take you longer than a few hours, to be honest!
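If you'd like to look at the model's predictions before or alongside reviewing them, something like the sketch below would work. It runs a loaded spaCy pipeline over the raw texts and writes one JSON record per line with the text and the predicted entity spans as character offsets. The function names and file paths are hypothetical, and it assumes your raw data is a JSONL file with a "text" field per line. ner.correct does the prediction step for you, so this is optional.

```python
import json
from pathlib import Path


def doc_to_task(doc):
    """Turn a processed spaCy Doc into a dict with the text and entity spans."""
    spans = [
        {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
        for ent in doc.ents
    ]
    return {"text": doc.text, "spans": spans}


def write_predictions(nlp, texts, out_path):
    """Run the pipeline over all texts and write one JSON record per line."""
    count = 0
    with Path(out_path).open("w", encoding="utf8") as f:
        for doc in nlp.pipe(texts):
            f.write(json.dumps(doc_to_task(doc)) + "\n")
            count += 1
    return count


def main(model_dir, raw_path, out_path):
    """Load the saved model and pre-annotate a JSONL file of raw texts.

    Assumption: every non-empty line of raw_path is a JSON object with a
    "text" key. All three paths are placeholders for your own files.
    """
    import spacy  # imported here so the helpers above don't require spaCy

    nlp = spacy.load(model_dir)
    with Path(raw_path).open(encoding="utf8") as f:
        texts = [json.loads(line)["text"] for line in f if line.strip()]
    return write_predictions(nlp, texts, out_path)
```

Calling main("path/to/your/model", "raw.jsonl", "predictions.jsonl") would then give you a file you can skim for obvious mistakes. Reviewing the examples properly is still best done in ner.correct.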

(Alternatively, if you used prodigy train in Prodigy v1.10.x, check if you saved the whole directory it saved out? Because that should include a .json file of the original training and evaluation data.)

rahul1 · May 24, 2021, 12:53pm

Yes, I did use prodigy train in Prodigy v1.10.x (I have been using prodi.gy since November 2020). I see that my model contains .json files (meta.json and strings.json). So, can I reimport the strings.json file, which I guess has the annotations, using the db-in command?

regards
Rahul

ines · May 25, 2021, 1:41am

Ah, no, those are the meta information and the strings cache. And sorry, I just realised my above comment was wrong – it would have been an older version of Prodigy before we introduced the data-to-spacy command for exports. So unless you exported the training data explicitly, it wouldn't be in the model directory.

But anyway, since you have the model, just run it over your raw data and you should get annotations back that are very close to what you originally had.

rahul1 · May 25, 2021, 10:12am

That's what I am doing now.
Thank you very much.
