Annotated dataset lost, what now?

Hi, I am a policy officer at a funding agency.

I annotated around 500 documents with multiple custom labels (20!), and subsequently trained a custom NER model. Unfortunately I had to format my laptop, and after reinstalling I realised that I had lost the annotated dataset. I had saved the NER model somewhere else before formatting the laptop.

In summary: I have the NER model based on the annotations, but not the annotations.

Can you please suggest possible steps to increase the efficiency of the NER model?
(I know that I can use the NER model to annotate the original documents again and build a better model based on that annotated dataset, but I would like to explore other options.)


Aw, sorry to hear! The Prodigy database lives in your user home directory by default, so it sounds like that got lost in the formatting then. (In general, you can uninstall and re-install Prodigy without affecting your annotations and re-use the same existing database. But if there's a way you can auto-backup the .prodigy/prodigy.db file in the future, that'd probably be good.)
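A minimal sketch of such a backup, assuming the default location ~/.prodigy/prodigy.db mentioned above. The function name `backup_prodigy_db` and the backup folder are made up for illustration, and it's safest to run the copy while Prodigy isn't writing to the database:

```python
import shutil
import time
from pathlib import Path


def backup_prodigy_db(db_path, backup_dir):
    """Copy the database file into backup_dir under a timestamped name."""
    db_path = Path(db_path)
    if not db_path.is_file():
        raise FileNotFoundError(f"No database found at {db_path}")
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # e.g. prodigy-20210301-093000.db, so older backups are never overwritten
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{db_path.stem}-{stamp}{db_path.suffix}"
    shutil.copy2(db_path, target)
    return target


# Usage (hypothetical backup folder, ideally on another drive or a synced folder):
# backup_prodigy_db(Path.home() / ".prodigy" / "prodigy.db",
#                   Path.home() / "prodigy-backups")
```

You could call something like this from a scheduled task so the copy happens without you having to think about it.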

If you want to get your data back, this is a pretty good idea, actually! The NER model you trained on the data should get very high accuracy on the training examples. Maybe not 100%, but very high. So if you just run your model over the data, you should almost get your original annotations back :smiley: 500 documents is small enough that it shouldn't take you too long to review if you want to double-check the examples. So you could run your model over the original raw data with ner.correct, correct the mistakes (which shouldn't be many), and you'll have your original data back. It shouldn't take you longer than a few hours, to be honest!
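To illustrate what "running the model over the data" produces, here is a rough sketch that turns the model's predicted entities into records with a text and a list of spans (start, end, label), one JSON object per line. The helper names, the `./my_ner_model` path and the output file name are placeholders, and spaCy is only imported inside `predict_tasks`:

```python
import json


def ents_to_task(text, ents):
    """Build one record from (start_char, end_char, label) triples."""
    spans = [{"start": s, "end": e, "label": label} for s, e, label in ents]
    return {"text": text, "spans": spans}


def predict_tasks(model_path, texts):
    """Run a saved spaCy pipeline over raw texts and yield one record per text."""
    import spacy  # imported here so the other helpers work without spaCy

    nlp = spacy.load(model_path)
    for doc in nlp.pipe(texts):
        ents = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
        yield ents_to_task(doc.text, ents)


def write_jsonl(path, tasks):
    """Write one JSON record per line."""
    with open(path, "w", encoding="utf8") as f:
        for task in tasks:
            f.write(json.dumps(task, ensure_ascii=False) + "\n")


# Usage (placeholder paths, raw_texts is your list of original documents):
# write_jsonl("predicted.jsonl", predict_tasks("./my_ner_model", raw_texts))
```

These are unreviewed predictions, so they will contain whatever mistakes the model makes. Going through the examples in ner.correct is still the step that gets you back to trustworthy annotations.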

(Alternatively, if you used prodigy train in Prodigy v1.10.x, check whether you kept the whole directory it saved out. That should include a .json file of the original training and evaluation data.)

Yes, I did use prodigy train in Prodigy v1.10.x (I have been using it since November 2020). I see that my model contains .json files (meta.json and strings.json). So, can I reimport the strings.json file, which I guess has the annotations, using the db-in command?


Ah, no, those are the meta information and the strings cache. And sorry, I just realised my above comment was wrong – it would have been an older version of Prodigy before we introduced the data-to-spacy command for exports. So unless you exported the training data explicitly, it wouldn't be in the model directory.

But anyway, since you have the model, just run it over your raw data and you should get annotations back that are very close to what you originally had :slightly_smiling_face:

That's what I am doing now. :grinning:
Thank you very much.
