Restore lost annotated dataset from training.jsonl and evaluation.jsonl found in a trained model

Alexey · January 17, 2020, 2:14pm

Hi Prodigy team!

I accidentally deleted a database that contained some annotated data. Before the deletion, my teammate trained a few models, and inside the model directories I found training.jsonl and evaluation.jsonl. Is it possible to restore the lost dataset from these files?

ines · January 19, 2020, 12:33pm

Hi! This should hopefully be easy – the exported data is in Prodigy's JSON format, so you can always re-import it to a dataset using the db-in command. For example:

prodigy db-in your_dataset ./training.jsonl
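If you want both files back in a single dataset, one option is to merge them into one JSONL file first and then import that. The following is only a minimal sketch using the Python standard library: the helper name merge_jsonl and the output file name restored.jsonl are made up for illustration, and dropping exact duplicate records is an assumption about what you'd want, not something Prodigy requires.

```python
import json


def merge_jsonl(paths, out_path):
    """Concatenate JSONL files into out_path.

    Skips blank lines and exact duplicate records, and returns the number
    of records written. Hypothetical helper, not part of Prodigy.
    """
    seen = set()
    count = 0
    with open(out_path, "w", encoding="utf8") as out:
        for path in paths:
            with open(path, encoding="utf8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    # json.loads raises an error if a line is not valid JSON
                    record = json.loads(line)
                    # Serialize with sorted keys so key order doesn't matter
                    key = json.dumps(record, sort_keys=True)
                    if key in seen:
                        continue
                    seen.add(key)
                    out.write(json.dumps(record) + "\n")
                    count += 1
    return count
```

You could then call merge_jsonl(["training.jsonl", "evaluation.jsonl"], "restored.jsonl") and run db-in on restored.jsonl. Keep in mind that this throws away the training / evaluation split, so if you want to reuse the same evaluation set later, it may be better to import the two files into two separate datasets instead.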

Alexey · January 20, 2020, 9:54am

Hi Ines! Thank you for the answer.

Alexey · January 21, 2020, 10:14am

Hi Ines! I have one more question. The guy who trained the model said that he had used both the active learning and the gold annotation strategies to annotate the data. Will the db-in command take that into account when importing? Or is there any way to distinguish what the model predicted from what humans annotated in the training.jsonl file?

ines · January 21, 2020, 12:57pm

Did they mix those two types of annotations in the same dataset? The training.jsonl file will only include the data used in that particular training session. Ideally, you wouldn't want to mix binary and manual annotations in the same set, because you'd want to train and evaluate the model differently depending on the type of annotation.

There are some clues in the data that can tell you whether annotations are binary questions about the model's predictions, or manual annotations created by a human or semiautomatically by a human and a model.

Binary annotations: Only ever have one entry in "spans" (the entity to collect feedback on), and if the suggestion comes from the model, the task's "meta" typically contains the "score". In Prodigy v1.8+, tasks also have a "_view_id" storing the name of the annotation interface that was used. For binary annotations, this should be "ner".
Manual annotations: Can have any number of "spans" and also contain "tokens" (because the text is pre-tokenized for easier highlighting). The "_view_id", if present, would typically be "ner_manual".
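The clues above could be turned into a rough sorting script. This is only a heuristic sketch, not an official Prodigy utility: the function names guess_annotation_type and split_by_type are invented for illustration, and the order in which the checks are applied ("_view_id" first, then "tokens", then a single span with a "score") is an assumption. Anything that doesn't match is labeled "unknown" so you can check it by hand.

```python
import json


def guess_annotation_type(task):
    """Return "binary", "manual" or "unknown" for one task dict (heuristic)."""
    # Prodigy v1.8+ stores the annotation interface that was used
    view_id = task.get("_view_id")
    if view_id == "ner_manual":
        return "manual"
    if view_id == "ner":
        return "binary"
    # Fallback clues for tasks without a "_view_id"
    spans = task.get("spans") or []
    meta = task.get("meta") or {}
    if "tokens" in task:
        # Manual tasks are pre-tokenized for easier highlighting
        return "manual"
    if len(spans) == 1 and "score" in meta:
        # One suggested span plus a model score points to a binary question
        return "binary"
    return "unknown"


def split_by_type(lines):
    """Group JSONL lines (strings) by their guessed annotation type."""
    groups = {"binary": [], "manual": [], "unknown": []}
    for line in lines:
        line = line.strip()
        if line:
            task = json.loads(line)
            groups[guess_annotation_type(task)].append(task)
    return groups
```

For example, you could open training.jsonl, pass the file object to split_by_type, and then write each group to its own JSONL file before importing it into a separate dataset with db-in.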
