Restore lost annotated dataset from training.jsonl and evaluation.jsonl found in a trained model

Hi Prodigy team!

I accidentally deleted a database that contained some annotated data. Before the deletion, my teammate trained a few models, inside of which I found training.jsonl and evaluation.jsonl. Is it possible to restore the lost dataset from these files?

Hi! This should hopefully be easy – the exported data is in Prodigy's JSON format, so you can always re-import it to a dataset using the db-in command. For example:

prodigy db-in your_dataset ./training.jsonl
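Before importing, it can help to check that the file parses and to see how the answers are distributed. The following is a minimal sketch using only the Python standard library, not part of Prodigy itself. The function name summarize_lines is hypothetical, and the sketch assumes each line holds one JSON task with an optional "answer" key.

```python
import json
from collections import Counter


def summarize_lines(lines):
    """Count tasks by their "answer" value (hypothetical helper, not a Prodigy API)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task = json.loads(line)  # raises an error if a line is not valid JSON
        counts[task.get("answer", "(no answer)")] += 1
    return counts
```

To use it, open the file and pass the file object, for example summarize_lines(open("training.jsonl", encoding="utf8")). The same check works for the evaluation file.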

Hi Ines! Thank you for the answer.

Hi Ines! I have one more question. The guy who trained the model said that he had used both the active learning and the gold annotation strategies to annotate the data. Will the db-in command take that into account when importing? Or is there any way to distinguish what the model predicted from what humans annotated in the training.jsonl file?

Did they mix those two types of annotations in the same dataset? The training.jsonl file will only include the data used in that particular training session. Ideally, you wouldn't want to mix binary and manual annotations in the same set, because you'd want to use the data and train / evaluate the model differently depending on the type of annotation.

There are some clues in the data that can tell you whether annotations are binary questions about the model's predictions, or manual annotations created by a human or semiautomatically by a human and a model.

  • Binary annotations: Only ever have one entry in the "spans" (the entity to collect feedback on) and if the suggestion comes from the model, the task's "meta" typically contains the "score". In Prodigy v1.8+, tasks also have a "_view_id" storing the name of the annotation interface that was used. This should be "ner".
  • Manual annotations: Can have any number of "spans" and also contain "tokens" (because the text is pre-tokenized for easier highlighting). The "_view_id", if present, would typically be "ner_manual".
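
The clues above could be turned into a rough filter. This is only a heuristic sketch in plain Python, not a Prodigy feature: the names classify_task and split_tasks are hypothetical, the order of the checks is an assumption, and tasks that match none of the clues are reported as "unknown" rather than guessed.

```python
import json


def classify_task(task):
    """Guess whether a task is 'binary', 'manual' or 'unknown' (heuristic only)."""
    view_id = task.get("_view_id")  # only present in Prodigy v1.8+
    if view_id == "ner":
        return "binary"
    if view_id == "ner_manual":
        return "manual"
    # No usable "_view_id": fall back to the weaker clues.
    if "tokens" in task:
        return "manual"  # pre-tokenized text suggests the manual interface
    spans = task.get("spans") or []
    meta = task.get("meta") or {}
    if len(spans) == 1 and "score" in meta:
        return "binary"  # a single scored suggestion from the model
    return "unknown"


def split_tasks(lines):
    """Group the JSON tasks found in an iterable of JSONL lines by guessed type."""
    groups = {"binary": [], "manual": [], "unknown": []}
    for line in lines:
        line = line.strip()
        if line:
            task = json.loads(line)
            groups[classify_task(task)].append(task)
    return groups
```

Each group could then be written back to its own JSONL file and imported into a separate dataset with db-in. Anything that ends up in "unknown" is worth checking by hand.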