Convert DocBins or .spacy files to .jsonl format

Hi community!

I am looking to add a number of DocBins / .spacy files that include NER annotations to my Prodigy database, as I am resolving inconsistencies between multiple annotators. It seems that data can only be added in .jsonl format, so I have been trying to convert the DocBins / .spacy files to .jsonl as a step before adding them to the database.

I see that there is a data-to-spacy recipe, but it of course does the opposite of what I am trying to achieve. I would like to avoid having to code it manually in the fashion that has been recommended in a similar thread.

Does anyone know how to do that?
Or alternatively, how to add .spacy files to my db?

Hi @emiltj!

Thanks for your question and welcome to the Prodigy community :wave:

So you're right that the approach in a similar thread is your best bet. Since Prodigy is a developer tool, it is designed so that you can write your own code to extend its functionality for one-off cases like this. I've rewritten the code from that post:

import spacy
from spacy.tokens import DocBin
import srsly

path_data = "train.spacy"

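# A blank pipeline is enough here: only its vocab is needed to restore the docs
nlp = spacy.blank("en")

doc_bin = DocBin().from_disk(path_data)
examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    # Prodigy expects character offsets for each span
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

# One JSON object per line, ready for db-in
srsly.write_jsonl("train.jsonl", examples)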

As you may have seen, you'd then need to run:

python -m prodigy db-in ner_sample train.jsonl
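
If you want to sanity-check the file before importing it, here is a minimal sketch using only the standard library. It assumes the record shape produced above (a "text" string plus a "spans" list with "start", "end" and "label"). The function name check_spans is made up for illustration, and this is not what db-in does internally; it only catches offsets that fall outside the text or spans that are missing a label.

```python
import json

def check_spans(jsonl_lines):
    """Return (line_number, problem) tuples for records that look malformed.

    Hypothetical helper: accepts any iterable of JSONL strings,
    e.g. an open file handle.
    """
    problems = []
    for i, line in enumerate(jsonl_lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        text = record.get("text")
        if not isinstance(text, str):
            problems.append((i, "missing or non-string 'text'"))
            continue
        for span in record.get("spans", []):
            start, end = span.get("start"), span.get("end")
            if not (isinstance(start, int) and isinstance(end, int)):
                problems.append((i, "span without integer offsets"))
            elif not (0 <= start < end <= len(text)):
                problems.append((i, f"span {start}-{end} outside text"))
            elif "label" not in span:
                problems.append((i, "span without 'label'"))
    return problems

sample = [
    '{"text": "Apple is in Cupertino", "spans": [{"start": 0, "end": 5, "label": "ORG"}]}',
    '{"text": "Short", "spans": [{"start": 0, "end": 99, "label": "ORG"}]}',
]
print(check_spans(sample))  # [(2, 'span 0-99 outside text')]
```

To run it on the real file, pass an open handle: with open("train.jsonl", encoding="utf8") as f: print(check_spans(f)). An empty list means nothing suspicious was found.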

You can also skip the .jsonl step and load the data directly into the Prodigy database. The steps are the same as above, except that instead of writing to .jsonl you add the examples to a dataset:

import spacy
from spacy.tokens import DocBin
from prodigy.components.db import connect
from prodigy import set_hashes

path_data = "train.spacy"

nlp = spacy.blank("en")

doc_bin = DocBin().from_disk(path_data)
examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

db = connect()                                        # connect to the database
db.add_dataset("ner_sample")                          # add dataset ner_sample
examples = [set_hashes(eg) for eg in examples]        # add input and task hashes
db.add_examples(examples, ["ner_sample"])             # add examples to ner_sample

That code does about the same as db-in but doesn't include some of its checks. Therefore, I would recommend just loading with db-in, but this still shows you how it's possible.

I'm curious - can you explain why you're doing this? Is it because of some old .spacy files you created before using Prodigy (and now you want to add them to a Prodigy workflow)? So this is more of a one-time load? Or do you plan to do this on an ongoing basis?

The reason there isn't an off-the-shelf recipe that does the opposite of data-to-spacy is because it's generally thought that users would want to move data out of the Prodigy database when training, not the other way around. Thanks for your help!

Hi Ryan,

First of all, thank you for your elaborate answer. I'll be implementing the more manual way of writing .jsonl files from DocBins.

Background:
I have had a lot of texts annotated in Prodigy by 10 different raters. Some texts have been annotated by all raters, while others have been annotated by a single rater. My end goal is to create a new gold-standard dataset. My plan has been to:
1) Use the review recipe on the texts that have been annotated by multiple raters, to get a gold-standard dataset for this part of the full data.
2) Train an NER model on this gold-standard sub-dataset, and use it to predict on the remaining texts (those that have only been annotated by a single rater).
3) Review any disagreements between the model's predictions and the single raters' annotations, thus getting a gold-standard dataset for the remaining data.
4) Finally, merge the two gold-standard datasets (the reviewed texts that were annotated by multiple raters, and the reviewed texts that were annotated by a single rater and the NER model).

For why - more specifically - I have been doing what I have been doing:
As the raters disagree on a lot of the tagging, I am going to use the review recipe to handle rater disagreements. I see that this recipe can automatically accept annotations without conflicts, but since I have annotations from 10 different raters, I am going to accept partial agreements over a certain threshold (e.g. if 70% of raters agree on a tag, I will accept it). Also, to make the review process easier with so many texts, I want to delete infrequent annotations, so that they are skipped during review. I have done this in a script, using spaCy. As my next step is to review the remaining differences using the review recipe, I wanted the results back in my db.

The process of automatically accepting partial agreements over a certain threshold, and declining those under a certain threshold, is something I am going to do multiple times. It would have been great to have this feature in the review recipe.
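
To make the thresholding concrete, here is a minimal sketch in plain Python. It is not the script mentioned above and not a Prodigy feature. The function name triage_spans and the lower cut-off drop_below are hypothetical, and it assumes that two raters "agree" only when start, end and label all match exactly, and that every rater in the input has seen the text.

```python
from collections import Counter

def triage_spans(annotations, accept_at=0.7, drop_below=0.3):
    """Split the spans for ONE text by the share of raters who produced them.

    annotations: dict mapping rater name -> list of (start, end, label) tuples.
    Returns (accepted, to_review, dropped), each a sorted list of spans.
    """
    n_raters = len(annotations)
    counts = Counter()
    for spans in annotations.values():
        counts.update(set(spans))  # a rater counts at most once per span

    accepted, to_review, dropped = [], [], []
    for span, count in counts.items():
        share = count / n_raters
        if share >= accept_at:
            accepted.append(span)   # enough agreement: accept automatically
        elif share < drop_below:
            dropped.append(span)    # too infrequent: skip in review
        else:
            to_review.append(span)  # in between: leave for manual review
    return sorted(accepted), sorted(to_review), sorted(dropped)

annotations = {
    "r1": [(0, 5, "ORG"), (12, 21, "GPE")],
    "r2": [(0, 5, "ORG"), (12, 21, "GPE")],
    "r3": [(0, 5, "ORG"), (12, 21, "LOC")],
    "r4": [(0, 5, "ORG")],
    "r5": [(0, 5, "PER")],
}
accepted, to_review, dropped = triage_spans(annotations)
print(accepted)   # [(0, 5, 'ORG')]
print(to_review)  # [(12, 21, 'GPE')]
print(dropped)    # [(0, 5, 'PER'), (12, 21, 'LOC')]
```

Only the spans in to_review would then need to go back into the database for the review recipe.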

However, there may very well be functionality within Prodigy that I have missed. Is this the case?
If so, I would greatly appreciate some insight.

Again, thank you for your answer.