Convert DocBins or .spacy files to .jsonl format

hi @emiltj!

Thanks for your question and welcome to the Prodigy community :wave:

So you're right that the approach from a similar thread is your best bet. Since Prodigy is a developer tool, it's designed so you can write a bit of manual code to extend its functionality for one-off cases like this. I've rewritten the script from that post:

import spacy
from spacy.tokens import DocBin
import srsly

path_data = "train.spacy"

nlp = spacy.blank("en")

doc_bin = DocBin().from_disk(path_data)
examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

srsly.write_jsonl("train.jsonl", examples)

As you may have seen, you'd then need to run:

python -m prodigy db-in ner_sample train.jsonl
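
As a quick sanity check after db-in, something like this should show the dataset and how many examples it contains (the exact output depends on your Prodigy version):

python -m prodigy stats ner_sample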

You can also skip the .jsonl step entirely and load the data straight into the Prodigy database. To do this, you'd run the same conversion as above, but instead of writing out a .jsonl file, you'd add the examples via the database API:

import spacy
from spacy.tokens import DocBin
from prodigy.components.db import connect
from prodigy import set_hashes

path_data = "train.spacy"

nlp = spacy.blank("en")

doc_bin = DocBin().from_disk(path_data)
examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

db = connect()                                   # connect to the Prodigy database
db.add_dataset("ner_sample")                     # create the dataset "ner_sample"
examples = [set_hashes(eg) for eg in examples]   # add input/task hashes to each example
db.add_examples(examples, ["ner_sample"])        # add the examples to the dataset

That code does roughly the same thing as db-in, but without some of its checks, so I'd still recommend loading with db-in; the snippet just shows how it's possible.

I'm curious - can you explain why you're doing this? Is it because you have some old .spacy files you created before using Prodigy, and now you want to bring them into a Prodigy workflow? So this is more of a one-time load? Or do you plan to do this on an ongoing basis?

The reason there isn't an off-the-shelf recipe that does the opposite of data-to-spacy is that users generally want to move data out of the Prodigy database for training, not the other way around. Thanks for your help!