Convert DocBins or .spacy files to .jsonl format

hi @emiltj!

Thanks for your question and welcome to the Prodigy community :wave:

So you're right that the approach from a similar thread is your best bet. Since Prodigy is a developer tool, it's designed so you can write a bit of manual code to extend its functionality for one-off cases like this. I've rewritten the script from that post:

import spacy
from spacy.tokens import DocBin
import srsly

path_data = "train.spacy"

nlp = spacy.blank("en")

doc_bin = DocBin().from_disk(path_data)
examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

srsly.write_jsonl("train.jsonl", examples)

As you may have seen, you'd then need to run:

python -m prodigy db-in ner_sample train.jsonl
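
As a quick sanity check after db-in, something like this should show the dataset and how many examples it contains (the exact output depends on your Prodigy version):

python -m prodigy stats ner_sample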

You can also skip the .jsonl step entirely and load the data straight into the Prodigy database. To do this, you'd run the same conversion as above, but instead of writing out a .jsonl file, you'd add the examples via the database API:

import spacy
from spacy.tokens import DocBin
from prodigy.components.db import connect
from prodigy import set_hashes

path_data = "train.spacy"

nlp = spacy.blank("en")

doc_bin = DocBin().from_disk(path_data)
examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

db = connect()                                   # connect to the Prodigy database
db.add_dataset("ner_sample")                     # create the dataset "ner_sample"
examples = [set_hashes(eg) for eg in examples]   # add input/task hashes to each example
db.add_examples(examples, ["ner_sample"])        # add the examples to the dataset

That code does roughly the same thing as db-in, but without some of its checks, so I'd still recommend loading with db-in; the snippet just shows how it's possible.

I'm curious - can you explain why you're doing this? Is it because you have some old .spacy files you created before using Prodigy, and now you want to bring them into a Prodigy workflow? So this is more of a one-time load? Or do you plan to do this on an ongoing basis?

The reason there isn't an off-the-shelf recipe that does the opposite of data-to-spacy is that users generally want to move data out of the Prodigy database for training, not the other way around. Thanks for your help!