hi @emiltj!
Thanks for your question and welcome to the Prodigy community!
So you're right that the approach in a similar thread is your best bet. Since Prodigy is a developer tool, it's designed to be extended with custom code for one-off cases like this. I've rewritten the script from that post:
import spacy
from spacy.tokens import DocBin
import srsly
path_data = "train.spacy"
nlp = spacy.blank("en")
doc_bin = DocBin().from_disk(path_data)
examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})
srsly.write_jsonl("train.jsonl", examples)
As you may have seen, you'd then need to run:
python -m prodigy db-in ner_sample train.jsonl
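If you'd like to sanity-check train.jsonl before importing it, here is a minimal sketch using only the standard library. It assumes each line has the {"text", "spans"} layout written by the script above. The helper names check_ner_record and check_jsonl are hypothetical (they're not part of Prodigy or spaCy), and this is not what db-in checks internally; it only verifies that the character offsets and labels look plausible.

```python
import json


def check_ner_record(record):
    """Return a list of problems found in one {"text", "spans"} record."""
    text = record.get("text")
    if not isinstance(text, str) or not text:
        return ["missing or empty 'text'"]
    problems = []
    for i, span in enumerate(record.get("spans", [])):
        start, end, label = span.get("start"), span.get("end"), span.get("label")
        if not isinstance(start, int) or not isinstance(end, int):
            problems.append(f"span {i}: 'start' and 'end' must be integers")
        elif not 0 <= start < end <= len(text):
            # Character offsets must point inside the text
            problems.append(f"span {i}: offsets {start}-{end} fall outside the text")
        if not label:
            problems.append(f"span {i}: missing 'label'")
    return problems


def check_jsonl(path):
    """Yield (line_number, problems) for every record with at least one problem."""
    with open(path, encoding="utf8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            problems = check_ner_record(json.loads(line))
            if problems:
                yield line_number, problems
```

Calling list(check_jsonl("train.jsonl")) gives an empty list when every record passes, and otherwise the line numbers of the records worth inspecting.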
You can also skip the .jsonl step and load the data directly into the Prodigy database. The steps are the same as above, but instead of writing to .jsonl you add the examples to the database:
import spacy
from spacy.tokens import DocBin
from prodigy.components.db import connect
from prodigy import set_hashes
path_data = "train.spacy"
nlp = spacy.blank("en")
doc_bin = DocBin().from_disk(path_data)
examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})
db = connect() # connect to the database
db.add_dataset("ner_sample") # add dataset ner_sample
examples = (set_hashes(eg) for eg in examples) # add hashes; creates generator
db.add_examples(list(examples), ["ner_sample"]) # add examples to ner_sample; need list as was generator
That code does about the same thing as db-in but doesn't have some of its checks. I would therefore recommend just loading with db-in, but this still shows you how it's possible.
I'm curious - can you explain why you're doing this? Is it because of some old .spacy files you created before using Prodigy (and now you want to move them into a Prodigy workflow)? So is this more of a one-time load? Or do you plan to do this on an ongoing basis?
The reason there isn't an off-the-shelf recipe that does the opposite of data-to-spacy is that it's generally assumed users want to move data out of the Prodigy database for training, not the other way around. Thanks for your help!