Convert spaCy binary data to JSONL

Hi! Under the hood, the binary .spacy file is a serialized DocBin (https://spacy.io/api/docbin), which you can always load back as spaCy Doc objects. This gives you access to the annotations, e.g. doc.ents, so you can rebuild the examples in Prodigy's format:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("./data.spacy")  # your file here
examples = []  # examples in Prodigy's format
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

In your case, this is probably the most straightforward solution because you already have the .spacy files and you need them for training anyway.
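The snippet above only builds the examples list in memory, so to end up with an actual JSONL file you still need to write one JSON object per line. Here's a minimal sketch using only the standard library. The helper names write_jsonl and read_jsonl, the sample record and the output path are just placeholders for illustration, not part of spaCy or Prodigy:

```python
import json
from pathlib import Path


def write_jsonl(path, examples):
    """Write an iterable of dicts to `path`, one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for eg in examples:
            # ensure_ascii=False keeps non-ASCII text readable in the file
            f.write(json.dumps(eg, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Stand-in for the `examples` list built from the DocBin (made-up data)
sample_examples = [
    {
        "text": "Apple is based in Cupertino",
        "spans": [
            {"start": 0, "end": 5, "label": "ORG"},
            {"start": 18, "end": 27, "label": "GPE"},
        ],
    }
]
```

With the real data you'd call something like write_jsonl("./data.jsonl", examples). If you have srsly installed (it comes with spaCy), srsly.write_jsonl should do the same job in one line.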

For completeness (and for others who come across this thread later): you generally don't need the .spacy conversion to use IOB data with Prodigy. Prodigy's format expects entities and other spans to be defined as character offsets. So if you have token-based annotations like IOB or BILUO, you can convert them to offsets. spaCy provides handy utility functions for that: https://prodi.gy/docs/named-entity-recognition#tip-biluo-offsets
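To make the token-tags-to-offsets idea a bit more concrete, here's a rough, dependency-free sketch of what such a conversion does. This is not spaCy's implementation: the function name iob_to_offsets is made up for this example, and it assumes the tokens are joined by single spaces, which real text usually isn't. For real data, the spaCy utilities linked above are the better choice, since they work from the Doc's actual tokenization.

```python
def iob_to_offsets(words, tags):
    """Convert IOB-tagged words into a Prodigy-style dict with character offsets.

    Assumes the text is the words joined by single spaces (a simplification).
    """
    spans = []
    current = None  # the span currently being extended, if any
    position = 0    # character offset of the current word in the joined text
    for word, tag in zip(words, tags):
        start, end = position, position + len(word)
        continues = (
            tag.startswith("I-")
            and current is not None
            and current["label"] == tag[2:]
        )
        if continues:
            # Same entity continues: stretch the end offset over this word
            current["end"] = end
        else:
            # Anything else closes the open span first
            if current is not None:
                spans.append(current)
                current = None
            # B-X (or a stray I-X with nothing to continue) opens a new span
            if tag != "O":
                current = {"start": start, "end": end, "label": tag[2:]}
        position = end + 1  # +1 for the single space between words
    if current is not None:
        spans.append(current)
    return {"text": " ".join(words), "spans": spans}


words = ["Apple", "is", "based", "in", "San", "Francisco", "."]
tags = ["B-ORG", "O", "O", "O", "B-GPE", "I-GPE", "O"]
example = iob_to_offsets(words, tags)
```

Here example["spans"] contains one ORG span covering "Apple" and one GPE span covering "San Francisco", and the resulting dict has the same shape as the examples created from the DocBin.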
