Hi! Under the hood, the binary `.spacy` file is a serialized `DocBin` (https://spacy.io/api/docbin), which you can always load back as spaCy `Doc` objects. This gives you access to the annotations, e.g. the `doc.ents`:
```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("./data.spacy")  # your file here
examples = []  # examples in Prodigy's format
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [
        {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
        for ent in doc.ents
    ]
    examples.append({"text": doc.text, "spans": spans})
```
In your case, this is probably the most straightforward solution, because you already have the `.spacy` files and you need them for training anyway.
For completeness (and for others who come across this thread later): you generally don't need the `.spacy` conversion to use IOB data with Prodigy. Prodigy's format expects entities and other spans to be defined as character offsets. So if you have token-based annotations like IOB or BILUO, you can convert them to offsets – spaCy provides handy utility functions for that: https://prodi.gy/docs/named-entity-recognition#tip-biluo-offsets
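For instance, here's a minimal sketch using `spacy.training.biluo_tags_to_offsets` – the words and BILUO tags below are made up for illustration, and you'd substitute your own tokenized data:

```python
import spacy
from spacy.tokens import Doc
from spacy.training import biluo_tags_to_offsets

nlp = spacy.blank("en")

# hypothetical token-level annotations in BILUO format
words = ["Apple", "is", "based", "in", "Cupertino"]
tags = ["U-ORG", "O", "O", "O", "U-GPE"]

# build a Doc from the pre-tokenized words, then convert tags to offsets
doc = Doc(nlp.vocab, words=words)
offsets = biluo_tags_to_offsets(doc, tags)
# offsets is a list of (start_char, end_char, label) tuples

# rewrite the tuples as spans in Prodigy's format
example = {
    "text": doc.text,
    "spans": [
        {"start": start, "end": end, "label": label}
        for start, end, label in offsets
    ],
}
```

If your data is in IOB rather than BILUO, you can first convert the tags with `spacy.training.iob_to_biluo` and then apply the same function.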