CRUD operations on previously labeled spaCy data

Hey all,
I'm looking to purchase Prodigy to help train a custom NER model, but wanted to make sure it has some functionality that I haven't seen listed anywhere.

Is it possible to upload previously labeled spaCy training data and perform CRUD operations to correct the labels?


Hi @nckearly !

Yes, it's possible to upload previously labeled spaCy training data and correct the labels. The first thing you need to do is convert your labeled corpus into JSONL format to make it compatible with Prodigy.

If your training examples are already in spaCy's binary format, you can do something like this (cf. Convert spacy binary data to jsonl):

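import spacy
import srsly
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("./data.spacy")  # your file here
examples = []  # examples in Prodigy's format
for doc in doc_bin.get_docs(nlp.vocab):
    # Character offsets and label for each entity in the doc
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

# Write one JSON object per line so Prodigy can load the file
srsly.write_jsonl("./data.jsonl", examples)  # your output file here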

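Once the examples are saved as a JSONL file, you can start a Prodigy session using the ner.manual recipe. The existing spans are pre-highlighted in the interface, so you can add, remove, or relabel them: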

prodigy ner.manual dataset blank:en path/to/file.jsonl --label A,B,C

The updated annotations will then be saved to a SQLite database by default (this can be configured to use MySQL, etc.), and you can export them back to JSONL using the prodigy db-out command.
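After exporting, you'll probably want to keep only the examples you accepted before retraining. Here's a minimal sketch using just the standard library. It assumes each exported line is a task with "text", "spans" and an "answer" field, as in Prodigy's usual task format; the function name accepted_examples and the sample records are made up for illustration:

```python
import json


def accepted_examples(jsonl_lines):
    """Yield (text, entities) for every accepted annotation.

    jsonl_lines: iterable of JSON strings, one task per line.
    entities: list of (start_char, end_char, label) tuples.
    """
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task = json.loads(line)
        if task.get("answer") != "accept":
            continue  # skip rejected or ignored examples
        entities = [
            (span["start"], span["end"], span["label"])
            for span in task.get("spans", [])
        ]
        yield task["text"], entities


# Hypothetical sample mimicking two exported lines
sample = [
    '{"text": "Apple is in Cupertino", "spans": [{"start": 0, "end": 5, "label": "ORG"}], "answer": "accept"}',
    '{"text": "skip me", "spans": [], "answer": "ignore"}',
]
for text, entities in accepted_examples(sample):
    print(text, entities)
```

With a real export (for example, the output of prodigy db-out redirected to a file), you could pass an open file handle instead of the sample list, since iterating over a file yields one line at a time.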

Another reference: Script: Load data in spaCy v3's .spacy format