CRUD operations on previously labeled spacy data

nckearly · November 11, 2021, 8:37pm

Hey all,
I'm looking to purchase Prodigy to help train a custom NER model, but wanted to make sure it has a feature that I haven't seen listed anywhere.

Is it possible to upload previously labeled spaCy training data and perform CRUD operations to correct the labels?

Thanks!

ljvmiranda921 · November 15, 2021, 12:30am

Hi @nckearly !

Yes it's possible to upload previously labeled spaCy training data and correct their labels. The first thing you need to do is convert your labeled corpus into JSONL format to make it compatible with Prodigy.

If your training examples are already in spaCy's binary format, you can do something like this (cf. Convert spacy binary data to jsonl):

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("./data.spacy")  # your file here
examples = []  # examples in Prodigy's format
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})
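The loop above only builds the list in memory, so it still needs to be written to disk before Prodigy can read it. A minimal sketch using only the standard library is shown here; the `write_jsonl` helper and the sample record are illustrative assumptions, not part of Prodigy or spaCy.

```python
import json


def write_jsonl(path, records):
    """Write one JSON object per line, which is the JSONL layout."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical stand-in for the `examples` list built from the DocBin.
sample_examples = [
    {
        "text": "Apple is based in Cupertino.",
        "spans": [
            {"start": 0, "end": 5, "label": "ORG"},
            {"start": 18, "end": 27, "label": "GPE"},
        ],
    }
]
```

Calling `write_jsonl("path/to/file.jsonl", examples)` with the real list would produce the file passed to the recipe.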

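Once you've saved the examples to a JSONL file, you can start a Prodigy session using the ner.manual recipe. Spans that are already in the data show up pre-highlighted, so you can edit, remove, or add to them: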

prodigy ner.manual dataset blank:en path/to/file.jsonl --label A,B,C

The updated annotations are then saved to Prodigy's database (SQLite by default, configurable to MySQL, etc.), and you can export them back to JSONL using the prodigy db-out command.
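If it helps, here is a rough sketch of reading the exported file back in Python. It assumes each exported record carries an "answer" field ("accept", "reject" or "ignore") alongside "text" and "spans"; the `read_accepted` helper is a made-up name for illustration.

```python
import json


def read_accepted(path):
    """Yield (text, spans) for each accepted record in an exported JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Assumption: rejected and ignored examples should be dropped.
            if record.get("answer") != "accept":
                continue
            spans = [
                (span["start"], span["end"], span["label"])
                for span in record.get("spans", [])
            ]
            yield record["text"], spans
```

The resulting (text, spans) pairs use character offsets, so they can be fed into whatever conversion step you use to get back to spaCy's training format.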

Another reference: Script: Load data in spaCy v3's .spacy format
