Load annotated data in .spacy format to Prodigy for further correction

Hi @Andrey,

If the examples you need to correct are in .jsonl format, it's as easy as feeding them into the ner.manual recipe and specifying the labels you want to have available.
So if the input file is called to_be_corrected.jsonl and, assuming blank:en as the base model, the call would be:

python -m prodigy ner.manual ner_data_corrected blank:en to_be_corrected.jsonl --label MY_LABEL1,MY_LABEL2

That will load your examples with their existing annotations and allow you to correct them manually; the result will be saved in the ner_data_corrected dataset.
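For reference, each line of the input .jsonl file is a JSON object with a "text" key and an optional "spans" key holding character-offset entity annotations. A minimal sketch using only the standard library (the text, file name, and label here are just placeholders):

```python
import json

# A hypothetical pre-annotated task: one JSON object per line,
# with "text" plus "spans" given as character offsets.
example = {
    "text": "Apple is looking at buying a U.K. startup.",
    "spans": [{"start": 0, "end": 5, "label": "MY_LABEL1"}],
}
with open("to_be_corrected.jsonl", "w", encoding="utf8") as f:
    f.write(json.dumps(example) + "\n")
```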

If your data is in .spacy format, you'd have to write a small script to convert it back to .jsonl and either save it on disk or write it directly to the database.

To save on disk:

import spacy
from spacy.tokens import DocBin
import srsly

path_data = "train.spacy"

nlp = spacy.blank("en")

doc_bin = DocBin().from_disk(path_data)
examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

srsly.write_jsonl("train.jsonl", examples)
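If you don't have a .spacy file handy to try the snippet above with, you can generate a minimal one first. This is just a sketch for producing test data; the sentence, the ORG label, and the train.spacy file name are placeholders:

```python
import spacy
from spacy.tokens import DocBin, Span

# Build one Doc with a single entity annotation and serialize it
# to a .spacy file (a DocBin on disk).
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a U.K. startup.")
doc.ents = [Span(doc, 0, 1, label="ORG")]  # token 0 ("Apple") as ORG
doc_bin = DocBin(docs=[doc])
doc_bin.to_disk("train.spacy")
```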

To save directly in the DB:

import spacy
from spacy.tokens import DocBin
from prodigy.components.db import connect
from prodigy import set_hashes

path_data = "train.spacy"

nlp = spacy.blank("en")

doc_bin = DocBin().from_disk(path_data)
examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

db = connect()                                   # connect to the database
db.add_dataset("ner_sample")                     # create the "ner_sample" dataset
examples = (set_hashes(eg) for eg in examples)   # add input/task hashes (creates a generator)
db.add_examples(list(examples), ["ner_sample"])  # add the examples to "ner_sample"

The snippets come from this post on the topic :slight_smile:
