Load annotated data in .spacy format to Prodigy for further correction

Hi @Andrey,

If the examples you need to correct are in .jsonl format, it's as easy as feeding them into the ner.manual recipe and specifying the labels you want to have available.
So if the input file is called to_be_corrected.jsonl and, assuming blank:en as the base model, the call would be:

python -m prodigy ner.manual ner_data_corrected blank:en to_be_corrected.jsonl --label MY_LABEL1,MY_LABEL2

That will load your examples with their existing annotations and allow you to correct them manually; the result will be saved in the ner_data_corrected dataset.
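For reference, each line of the input .jsonl file is a JSON object with a "text" key and an optional "spans" key holding character-offset entity annotations. A minimal sketch using only the standard library (the text, file name, and label here are just placeholders):

```python
import json

# A hypothetical pre-annotated task: one JSON object per line,
# with "text" plus "spans" given as character offsets.
example = {
    "text": "Apple is looking at buying a U.K. startup.",
    "spans": [{"start": 0, "end": 5, "label": "MY_LABEL1"}],
}
with open("to_be_corrected.jsonl", "w", encoding="utf8") as f:
    f.write(json.dumps(example) + "\n")
```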

If your data is in .spacy format, you'd have to write a small script to convert it back to .jsonl and either save it on disk or write it directly to the database.

To save on disk:

import spacy
from spacy.tokens import DocBin
import srsly

path_data = "train.spacy"

nlp = spacy.blank("en")

doc_bin = DocBin().from_disk(path_data)
examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

srsly.write_jsonl("train.jsonl", examples)
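If you don't have a .spacy file handy to try the snippet above with, you can generate a minimal one first. This is just a sketch for producing test data; the sentence, the ORG label, and the train.spacy file name are placeholders:

```python
import spacy
from spacy.tokens import DocBin, Span

# Build one Doc with a single entity annotation and serialize it
# to a .spacy file (a DocBin on disk).
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a U.K. startup.")
doc.ents = [Span(doc, 0, 1, label="ORG")]  # token 0 ("Apple") as ORG
doc_bin = DocBin(docs=[doc])
doc_bin.to_disk("train.spacy")
```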

To save directly in the DB:

import spacy
from spacy.tokens import DocBin
from prodigy.components.db import connect
from prodigy import set_hashes

path_data = "train.spacy"

nlp = spacy.blank("en")

doc_bin = DocBin().from_disk(path_data)
examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

db = connect()                                   # connect to the database
db.add_dataset("ner_sample")                     # create the "ner_sample" dataset
examples = (set_hashes(eg) for eg in examples)   # add input/task hashes (creates a generator)
db.add_examples(list(examples), ["ner_sample"])  # add the examples to "ner_sample"

The snippets come from this post on the topic :slight_smile:
