Hi @Andrey,
If the examples you need to correct are in .jsonl
format, it's as easy as feeding them into the ner.manual
recipe and specifying the labels you want to have available.
So if the input file is called to_be_corrected.jsonl
and assuming blank:en
as the base model, the call would be:
python -m prodigy ner.manual ner_data_corrected blank:en to_be_corrected.jsonl --label MY_LABEL1,MY_LABEL2
That will load your examples with their existing annotations and let you correct them manually; the result will be saved in the ner_data_corrected
dataset.
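For reference, ner.manual expects each input line to be a JSON object with a "text" field and, optionally, pre-existing "spans" given as character offsets with labels. A minimal sketch of one such line, using only the standard library (the text and labels here are made-up examples):

```python
import json

# A hypothetical pre-annotated example in the shape ner.manual reads:
# one JSON object per line, spans given as character offsets.
example = {
    "text": "Apple opened a new office in Berlin.",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 29, "end": 35, "label": "GPE"},
    ],
}

# Each line of to_be_corrected.jsonl would look like this:
line = json.dumps(example)
print(line)
```

When the offsets line up with the tokenization, the spans show up pre-highlighted in the UI, ready to be corrected.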
If your data is in .spacy
format, you'd have to write a small script to convert it back to .jsonl
and either save it to disk or write it directly to the database.
To save on disk:
import spacy
from spacy.tokens import DocBin
import srsly

path_data = "train.spacy"
nlp = spacy.blank("en")
doc_bin = DocBin().from_disk(path_data)

examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

srsly.write_jsonl("train.jsonl", examples)
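Before loading the converted file into Prodigy, it can be worth sanity-checking that every span's offsets actually fall inside its text. A small stdlib-only sketch (the helper name and sample data are illustrative, not part of Prodigy):

```python
def check_spans(examples):
    """Verify that every span's character offsets fall inside the text."""
    for eg in examples:
        text = eg["text"]
        for span in eg.get("spans", []):
            assert 0 <= span["start"] < span["end"] <= len(text), (
                f"Bad offsets {span} for text: {text!r}"
            )

# Illustrative data in the same shape as the converted .jsonl lines:
examples = [
    {"text": "Berlin is in Germany.", "spans": [{"start": 0, "end": 6, "label": "GPE"}]},
]
check_spans(examples)  # raises AssertionError on malformed offsets
```

Running this over the converted examples catches off-by-one errors before they show up as misaligned highlights in the annotation UI.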
To save directly in the DB:
import spacy
from spacy.tokens import DocBin
from prodigy.components.db import connect
from prodigy import set_hashes

path_data = "train.spacy"
nlp = spacy.blank("en")
doc_bin = DocBin().from_disk(path_data)

examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

db = connect()  # connect to the database
db.add_dataset("ner_sample")  # add dataset ner_sample
examples = (set_hashes(eg) for eg in examples)  # add hashes; creates a generator
db.add_examples(list(examples), ["ner_sample"])  # add examples to ner_sample
The snippets come from this post on the topic.