Load annotated data in .spacy format into Prodigy for further correction

Hi all,

I have a set of 100 examples annotated for a span classification task using Prodigy. I realised that some of them (let's say 35) should be corrected and re-annotated. I can export all 100 annotations to either .spacy or .jsonl format and find the examples I need to amend.

My question is: once I've found the required 35 examples, how do I load them back into Prodigy for correction?

Specifically, I loaded the 100 examples into spaCy Doc objects, selected the required ones and saved them as .spacy. Is there a simple way of loading the .spacy format into Prodigy to continue annotation, or do I need to reformat it in a different way?
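For context, this is roughly what I did (file names and the selection criterion below are just placeholders):

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Load all 100 annotated examples from the exported .spacy file
doc_bin = DocBin().from_disk("all_annotations.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))

# Select the examples that need re-annotation; the criterion here is a
# placeholder (matching against the texts of the flagged examples)
texts_to_fix = {"some flagged text", "another flagged text"}
to_fix = [doc for doc in docs if doc.text in texts_to_fix]

# Save the selection to a new .spacy file
DocBin(docs=to_fix).to_disk("to_be_corrected.spacy")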

Overall, I need to load these 35 examples back into Prodigy, amend the spans and save them again.

Many thanks in advance.

Hi @Andrey,

If the examples you need to correct are in .jsonl format, it's as easy as feeding them into the ner.manual recipe and specifying the labels you want to have available.
So if the input file is called to_be_corrected.jsonl and assuming blank:en as the base model, the call would be:

python -m prodigy ner.manual ner_data_corrected blank:en to_be_corrected.jsonl --label MY_LABEL1,MY_LABEL2

That will load your examples with their existing annotations and allow you to correct them manually; the result will be saved in the ner_data_corrected dataset.
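For reference, each line of the input .jsonl file only needs the "text" and "spans" keys. A minimal record (the text and label here are purely illustrative) would look like:

{"text": "Apple is looking at buying a U.K. startup", "spans": [{"start": 0, "end": 5, "label": "ORG"}]}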

If your data is in .spacy format, you'd have to write a small script to convert it back to the .jsonl format and either save it on disk or write it directly to the database.

To save on disk:

import spacy
from spacy.tokens import DocBin
import srsly

path_data = "train.spacy"

nlp = spacy.blank("en")

# Load the DocBin and convert each Doc into a Prodigy-style task dict
doc_bin = DocBin().from_disk(path_data)
examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

# Write one JSON object per line
srsly.write_jsonl("train.jsonl", examples)

To save directly to the DB:

import spacy
from spacy.tokens import DocBin
from prodigy.components.db import connect
from prodigy import set_hashes

path_data = "train.spacy"

nlp = spacy.blank("en")

# Same conversion as above: DocBin -> list of Prodigy-style task dicts
doc_bin = DocBin().from_disk(path_data)
examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

db = connect()                                  # connect to the Prodigy database
db.add_dataset("ner_sample")                    # create the target dataset ner_sample
examples = [set_hashes(eg) for eg in examples]  # add the hashes Prodigy expects
db.add_examples(examples, ["ner_sample"])       # add the examples to ner_sample
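
Once the examples are in the database, you should be able to load them for correction straight from the dataset via the dataset: source prefix (labels are placeholders, as above):

python -m prodigy ner.manual ner_data_corrected blank:en dataset:ner_sample --label MY_LABEL1,MY_LABEL2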

The snippets come from this post on the topic 🙂


Thank you so much @magdaaniol! I hadn't realised that it would be enough to format the .jsonl file using only the two keys "text" and "spans": {"text": "...", "spans": [...]}.

Just implemented and tested it - worked like a charm!
