Load annotated data in .spacy format into Prodigy for further correction

Hi all,

I have a set of 100 examples annotated for a span classification task using Prodigy. I realised that some of them (let's say 35) should be corrected and re-annotated. I can export all 100 annotations to either .spacy or .jsonl format and find the examples I need to amend.

My question is: once I've found the required 35 examples, how do I load them back into Prodigy for correction?

Specifically, I loaded the 100 examples into spaCy Doc objects, selected the required ones and saved them as .spacy. Is there a simple way of loading the .spacy format into Prodigy to continue annotation, or do I need to reformat it in a different way?
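For context, this is roughly what I did (file names and the selection criterion below are just placeholders):

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Load all 100 annotated examples from the exported .spacy file
doc_bin = DocBin().from_disk("all_annotations.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))

# Select the examples that need re-annotation; the criterion here is a
# placeholder (matching against the texts of the flagged examples)
texts_to_fix = {"some flagged text", "another flagged text"}
to_fix = [doc for doc in docs if doc.text in texts_to_fix]

# Save the selection to a new .spacy file
DocBin(docs=to_fix).to_disk("to_be_corrected.spacy")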

Overall, I need to load these 35 examples back into Prodigy, amend the spans and save them again.

Many thanks in advance.

Hi @Andrey,

If the examples you need to correct are in .jsonl format, it's as easy as feeding them into the ner.manual recipe and specifying the labels you want to have available.
So if the input file is called to_be_corrected.jsonl and assuming blank:en as the base model, the call would be:

python -m prodigy ner.manual ner_data_corrected blank:en to_be_corrected.jsonl --label MY_LABEL1,MY_LABEL2

That will load your examples with their existing annotations and allow you to correct them manually; the result will be saved in the ner_data_corrected dataset.
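For reference, each line of the input .jsonl file only needs the "text" and "spans" keys. A minimal record (the text and label here are purely illustrative) would look like:

{"text": "Apple is looking at buying a U.K. startup", "spans": [{"start": 0, "end": 5, "label": "ORG"}]}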

If your data is in .spacy format, you'd have to write a small script to convert it back to the .jsonl format and either save it on disk or write it directly to the database.

To save on disk:

import spacy
from spacy.tokens import DocBin
import srsly

path_data = "train.spacy"

nlp = spacy.blank("en")

# Load the DocBin and convert each Doc into a Prodigy-style task dict
doc_bin = DocBin().from_disk(path_data)
examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

# Write one JSON object per line
srsly.write_jsonl("train.jsonl", examples)

To save directly to the DB:

import spacy
from spacy.tokens import DocBin
from prodigy.components.db import connect
from prodigy import set_hashes

path_data = "train.spacy"

nlp = spacy.blank("en")

# Same conversion as above: DocBin -> list of Prodigy-style task dicts
doc_bin = DocBin().from_disk(path_data)
examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

db = connect()                                  # connect to the Prodigy database
db.add_dataset("ner_sample")                    # create the target dataset ner_sample
examples = [set_hashes(eg) for eg in examples]  # add the hashes Prodigy expects
db.add_examples(examples, ["ner_sample"])       # add the examples to ner_sample
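
Once the examples are in the database, you should be able to load them for correction straight from the dataset via the dataset: source prefix (labels are placeholders, as above):

python -m prodigy ner.manual ner_data_corrected blank:en dataset:ner_sample --label MY_LABEL1,MY_LABEL2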

The snippets come from this post on the topic 🙂


Thank you so much @magdaaniol! I hadn't realised that it would be enough to format the .jsonl file using only the two keys "text" and "spans": {"text": "...", "spans": [...]}.

Just implemented and tested it - worked like a charm!
