Convert spaCy binary data to JSONL

Hello,

I just recently got Prodigy, and before that I was using another free program to annotate data for my spaCy model. From that program I could download my annotated data in IOB format, which I then converted to spaCy's binary format with the spacy convert command.

I would like to use this already-annotated data in Prodigy, partly to run the "train-curve" command and see how much improvement I could still get by adding more data.
I can't seem to find an easy way to convert a ".spacy" file, a spaCy JSON file, or an ".iob" file to JSONL. I found the deprecated "ner.iob-to-gold", so I guess there might be a new recipe for that purpose, but I can't find it. I also found "data-to-spacy", which does the reverse.

Is there a way to do this?

I'm using the latest version of spaCy (3.0.5, I believe) and Prodigy nightly.

Hi! Under the hood, the binary .spacy file is a serialized DocBin, which you can always load back as spaCy Doc objects: https://spacy.io/api/docbin This gives you access to the annotations, e.g. the doc.ents.

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("./data.spacy")  # your file here
examples = []  # examples in Prodigy's format
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

In your case, this is probably the most straightforward solution, because you already have the .spacy files and you need them for training anyway.
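To finish the conversion, the examples list can then be written out as JSONL (one JSON object per line). A minimal sketch using the standard library's json module; the example data and the output filename data.jsonl are just placeholders:

```python
import json

# Examples in Prodigy's format, as produced by the loop above
# (shown inline here so the snippet is self-contained)
examples = [
    {"text": "Apple is a company",
     "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
]

# JSONL is simply one JSON object per line
with open("data.jsonl", "w", encoding="utf8") as f:
    for eg in examples:
        f.write(json.dumps(eg) + "\n")
```

If you have srsly installed (it ships as a spaCy dependency), srsly.write_jsonl("data.jsonl", examples) does the same in one call.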

For completeness (and for others who come across this thread later): you generally don't need the .spacy conversion to use IOB data with Prodigy. Prodigy's format expects entities and other spans to be defined as character offsets. So if you have token-based annotations like IOB or BILUO, you can convert them to offsets – spaCy provides handy utility functions for that: https://prodi.gy/docs/named-entity-recognition#tip-biluo-offsets
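For example, the conversion from token-based tags to character offsets can be sketched with spaCy's iob_to_biluo and biluo_tags_to_offsets helpers from spacy.training (the tokens and tags below are made up for illustration):

```python
import spacy
from spacy.tokens import Doc
from spacy.training import iob_to_biluo, biluo_tags_to_offsets

nlp = spacy.blank("en")

# Hypothetical token-level IOB annotations
words = ["Apple", "is", "a", "company"]
iob_tags = ["B-ORG", "O", "O", "O"]

doc = Doc(nlp.vocab, words=words)
biluo_tags = iob_to_biluo(iob_tags)               # IOB -> BILUO scheme
offsets = biluo_tags_to_offsets(doc, biluo_tags)  # [(0, 5, 'ORG')]
```

The resulting (start, end, label) tuples map directly onto the "spans" entries in Prodigy's JSONL format.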


Thank you Ines, this worked like a charm!

As a little note to anyone who plans to reuse this script: just replace ent.label with ent.label_ to get the label string instead of the integer label ID.

Ah sorry, that was a typo :sweat_smile: Just edited my previous post!

Thanks for the hint @ines. I have a slightly different problem, but it's very much related to your reply.

I have .spacy train and valid files and am trying to reuse code that used to work with JSON files (spaCy v2). So here's what I was trying:

    nlp = spacy.load("my-model")

    doc_bin = DocBin().from_disk(path_test_data)
    examples = []
    for doc in doc_bin.get_docs(nlp.vocab):
        spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
        examples.append(Example.from_dict(nlp.make_doc(doc.text), spans))

However, this part is failing due to:

TypeError: Argument 'example_dict' has incorrect type (expected dict, got list)

Initially, my code was working as:

    examples = []
    for text, annotations in TEST_DATA:
        examples.append(Example.from_dict(nlp.make_doc(text), annotations))

with TEST_DATA being loaded from JSON.

Thank you in advance.

Hi @milos-cuculovic !

I think you've already posted in the spaCy discussions forum. For posterity, here's the link with the answer.
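For readers landing here from a search, the error itself comes from passing a list where Example.from_dict expects a dict of annotations, with entities given as (start, end, label) tuples under the "entities" key. A sketch of one way to adapt the loop (this is my reading of the error, not the linked answer verbatim; a tiny in-memory DocBin stands in for the .spacy file on disk):

```python
import spacy
from spacy.tokens import Doc, DocBin, Span
from spacy.training import Example

nlp = spacy.blank("en")

# Build a tiny DocBin in memory to stand in for DocBin().from_disk(...)
doc = Doc(nlp.vocab, words=["Apple", "is", "a", "company"])
doc.ents = [Span(doc, 0, 1, label="ORG")]
doc_bin = DocBin(docs=[doc])

examples = []
for doc in doc_bin.get_docs(nlp.vocab):
    # Example.from_dict wants a dict of annotations, not a bare list of spans
    ents = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    examples.append(Example.from_dict(nlp.make_doc(doc.text),
                                      {"entities": ents}))
```

The key change from the failing snippet above is wrapping the entity tuples in {"entities": ...} rather than passing the list directly.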