Script: Load data in spaCy v3's .spacy format

This topic came up in the Prodigy nightly thread: it'd be cool to have Prodigy load data and annotations in spaCy v3's new binary .spacy format, which is a serialized collection of Doc objects (DocBin) under the hood. However, it's difficult to provide a built-in solution here, because the DocBin requires an nlp object and its vocab to restore the documents, and you often want to use different settings to decide whether to include existing annotations in the data, and if so, which annotations to use.

Here's a quick script I wrote that takes a path to a .spacy file, the name of a spaCy pipeline or blank:en to use for loading back the Doc objects, and optional settings for including named entities as "spans" and text classification annotations as "options" (for use with textcat.manual).

It prints the converted JSON data, so you can pipe its output forward to any Prodigy recipe. :warning: Note the - used here as the input source argument: it tells Prodigy to read from the output piped forward by the previous process (i.e. the loader script). You can read more about this here.

Usage examples

python load_docbin.py ./train.spacy blank:en --ner | prodigy ner.manual your_dataset blank:en - --label PERSON,ORG
python load_docbin.py ./train.spacy blank:en --textcat | prodigy textcat.manual your_dataset - --label CAT1,CAT2

load_docbin.py

from typing import Iterator, Dict, Any
import spacy
from spacy.tokens import DocBin
from spacy.language import Language
from prodigy.components.preprocess import get_token, sync_spans_to_tokens
from pathlib import Path
import typer


def main(
    # fmt: off
    path: Path = typer.Argument(..., help="Path to .spacy file"),
    spacy_model: str = typer.Argument(..., help="Name or path to spaCy pipeline or blank:en etc. for blank model, used to load DocBin"),
    include_ner: bool = typer.Option(False, "--ner", "-N", help="Include doc.ents as spans, if available"),
    include_textcat: bool = typer.Option(False, "--textcat", "-T", help="Include doc.cats as options, if available")
    # fmt: on
):
    """Load a binary .spacy file and output annotations in Prodigy's format."""
    nlp = spacy.load(spacy_model)
    doc_bin = DocBin().from_disk(path)
    examples = convert_examples(
        nlp, doc_bin, include_ner=include_ner, include_textcat=include_textcat
    )
    for eg in examples:
        print(eg)


def convert_examples(
    nlp: Language,
    doc_bin: DocBin,
    include_ner: bool = False,
    include_textcat: bool = False,
) -> Iterator[Dict[str, Any]]:
    docs = doc_bin.get_docs(nlp.vocab)
    for doc in docs:
        eg = {"text": doc.text, "tokens": [get_token(token, token.i) for token in doc]}
        if include_ner:
            spans = [
                {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
                for ent in doc.ents
            ]
            eg["spans"] = sync_spans_to_tokens(spans, eg["tokens"])
        if include_textcat:
            eg["options"] = [{"id": cat, "text": cat} for cat in doc.cats]
            eg["accept"] = [cat for cat, score in doc.cats.items() if score == 1.0]
        yield eg


if __name__ == "__main__":
    typer.run(main)

In a custom recipe, you could also include the convert_examples function directly, or customise it to provide different annotations. For instance, you could use the new spans.manual with span annotations defined under a given key of doc.spans. You can extract the annotations the same way you'd normally access them on a processed Doc object.
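For example, converting span annotations stored under a key of doc.spans could look like the sketch below. To keep it runnable without a loaded pipeline, it uses a lightweight stand-in object exposing only the attributes the conversion needs; the FakeSpan class and the "sc" key mentioned in the comment are assumptions for illustration, not part of the script above:

```python
from dataclasses import dataclass


# Stand-in for a spaCy Span: only the attributes the conversion needs.
# With a real Doc, you'd pass e.g. doc.spans["sc"] instead.
@dataclass
class FakeSpan:
    start_char: int
    end_char: int
    label_: str


def spans_to_prodigy(spans):
    # Convert span-like objects into Prodigy-style span dicts with
    # character offsets and a label, as used by spans.manual.
    return [
        {"start": s.start_char, "end": s.end_char, "label": s.label_}
        for s in spans
    ]


print(spans_to_prodigy([FakeSpan(0, 5, "PERSON")]))
# -> [{'start': 0, 'end': 5, 'label': 'PERSON'}]
```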


Hi @ines!

Thanks a lot for the sample script! I ran into some trouble, though: while running it, I kept getting this error:

✘ Failed to load task (invalid JSON on line 1)
This error pretty much always means that there's something wrong with this line
of JSON and Python can't load it. Even if you think it's correct, something must
confuse it. Try calling json.loads(line) on each line or use a JSON linter.

I traced this back to the examples being printed with Python's dict repr, which uses single quotes instead of the double quotes that valid JSON requires:

python load_docbin.py ./train.spacy blank:en --ner
> {'text': 'my text' etc}
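
The difference can be shown in isolation: printing a dict emits its Python repr, which json.loads can't parse, while json.dumps produces valid JSON:

```python
import json

eg = {"text": "my text"}
print(eg)              # -> {'text': 'my text'}  (Python repr, single quotes)
print(json.dumps(eg))  # -> {"text": "my text"}  (valid JSON, safe to pipe)
```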

So I fixed the bug by calling json.dumps on each example. Here is the updated script in case anyone needs it:

load_docbin.py

from typing import Iterator, Dict, Any
import spacy
from spacy.tokens import DocBin
from spacy.language import Language
from prodigy.components.preprocess import get_token, sync_spans_to_tokens
from pathlib import Path
import typer
import json

def main(
    # fmt: off
    path: Path = typer.Argument(..., help="Path to .spacy file"),
    spacy_model: str = typer.Argument(..., help="Name or path to spaCy pipeline or blank:en etc. for blank model, used to load DocBin"),
    include_ner: bool = typer.Option(False, "--ner", "-N", help="Include doc.ents as spans, if available"),
    include_textcat: bool = typer.Option(False, "--textcat", "-T", help="Include doc.cats as options, if available")
    # fmt: on
):
    """Load a binary .spacy file and output annotations in Prodigy's format."""
    nlp = spacy.load(spacy_model)
    doc_bin = DocBin().from_disk(path)
    examples = convert_examples(
        nlp, doc_bin, include_ner=include_ner, include_textcat=include_textcat
    )
    for eg in examples:
        print(eg)


def convert_examples(
    nlp: Language,
    doc_bin: DocBin,
    include_ner: bool = False,
    include_textcat: bool = False,
) -> Iterator[str]:  # yields JSON strings, since json.dumps is applied below
    docs = doc_bin.get_docs(nlp.vocab)
    for doc in docs:
        eg = {"text": doc.text, "tokens": [get_token(token, token.i) for token in doc]}
        if include_ner:
            spans = [
                {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
                for ent in doc.ents
            ]
            eg["spans"] = sync_spans_to_tokens(spans, eg["tokens"])
        if include_textcat:
            eg["options"] = [{"id": cat, "text": cat} for cat in doc.cats]
            eg["accept"] = [cat for cat, score in doc.cats.items() if score == 1.0]
        yield json.dumps(eg)


if __name__ == "__main__":
    typer.run(main)

Also, for folks who would like to convert .spacy format to JSONL, you can call the above script like so:

python load_docbin.py dev.spacy blank:en --ner > out.jsonl
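
As a quick sanity check on the converted file, every line of the JSONL output should parse individually as a JSON object with a "text" key. The helper name and the sample lines below are just for illustration:

```python
import json


def validate_jsonl(lines):
    # Check that every non-empty line parses as a JSON object with a
    # "text" key, as Prodigy expects; json.loads raises on invalid JSON.
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        eg = json.loads(line)
        assert "text" in eg
        examples.append(eg)
    return examples


sample = ['{"text": "my text", "tokens": []}', ""]
print(len(validate_jsonl(sample)))  # prints 1
```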

hi @drom!

Thank you for your fix! We really appreciate the help.

And I saw this is your first post -- welcome to the Prodigy community :wave:

We're happy to have you join us. :slight_smile:


Thank you @ryanwesslen! Happy to be around :slight_smile:

Thank you for providing the script!

I tried using it to write files for the review recipe. However, that won't work unless you add some extra metadata to each JSON example.
For anyone stumbling across this post in hopes of doing just that, here's what I added for my manual NER review, for example:

    for doc in docs:
        eg = {
            "text": doc.text,
            "tokens": [get_token(token, token.i) for token in doc],
            "_is_binary": False,
            "_view_id": "ner_manual",
            "answer": "accept",
            "_timestamp": <some_timestamp>,
        }