Script: Load data in spaCy v3's .spacy format

This topic came up in the Prodigy nightly thread: it'd be cool to have Prodigy load data and annotations in spaCy v3's new binary .spacy format, which is a serialized collection of Doc objects (DocBin) under the hood. However, it's difficult to provide a built-in solution here because the DocBin requires an nlp object and its vocab to restore the documents, and you often want to use different settings to decide whether to include existing annotations in the data, and if so, which annotations to use.

Here's a quick script I wrote that takes a path to a .spacy file, the name of a spaCy pipeline or blank:en to use for loading back the Doc objects, and optional settings for including named entities as "spans" and text classification annotations as "options" (for use with textcat.manual).

It prints the converted JSON data, one example per line, so you can pipe its output forward to any Prodigy recipe. Note the - used as the input source argument in the usage examples: it tells Prodigy to read from the output piped forward by the previous process (i.e. the loader script). You can read more about this here.
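
To illustrate what "piping JSON forward" means in practice, here's a minimal sketch of the newline-delimited JSON round trip. The helper names write_jsonl and read_jsonl are made up for this example and this is not Prodigy's actual reader; it only assumes that the consuming process expects one JSON object per line, which is why each record has to go through a JSON encoder rather than being printed as a Python dict.

```python
import io
import json
from typing import Any, Dict, Iterable, Iterator, TextIO


def write_jsonl(examples: Iterable[Dict[str, Any]], stream: TextIO) -> None:
    """Write one JSON object per line (hypothetical helper for illustration)."""
    for eg in examples:
        stream.write(json.dumps(eg) + "\n")


def read_jsonl(stream: TextIO) -> Iterator[Dict[str, Any]]:
    """Parse one JSON object per non-empty line (stand-in for the consumer)."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Simulate the pipe with an in-memory buffer instead of stdout/stdin
buffer = io.StringIO()
write_jsonl([{"text": "Hello world"}, {"text": "Second example"}], buffer)
buffer.seek(0)
records = list(read_jsonl(buffer))
```

In the real setup, the writer side would be sys.stdout in the loader script and the reader side would be whatever Prodigy does when it receives - as the source.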

Usage examples

python load_docbin.py ./train.spacy blank:en --ner | prodigy ner.manual your_dataset blank:en - --label PERSON,ORG
python load_docbin.py ./train.spacy blank:en --textcat | prodigy textcat.manual your_dataset - --label CAT1,CAT2

load_docbin.py

from typing import Iterator, Dict, Any
import json
import spacy
from spacy.tokens import DocBin
from spacy.language import Language
from prodigy.components.preprocess import get_token, sync_spans_to_tokens
from pathlib import Path
import typer


def main(
    # fmt: off
    path: Path = typer.Argument(..., help="Path to .spacy file"),
    spacy_model: str = typer.Argument(..., help="Name or path to spaCy pipeline or blank:en etc. for blank model, used to load DocBin"),
    include_ner: bool = typer.Option(False, "--ner", "-N", help="Include doc.ents as spans, if available"),
    include_textcat: bool = typer.Option(False, "--textcat", "-T", help="Include doc.cats as options, if available")
    # fmt: on
):
    """Load a binary .spacy file and output annotations in Prodigy's format."""
    # The pipeline is only needed for its vocab, which restores the Doc objects
    nlp = spacy.load(spacy_model)
    doc_bin = DocBin().from_disk(path)
    examples = convert_examples(
        nlp, doc_bin, include_ner=include_ner, include_textcat=include_textcat
    )
    for eg in examples:
        # Serialize as JSON, one example per line: printing the dict directly
        # would emit a Python repr (single quotes), which isn't valid JSON
        print(json.dumps(eg))


def convert_examples(
    nlp: Language,
    doc_bin: DocBin,
    include_ner: bool = False,
    include_textcat: bool = False,
) -> Iterator[Dict[str, Any]]:
    # Restore the Doc objects lazily using the pipeline's vocab
    docs = doc_bin.get_docs(nlp.vocab)
    for doc in docs:
        eg = {"text": doc.text, "tokens": [get_token(token, token.i) for token in doc]}
        if include_ner:
            # Character offsets of each entity, then aligned to the tokens above
            spans = [
                {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
                for ent in doc.ents
            ]
            eg["spans"] = sync_spans_to_tokens(spans, eg["tokens"])
        if include_textcat:
            # Every category becomes an option; categories scored 1.0 are pre-selected
            eg["options"] = [{"id": cat, "text": cat} for cat in doc.cats]
            eg["accept"] = [cat for cat, score in doc.cats.items() if score == 1.0]
        yield eg


if __name__ == "__main__":
    typer.run(main)
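
To make the --textcat branch a bit more concrete, here's a standalone sketch of the same mapping applied to a plain dict shaped like doc.cats. The function name cats_to_choice and the CAT1/CAT2 labels are only for illustration; the logic mirrors the two lines in convert_examples and doesn't need spaCy or Prodigy to run.

```python
from typing import Any, Dict


def cats_to_choice(cats: Dict[str, float]) -> Dict[str, Any]:
    """Map label -> score to "options" (all labels) and "accept" (labels scored 1.0)."""
    return {
        # Every label is offered as a selectable option
        "options": [{"id": cat, "text": cat} for cat in cats],
        # Only labels with a gold score of exactly 1.0 are pre-selected
        "accept": [cat for cat, score in cats.items() if score == 1.0],
    }


task = cats_to_choice({"CAT1": 1.0, "CAT2": 0.0})
# task["accept"] is ["CAT1"]; both labels appear under task["options"]
```

Note that the exact == 1.0 comparison fits gold-standard annotations, where scores are 0.0 or 1.0. If your .spacy file holds model predictions instead, you'd probably want a threshold here.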

In a custom recipe, you could also include the convert_examples function directly, or customise it to provide different annotations. For instance, you could use the new spans.manual with span annotations defined under a given key of doc.spans. You can extract the annotations the same way you'd normally access them on a processed Doc object.
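
As a rough sketch of the doc.spans idea, the helper below reads the span group stored under a given key and returns span dicts in the same shape the script produces for doc.ents. The name spans_from_key is hypothetical, and the function is written against the attributes a spaCy Doc and Span expose (doc.spans, start_char, end_char, label_) without importing spaCy, so treat it as a starting point rather than a finished recipe component.

```python
from typing import Any, Dict, List


def spans_from_key(doc: Any, spans_key: str) -> List[Dict[str, Any]]:
    """Convert the spans under doc.spans[spans_key] to start/end/label dicts.

    `doc` is expected to be a spaCy Doc; only doc.spans and each span's
    start_char, end_char and label_ attributes are accessed.
    """
    if spans_key not in doc.spans:
        # Nothing stored under this key: return no spans instead of raising
        return []
    return [
        {"start": span.start_char, "end": span.end_char, "label": span.label_}
        for span in doc.spans[spans_key]
    ]
```

Inside convert_examples, you could then pass the result through sync_spans_to_tokens together with eg["tokens"], the same way the --ner branch does, and stream the examples into spans.manual. Which key to read depends on how the data was created, so it probably makes sense to expose it as another command-line option.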
