Feeding prodigy annotated data to spacy in python

I haven't been able to find a tutorial on how to use the Prodigy annotated data in Python - e.g. the train.spacy and dev.spacy files (after using data-to-spacy). I just want to create a train_data and test_data list (or whatever) that I can train a spaCy model with (using nlp.update) - is this possible? While I love the Prodigy program, I'm not a fan of operating the training via the terminal.

Thanks in advance

You can directly use the binary (.spacy) files to train a model via the spaCy CLI interface.


Yeah, I'm aware of that, and that's exactly what I'm trying to avoid.

Hi! Under the hood, the .spacy files are just serialized DocBins, so you can always load them back from disk and get a list of spaCy Doc objects: https://spacy.io/api/docbin#from_disk
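For instance, a minimal sketch of round-tripping a DocBin: the filename "train.spacy" stands in for the file that data-to-spacy produces, and here a small DocBin is built first so the example is self-contained.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Build and save a small DocBin (this stands in for Prodigy's output file).
doc_bin = DocBin(docs=[nlp("Hello world"), nlp("Another doc")])
doc_bin.to_disk("train.spacy")

# Load it back from disk and recover the list of Doc objects.
loaded = DocBin().from_disk("train.spacy")
docs = list(loaded.get_docs(nlp.vocab))
print(len(docs))  # number of annotated Docs in the file
```

Each Doc carries whatever annotations were serialized (entities, spans, categories, etc.), so from here you can inspect them or convert them into whatever format you need.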

Alternatively, you can always load your Prodigy annotations from Python by connecting to the database: https://prodi.gy/docs/api-database#database

That said, instead of implementing your own training loop in v3, we'd always recommend going via spaCy's training utilities because it'd otherwise be very difficult to get good results. There are just a lot of settings you need to get right in order to achieve optimal performance, and you wouldn't want to do all of this manually. If you really don't want to use the CLI, you can always call into spaCy's helpers yourself:

from spacy.training.loop import train
from spacy.training.initialize import init_nlp
from spacy.util import load_config
import sys

config_path = "/path/to/config.cfg"
output_path = "/path/to/output"  # directory the trained pipeline is saved to
overrides = {"paths.dev": "/path/to/dev.spacy", "paths.train": "/path/to/train.spacy"}
use_gpu = -1  # -1 means train on CPU

# Load the config with the data path overrides, initialize the pipeline, and train
config = load_config(config_path, overrides=overrides, interpolate=False)
nlp = init_nlp(config, use_gpu=use_gpu)
train(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)

Thanks a lot, I'll check it out :slight_smile: