Feeding Prodigy-annotated data to spaCy in Python

mikkelyo · October 7, 2021, 12:30pm

I haven't been able to find a tutorial on how to use Prodigy-annotated data in Python, e.g. the train.spacy and dev.spacy files (after using data-to-spacy). I just want to create a train_data and test_data list (or whatever) that I can train a spaCy model with (using nlp.update). Is this possible? While I love Prodigy, I'm not a fan of running the training from the terminal.

Thanks in advance

kirilov · October 7, 2021, 2:24pm

You can use the binary (.spacy) files directly to train a model via the spaCy CLI.

mikkelyo · October 7, 2021, 2:27pm

Yeah I'm aware of that, and that is exactly what I'm trying to avoid

ines · October 8, 2021, 9:28am

Hi! Under the hood, the .spacy files are just serialized DocBins, so you can always load them back from disk and get a list of spaCy Doc objects: https://spacy.io/api/docbin#from_disk
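
As a rough sketch of what that could look like (not an official recipe): the snippet below reads a .spacy file back into a list of Doc objects and wraps each one in an Example, which is the object spaCy v3's nlp.update works with. The helper names load_docs and to_examples, the blank English pipeline and the file path are assumptions made for illustration.

```python
from pathlib import Path

import spacy
from spacy.tokens import DocBin
from spacy.training import Example


def load_docs(path, nlp):
    """Load the Doc objects stored in a .spacy file (hypothetical helper)."""
    doc_bin = DocBin().from_disk(path)
    # get_docs needs a vocab to rebuild the Doc objects
    return list(doc_bin.get_docs(nlp.vocab))


def to_examples(docs, nlp):
    """Pair each annotated Doc with an unannotated copy of its text."""
    return [Example(nlp.make_doc(doc.text), doc) for doc in docs]


# Assumption: a blank English pipeline; use the language of your annotations.
nlp = spacy.blank("en")

# Placeholder path to a file created by data-to-spacy.
train_path = Path("/path/to/train.spacy")
if train_path.exists():
    train_docs = load_docs(train_path, nlp)
    train_examples = to_examples(train_docs, nlp)
    print(f"Loaded {len(train_examples)} training examples")
```

The annotations (entities, categories and so on) are then available on each Doc, e.g. doc.ents or doc.cats, depending on what was annotated.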

Alternatively, you can always load your Prodigy annotations from Python by connecting to the database: https://prodi.gy/docs/api-database#database

That said, instead of implementing your own training loop in v3, we'd always recommend going via spaCy's training utilities because it'd otherwise be very difficult to get good results. There are just a lot of settings you need to get right in order to achieve optimal performance, and you wouldn't want to do all of this manually. If you really don't want to use the CLI, you can always call into spaCy's helpers yourself:

import sys
from pathlib import Path

from spacy.training.initialize import init_nlp
from spacy.training.loop import train
from spacy.util import load_config

config_path = "/path/to/config.cfg"
# Placeholder output directory; passed as a Path because spaCy joins sub-directories onto it
output_path = Path("/path/to/output")
overrides = {"paths.dev": "/path/to/dev.spacy", "paths.train": "/path/to/train.spacy"}
use_gpu = -1  # -1 runs on CPU

config = load_config(config_path, overrides=overrides, interpolate=False)
nlp = init_nlp(config, use_gpu=use_gpu)
train(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)

mikkelyo · October 8, 2021, 9:41am

Thanks a lot, I'll check it out
