Prodigy ner.batch-train vs spaCy train

Hi – this is a totally valid question :slightly_smiling_face:

Since Prodigy is primarily a developer tool, the built-in batch-train commands were also designed with development workflows in mind. They’re optimised to train from Prodigy-style annotations and smaller datasets, include more complex logic for handling evaluation sets, and output more detailed training statistics.
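
For reference, a minimal invocation might look something like this (the dataset name and output path are placeholders – see `prodigy ner.batch-train --help` for the exact flags available in your version):

```bash
# Train an NER model from the annotations stored in a Prodigy dataset,
# holding out 20% of the examples for evaluation.
# "my_ner_dataset" and the output path are placeholders.
prodigy ner.batch-train my_ner_dataset en_core_web_sm \
    --output /tmp/ner-model --n-iter 10 --eval-split 0.2
```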

Prodigy’s ner.batch-train workflow also supports training from “incomplete” annotations out-of-the-box – for example, a selection of examples biased by the model’s scores, or binary accept/reject decisions collected with recipes like ner.teach. There’s not really an easy way to train from the sparse data formats created by the active learning workflow using spaCy alone – at least not out-of-the-box.
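
To make that concrete, here’s a rough sketch of that workflow (the dataset name, label and source file are placeholders, and flags like `--no-missing` may vary between Prodigy versions – check `prodigy ner.batch-train --help`):

```bash
# Collect binary accept/reject decisions with active learning.
# "ner_binary", the ORG label and news.jsonl are all placeholders.
prodigy ner.teach ner_binary en_core_web_sm news.jsonl --label ORG

# Train from those sparse, binary annotations. We deliberately do NOT
# pass --no-missing here, so unannotated tokens are treated as missing
# values instead of being interpreted as "definitely not an entity".
prodigy ner.batch-train ner_binary en_core_web_sm --output /tmp/ner-binary-model
```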

spaCy’s spacy train command, on the other hand, was designed for training from larger corpora, often annotated for several components (named entities, part-of-speech tags, dependencies etc.). It also supports more configuration options and settings for tuning hyperparameters.
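
For comparison, a spaCy v2-style training run could look roughly like this (the corpus paths are placeholders, and the exact arguments depend on your spaCy version – see `spacy train --help`):

```bash
# Train an English NER model from corpora in spaCy's JSON training
# format. train.json, dev.json and the output path are placeholders.
python -m spacy train en /tmp/spacy-model train.json dev.json \
    --pipeline ner --n-iter 30
```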

TL;DR: If you want to run quick experiments, train from binary annotations, or export prototype models from your Prodigy annotations, use the batch-train recipes. If you want to train your production model on a large corpus of annotations, use spacy train.
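
And if you do outgrow the batch-train recipes, you’re not locked in – you can always export the raw annotations, e.g. with db-out, and convert them to spaCy’s training format yourself (the dataset name and file path are placeholders):

```bash
# Export the raw annotations as newline-delimited JSON, ready to be
# converted into spaCy's training format for use with spacy train.
prodigy db-out my_ner_dataset > annotations.jsonl
```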