ner.batch_train vs spacy nlp.begin_training

Hi again -

I’m curious: to refine the NER model’s accuracy by training on docs representative of the end use case, is it recommended to batch-train spaCy’s en_core_web_lg model in Prodigy, or to use the annotated data from Prodigy (from the ner.manual recipe, specifically) as a training set loaded via spaCy’s nlp.begin_training method?

I’m unsure whether the optimization methods differ – but the ETL needed to get the JSONL export into a data structure spaCy can ingest directly seems to suggest that’s not the intended route?
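For context, the "ETL" in question is fairly small. Below is a minimal sketch (using only the standard library, with hypothetical example data) of converting a ner.manual-style JSONL export – records with a "text" key, a "spans" list of "start"/"end"/"label" offsets, and an "answer" field – into the (text, annotations) tuples spaCy's training examples use:

```python
import json

def prodigy_to_spacy(jsonl_lines):
    """Convert ner.manual JSONL export lines to spaCy-style training tuples.

    Assumes each record has "text", an optional "spans" list with
    "start", "end" and "label" character offsets, and an "answer"
    field set in the annotation UI.
    """
    examples = []
    for line in jsonl_lines:
        record = json.loads(line)
        if record.get("answer") != "accept":
            continue  # skip rejected or ignored examples
        entities = [
            (span["start"], span["end"], span["label"])
            for span in record.get("spans", [])
        ]
        examples.append((record["text"], {"entities": entities}))
    return examples

# Hypothetical exported record for illustration
lines = [
    '{"text": "Apple opened a store in Berlin.", '
    '"spans": [{"start": 0, "end": 5, "label": "ORG"}, '
    '{"start": 24, "end": 30, "label": "GPE"}], "answer": "accept"}'
]
training_data = prodigy_to_spacy(lines)
print(training_data)
```

This is only a sketch of the data shape, not an official converter; field names beyond those listed in the Prodigy docs are assumptions.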


Since Prodigy focuses a lot on usage as a developer tool, the built-in batch-train commands were also designed with the development aspect in mind. They’re optimised to train from Prodigy-style annotations and smaller datasets, include more complex logic to handle evaluation sets, and output more detailed training statistics.

Prodigy’s ner.batch-train workflow was also created under the assumption that annotations would be collected with ner.teach – i.e. a selection of examples biased by the score, with binary decisions only. There’s not really an easy way to train from the sparse data created by the active learning workflow using spaCy – at least not out of the box.
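To make the "sparse" point concrete, here's a small sketch with hypothetical ner.teach-style records: each one carries a single pre-highlighted candidate span plus a binary accept/reject answer, so even after filtering to accepted examples you only know that those specific spans are correct – nothing about the rest of each text, which is what a standard gold-annotation training loop assumes:

```python
# Hypothetical ner.teach-style records: one candidate span each,
# with a binary accept/reject decision from the annotator.
records = [
    {"text": "Uber was founded in 2009.",
     "spans": [{"start": 0, "end": 4, "label": "ORG"}],
     "answer": "accept"},
    {"text": "Uber was founded in 2009.",
     "spans": [{"start": 20, "end": 24, "label": "ORG"}],
     "answer": "reject"},
]

# Keeping only accepted spans still leaves sparse annotations:
# accepted spans are known-correct, but tokens outside them are
# neither confirmed entities nor confirmed non-entities.
accepted = [r for r in records if r["answer"] == "accept"]
print(len(accepted), "of", len(records), "records accepted")
```

Training directly from data like this requires an update strategy that treats unannotated tokens as unknown rather than as "not an entity", which is the complex logic the batch-train recipes handle for you.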

The ner.manual recipe is still pretty new, and we haven’t ourselves trained models entirely from annotations collected with this workflow. But there shouldn’t be a problem converting them to spaCy’s training format, and we’re thinking about including a recipe in a future version of Prodigy that takes care of this. (See this thread for a discussion on the topic.)
