Is it possible to run batch train on a file

From the documentation it seems that batch-train currently expects the dataset to be in the DB. Is there a way to train directly on a JSONL data file? Alternatively, it’s not super hard to dump the file to the DB as a dataset, but maybe there’s a way around the extra step…

The batch-train recipes are especially optimised for running quick experiments and for training with sparse annotations, i.e. binary annotations collected with recipes like ner.teach. That’s also why they work on datasets, since that’s usually where those annotations are stored.

If you want to train a model from an existing dataset you already have (e.g. from a different source), it might be easier to use spaCy directly. See the training docs and spacy train for details. Alternatively, you could import your data into a Prodigy dataset, or write your own custom version of the recipe that loads from a file or a different source.
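For the import route, the db-in command should get a JSONL file into a dataset in one step, e.g. `prodigy db-in my_dataset /path/to/annotations.jsonl`. If you’d rather script it, here’s a rough, untested sketch using the database API (the dataset name and file path are just placeholders):

```python
import json
from prodigy.components.db import connect

def import_jsonl(dataset_name, file_path):
    # Read the JSONL file: one annotation task dict per line
    with open(file_path, "r", encoding="utf8") as f:
        examples = [json.loads(line) for line in f if line.strip()]
    db = connect()  # uses the DB settings from your prodigy.json
    if dataset_name not in db.datasets:
        db.add_dataset(dataset_name)
    db.add_examples(examples, datasets=[dataset_name])
    print(f"Added {len(examples)} examples to '{dataset_name}'")

# placeholder names, swap in your own dataset and file
import_jsonl("my_dataset", "annotations.jsonl")
```

Once the data is in a dataset, the batch-train recipes can use it like any other annotations.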

The use case is consensus labeling for small-ish datasets: we spin up multiple Prodigy instances on the same input file (one per labeler) and then consolidate the results into a single data file to run batch-train on. spaCy training looks quite a bit more involved, but a custom recipe or feeding the data back into the DB are definitely solid options. Just wanted to make sure I’m not missing something obvious before doing that.

Ah yeah, that makes sense. It’s true that we currently don’t have a very smooth workflow for consolidating multiple datasets (that’s definitely something we want to add in the future). For now, I think the best solution would be a script that loads the datasets from the DB, merges them (possibly validates them according to your criteria?) and then adds them to a combined dataset.
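Something along these lines should work as a starting point. It’s an untested sketch using the database API, the dataset names are just placeholders, and the comment marks where your own consensus or validation logic would go:

```python
from prodigy.components.db import connect

def merge_datasets(source_names, combined_name):
    """Load several annotation datasets, merge them and save a combined set."""
    db = connect()  # uses the DB settings from your prodigy.json
    merged = []
    for name in source_names:
        examples = db.get_dataset(name)  # list of annotation task dicts
        # This is where you'd resolve disagreements between annotators or
        # apply your own validation criteria before keeping an example
        merged.extend(examples)
    if combined_name not in db.datasets:
        db.add_dataset(combined_name)
    db.add_examples(merged, datasets=[combined_name])
    print(f"Saved {len(merged)} examples to '{combined_name}'")

# placeholder dataset names, one per annotator
merge_datasets(["ner_annotator_a", "ner_annotator_b"], "ner_combined")
```

You could then point the batch-train recipe at the combined dataset as usual.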