Is it possible to run batch train on a file

From the documentation it seems that batch-train currently expects the dataset to be in the DB. Is there a way to train directly on a JSONL data file? Alternatively, it’s not super hard to dump the file to the DB as a dataset, but maybe there’s a way around the extra step…

The batch-train recipes are especially optimised for running quick experiments and for training with sparse annotations, i.e. binary annotations collected with recipes like ner.teach. That’s also why they work on datasets, since that’s usually where those annotations are stored.

If you want to train a model from an existing dataset you already have (e.g. from a different source), it might be easier to use spaCy directly. See the training docs and spacy train for details. Alternatively, you could import your data into a Prodigy dataset, or write your own custom version of the recipe that loads from a file or a different source.
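For the import route, the db-in command should get a JSONL file into a dataset in one step, e.g. `prodigy db-in my_dataset /path/to/annotations.jsonl`. If you’d rather script it, here’s a rough, untested sketch using the database API (the dataset name and file path are just placeholders):

```python
import json
from prodigy.components.db import connect

def import_jsonl(dataset_name, file_path):
    # Read the JSONL file: one annotation task dict per line
    with open(file_path, "r", encoding="utf8") as f:
        examples = [json.loads(line) for line in f if line.strip()]
    db = connect()  # uses the DB settings from your prodigy.json
    if dataset_name not in db.datasets:
        db.add_dataset(dataset_name)
    db.add_examples(examples, datasets=[dataset_name])
    print(f"Added {len(examples)} examples to '{dataset_name}'")

# placeholder names, swap in your own dataset and file
import_jsonl("my_dataset", "annotations.jsonl")
```

Once the data is in a dataset, the batch-train recipes can use it like any other annotations.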

The use case is consensus labeling for small-ish datasets: we spin up multiple Prodigy instances on the same input file (one per labeler) and then consolidate the results into a single data file to run batch-train on. spaCy training looks quite a bit more involved, but a custom recipe or feeding the data back into the DB are definitely solid options. Just wanted to make sure I’m not missing something obvious before doing that.

Ah yeah, that makes sense. It’s true that we currently don’t have a very smooth workflow for consolidating multiple datasets (that’s definitely something we want to add in the future). For now, I think the best solution would be a script that loads the datasets from the DB, merges them (possibly validates them according to your criteria?) and then adds them to a combined dataset.
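Something along these lines should work as a starting point. It’s an untested sketch using the database API, the dataset names are just placeholders, and the comment marks where your own consensus or validation logic would go:

```python
from prodigy.components.db import connect

def merge_datasets(source_names, combined_name):
    """Load several annotation datasets, merge them and save a combined set."""
    db = connect()  # uses the DB settings from your prodigy.json
    merged = []
    for name in source_names:
        examples = db.get_dataset(name)  # list of annotation task dicts
        # This is where you'd resolve disagreements between annotators or
        # apply your own validation criteria before keeping an example
        merged.extend(examples)
    if combined_name not in db.datasets:
        db.add_dataset(combined_name)
    db.add_examples(merged, datasets=[combined_name])
    print(f"Saved {len(merged)} examples to '{combined_name}'")

# placeholder dataset names, one per annotator
merge_datasets(["ner_annotator_a", "ner_annotator_b"], "ner_combined")
```

You could then point the batch-train recipe at the combined dataset as usual.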