Automating daily training

jrfernandez · December 11, 2020, 3:53am

I would like to automate the training of my textcat model on a daily basis. I have a process to crowdsource the collection of training data and I want to automate re-training the model.

My first intuition is to execute the prodigy train command with a daily cron job. This assumes that a different scheduled job produces the updated training jsonl before that runs.

This seems to be a simple solution, but I'm curious what benefits I'd get if I instead wrote this as a Python script that calls spaCy directly.

ines · December 11, 2020, 10:35pm

Hi! This is a good question and I think if you're this "serious" about training your model, it probably makes the most sense to train with spaCy directly instead of using Prodigy's wrapper, which is mostly optimised for quick local experiments.

You probably also want to keep backups of the exact training corpus used for each run, so you can easily roll back to a previous version, or re-run last week's training to compare results etc. Also make sure to use a dedicated evaluation set, so you're evaluating on the same data every day and can actually compare the results in a meaningful way.
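As a rough illustration of the backup idea, here is a minimal sketch that copies the day's training file to a dated backup and computes a content hash you could log next to the model. The function names, the backup folder and the filename pattern are all hypothetical and not part of Prodigy or spaCy.

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path


def snapshot_corpus(source: Path, backup_dir: Path, day: date) -> Path:
    """Copy the training file to a dated backup and return the new path.

    The "<date>-<original name>" pattern is just an assumed convention.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / f"{day.isoformat()}-{source.name}"
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return target


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of the file contents.

    Logging this with each run records exactly which data produced a model.
    """
    return hashlib.sha256(path.read_bytes()).hexdigest()
```

Calling snapshot_corpus(Path("train.jsonl"), Path("backups"), date.today()) before each run would leave one file per day, so re-running last week's training only needs the matching backup. The evaluation file should stay fixed and be kept out of this rotation.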

So I'd suggest steps that look something like this:

1. Run prodigy data-to-spacy to export your annotations as spaCy training data (which makes it easy to use the current date/time as the filename and keeps a record of the exact data that was used to produce the model).
2. Run spacy train with your exported training data and your dedicated evaluation set.
3. Test your model, log your progress, compare it to the previous day etc.
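The export and training steps above could be scripted roughly as follows. This is only a sketch under assumptions: the dataset name, the folder layout and the "en" language code are hypothetical, and the command-line arguments follow the Prodigy / spaCy v2 CLIs as I understand them, so check prodigy data-to-spacy --help and spacy train --help for your installed versions before relying on them.

```python
import subprocess
import sys
from datetime import date
from pathlib import Path
from typing import List


def build_steps(day: date, dataset: str, eval_file: Path, root: Path) -> List[List[str]]:
    """Build the export and train commands for one dated run.

    The layout root/<date>/train.json and root/<date>/model is an assumed
    convention that keeps each day's data and model side by side.
    """
    run_dir = root / day.isoformat()
    train_file = run_dir / "train.json"
    model_dir = run_dir / "model"
    # Assumption: argument order and flags of the spaCy v2-era CLIs.
    export = [sys.executable, "-m", "prodigy", "data-to-spacy", str(train_file),
              "--lang", "en", "--textcat", dataset]
    train = [sys.executable, "-m", "spacy", "train", "en", str(model_dir),
             str(train_file), str(eval_file), "--pipeline", "textcat"]
    return [export, train]


def run_steps(steps: List[List[str]]) -> None:
    """Run the commands in order; check=True raises on the first failure."""
    for argv in steps:
        subprocess.run(argv, check=True)
```

A daily script would create the dated run folder, then call run_steps(build_steps(date.today(), "my_textcat", Path("eval.json"), Path("runs"))), where "my_textcat" and the paths are placeholders. Because the evaluation file is passed in rather than generated per run, the scores from different days stay comparable.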

Btw, if you haven't seen it yet, you might find the new spaCy projects feature that's coming to spaCy v3.0 useful, which was designed with exactly those use cases in mind:

Using the new project templates, you'll be able to define a series of commands to run in order (export data, train, evaluate, log results, visualize etc.). Commands only re-run if the inputs have changed, and you can easily sync state with a local or remote storage to back up your assets and artifacts. So all you'd have to do is configure a cron job (or similar) to execute spacy project run all once a day, and that's it.
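If you'd rather have cron call a small Python wrapper than the command itself, something like the sketch below could work. It assumes spaCy v3 is installed in the interpreter running the script and that the project's project.yml defines a workflow named all; the function name, the log format and the argv override are hypothetical additions for illustration.

```python
import subprocess
import sys
from datetime import datetime
from pathlib import Path
from typing import List, Optional


def run_daily(project_dir: Path, log_file: Path, argv: Optional[List[str]] = None) -> int:
    """Run the project workflow inside project_dir and append the outcome to a log.

    By default this executes "spacy project run all" with the current
    interpreter; argv can be overridden, e.g. for a dry run.
    """
    if argv is None:
        argv = [sys.executable, "-m", "spacy", "project", "run", "all"]
    result = subprocess.run(argv, cwd=project_dir, capture_output=True, text=True)
    stamp = datetime.now().isoformat(timespec="seconds")
    with log_file.open("a", encoding="utf8") as f:
        # One header line per run, followed by whatever the command printed.
        f.write(f"{stamp} exit={result.returncode}\n")
        f.write(result.stdout)
        f.write(result.stderr)
    return result.returncode
```

The cron entry would then point at a script that calls run_daily with your project folder and a log path. Returning the exit code rather than raising lets the calling script decide whether a failed run should trigger an alert.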

jrfernandez · December 11, 2020, 11:36pm

Ines, thank you for the detailed response. I wasn't aware of the v3.0 projects feature you mentioned. I will definitely go that route, and provide any feedback based on my experience.

ines · December 12, 2020, 1:16am

Sounds great! Just keep in mind that the current version of Prodigy uses the stable spaCy v2.x and isn't compatible with the spaCy v3.x nightly release. But you don't necessarily need to run Prodigy in the same environment – you could also have one env to export the data and then continue in the spaCy v3.x env.

jrfernandez · December 12, 2020, 5:42am

@ines, is there a nightly build of Prodigy that uses spaCy v3.x nightly?

ines · December 13, 2020, 10:33pm

Not yet, because there are various internals that need to change to support some of the cool new v3 features. But we'll be posting updates on this thread:
