I would like to automate the training of my textcat model on a daily basis. I have a process to crowdsource the collection of training data and I want to automate re-training the model.
My first intuition is to execute the prodigy train command with a daily cron job. This assumes that a different scheduled job produces the updated training JSONL before that runs.
This seems to be a simple solution, but I'm curious what benefits I'd get if I instead wrote this as a Python script that calls spaCy directly.
Hi! This is a good question and I think if you're this "serious" about training your model, it probably makes the most sense to train with spaCy directly instead of using Prodigy's wrapper, which is mostly optimised for quick local experiments.
You probably also want to keep backups of the exact training corpus to use for each run, so you can easily roll back to a previous version, or re-run last week's training to compare results etc. Also make sure to use a dedicated evaluation set, so you're evaluating on the same data every day and can actually compare the results in a meaningful way.
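As a rough illustration of the backup idea above, here is a minimal sketch that copies the day's training file into an archive folder under a dated name and computes a checksum, so each model can be traced back to the exact data it was trained on. The function names, the archive layout and the file naming scheme are assumptions made for this example, not something Prodigy or spaCy provide.

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path


def snapshot_corpus(source: Path, archive_dir: Path, run_date: date) -> Path:
    """Copy the training file into the archive under a dated name.

    Returns the path of the copy, e.g. archive/train-2020-01-02.jsonl.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / f"{source.stem}-{run_date.isoformat()}{source.suffix}"
    shutil.copy2(source, target)
    return target


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of the file contents.

    Logging this next to the model makes it easy to verify later
    which snapshot a given run was trained on.
    """
    return hashlib.sha256(path.read_bytes()).hexdigest()
```

The evaluation file would be kept outside this daily rotation, so the same fixed file is used for every run.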
So I'd suggest steps that look something like this:
Run prodigy data-to-spacy to export your annotations as spaCy training data (which makes it easy to use the current date/time as the filename and keeps a record of the exact data that was used to produce the model).
Run spacy train with your exported training data and your dedicated evaluation set.
Test your model, log your progress, compare it to the previous day etc.
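The first two steps above could be sketched as a small Python script that a cron job calls once a day. This is only a sketch under stated assumptions: the dataset name, the folder layout and the helper names are hypothetical, and the command-line arguments follow the Prodigy v1.10 and spaCy v2.x CLIs as best understood here, so they should be checked against prodigy data-to-spacy --help and spacy train --help for the installed versions.

```python
import subprocess
import sys
from datetime import date
from pathlib import Path


def build_commands(run_date: date, dataset: str, eval_path: Path, work_dir: Path):
    """Build the export and train commands for one dated run.

    Assumed argument layout (verify with --help for your versions):
      prodigy data-to-spacy OUTPUT --lang en --textcat DATASET
      spacy train LANG OUTPUT_DIR TRAIN_PATH DEV_PATH --pipeline textcat
    """
    stamp = run_date.isoformat()
    # The dated filename keeps a record of the data behind each model.
    train_path = work_dir / "corpus" / f"train-{stamp}.json"
    model_dir = work_dir / "models" / stamp
    export_cmd = [
        sys.executable, "-m", "prodigy", "data-to-spacy",
        str(train_path), "--lang", "en", "--textcat", dataset,
    ]
    train_cmd = [
        sys.executable, "-m", "spacy", "train", "en",
        str(model_dir), str(train_path), str(eval_path),
        "--pipeline", "textcat",
    ]
    return [export_cmd, train_cmd]


def run_daily(run_date: date, dataset: str, eval_path: Path, work_dir: Path) -> None:
    """Create the output folders, then run export and training in order."""
    (work_dir / "corpus").mkdir(parents=True, exist_ok=True)
    (work_dir / "models").mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(run_date, dataset, eval_path, work_dir):
        # check=True stops the run if a step fails, so training never
        # starts on a missing or partial export.
        subprocess.run(cmd, check=True)
```

A cron entry would then call something like run_daily(date.today(), "my_textcat", Path("eval.json"), Path("runs")), where "my_textcat" and the paths are placeholders.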
Btw, if you haven't seen it yet, you might find the new spaCy projects feature that's coming to spaCy v3.0 useful, which was designed with exactly those use cases in mind:
Using the new project templates, you'll be able to define a series of commands to run in order (export data, train, evaluate, log results, visualize etc.). Commands only re-run if the inputs have changed, and you can easily sync state with a local or remote storage to back up your assets and artifacts. So all you'd have to do is configure a cronjob (or similar) to execute spacy project run all once a day, and that's it.
Ines, thank you for the detailed response. I wasn't aware of the v3.0 projects feature you mentioned. I will definitely go that route, and provide any feedback based on my experience.
Sounds great! Just keep in mind that the current version of Prodigy uses the stable spaCy v2.x and isn't compatible with the spaCy v3.x nightly release. But you don't necessarily need to run Prodigy in the same environment – you could also have one env to export the data and then continue in the spaCy v3.x env.
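One way to picture the two-environment setup described above is to call each environment's own Python interpreter for its part of the work. The sketch below only builds the commands. The interpreter paths, the function name and the file names are hypothetical, and the spaCy v3 arguments (spacy convert to turn the exported JSON into the binary .spacy format, then spacy train with a config file and --paths overrides) reflect the v3 CLI as understood here, so please verify them against the version you install.

```python
from pathlib import Path

# Hypothetical interpreter locations for the two virtual environments.
PRODIGY_PYTHON = Path("envs/prodigy/bin/python")  # Prodigy + spaCy v2.x
SPACY3_PYTHON = Path("envs/spacy3/bin/python")    # spaCy v3.x


def commands_across_envs(dataset, export_path, corpus_dir, config_path, dev_path, output_dir):
    """Build export, convert and train commands, each bound to one env."""
    # Step 1 runs in the Prodigy env and writes spaCy v2-style JSON.
    export_cmd = [
        str(PRODIGY_PYTHON), "-m", "prodigy", "data-to-spacy",
        str(export_path), "--lang", "en", "--textcat", dataset,
    ]
    # Step 2 runs in the v3 env and converts the JSON to a .spacy file.
    convert_cmd = [
        str(SPACY3_PYTHON), "-m", "spacy", "convert",
        str(export_path), str(corpus_dir),
    ]
    # Assumption: convert names its output after the input file's stem.
    train_file = corpus_dir / f"{export_path.stem}.spacy"
    # Step 3 trains in the v3 env from a config file.
    train_cmd = [
        str(SPACY3_PYTHON), "-m", "spacy", "train", str(config_path),
        "--output", str(output_dir),
        "--paths.train", str(train_file),
        "--paths.dev", str(dev_path),
    ]
    return [export_cmd, convert_cmd, train_cmd]
```

Each command list can be passed to subprocess.run(cmd, check=True) in order, so the export never has to share an environment with the v3 training step.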
Not yet, because there are various internals that need to change to support some of the cool new v3 features. But we'll be posting updates on this thread: