In our system we have reached the point where we have several Prodigy datasets stored in a MySQL database. These datasets can be converted to `.spacy` files and fed into the `spacy train` command, and we've attached MLflow logging and model logging/registration to that workflow.
My question is whether there are any recommendations or guidelines for managing the Prodigy -> spaCy process more effectively, especially for registering `.spacy` files as data assets in Azure ML. Is there any guidance on making the creation/collection of datasets as seamless as possible within a production system?
In my mind, the ideal scenario would look something like this: an annotator reaches a point where they're happy with the Prodigy model outputs. They collect the datasets into a gold version. This gold dataset is then processed with `data-to-spacy`, and the resulting asset is stored and versioned within Azure ML. These versioned and tracked `.spacy` files are then used as the standard inputs to train/log/register the resulting models.
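To make the question concrete, here is a rough sketch of the kind of pipeline I'm imagining, using the Azure ML CLI v2. The dataset, asset, and file names are placeholders for illustration, and I'm assuming an NER-style corpus:

```shell
# Export the reviewed "gold" Prodigy dataset to a corpus of .spacy files
# ("gold_ner_dataset" is a placeholder dataset name)
prodigy data-to-spacy ./corpus --ner gold_ner_dataset --eval-split 0.2

# Register and version the resulting corpus as a data asset in Azure ML
az ml data create \
    --name ner-gold-corpus \
    --version 1 \
    --type uri_folder \
    --path ./corpus

# Training jobs then consume the versioned corpus as standard
python -m spacy train ./corpus/config.cfg \
    --paths.train ./corpus/train.spacy \
    --paths.dev ./corpus/dev.spacy
```

The part I'm least sure about is the middle step: whether registering the `data-to-spacy` output folder as a `uri_folder` asset like this is the intended pattern, or whether there's a better-supported way to tie dataset versions to the resulting model registrations.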
I have looked around, and there seems to be very little online about orchestrating an entire Prodigy -> spaCy workflow within Azure ML (especially with respect to dataset management), so I would really appreciate any pointers.