Productionising Prodigy datasets

In our system we have reached the point where certain Prodigy datasets are stored in a MySQL database. These datasets can be converted to .spacy files and fed into the spacy train command, and we've attached MLflow logging, plus model logging/registration, to that workflow.

My question is whether there are any recommendations or guidelines for managing the Prodigy -> spaCy process more effectively, especially for registering .spacy files as data assets in Azure ML. Is there any guidance on making the creation/collection of datasets as seamless as possible within a production system?

In my mind, the ideal scenario would look something like this: an annotator reaches a point where they're happy with the Prodigy model outputs. They collect the datasets into a gold version. This gold dataset is then processed with data-to-spacy, and the resulting asset is stored and versioned within Azure ML. These versioned, tracked .spacy files are then used as the standard inputs for training, logging, and registering the resulting models.
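For concreteness, that flow could be sketched roughly as below. The dataset name (`gold_ner`), asset name, version, and Azure ML CLI v2 usage are all assumptions for illustration, not details from our actual setup:

```shell
# 1. Export the reviewed "gold" Prodigy dataset to a binary corpus.
#    "gold_ner" is a hypothetical dataset name; swap in your own
#    components (--ner, --textcat, etc.) as needed.
prodigy data-to-spacy ./corpus --ner gold_ner --eval-split 0.2

# 2. Register the exported corpus folder as a versioned data asset
#    in Azure ML (Azure ML CLI v2; names/versions are illustrative).
az ml data create \
    --name gold-ner-corpus \
    --version 1 \
    --type uri_folder \
    --path ./corpus

# 3. Train against the tracked corpus. Shown locally here for
#    simplicity; in an Azure ML job the paths would be resolved
#    from the registered data asset instead.
python -m spacy train ./corpus/config.cfg \
    --paths.train ./corpus/train.spacy \
    --paths.dev ./corpus/dev.spacy \
    --output ./model
```

Note that data-to-spacy also writes a config.cfg into the output directory, which is what step 3 reuses.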

I have looked around, and there seems to be very little online about orchestrating an entire Prodigy -> spaCy workflow within Azure ML (especially with respect to dataset management), so I would highly appreciate any pointers.

> Is there any guidance on making the creation/collection of datasets as seamless as possible within a production system?

The short answer is: not really, no. It makes sense to want this, but there are a lot of ML workflow technologies, and the ecosystem doesn't feel to us like it has converged on a consensus design or best practices. It feels more like different projects all want to do different things, depending on which technologies they're using and the context of their usage.

We therefore leave this step up to the user for now. We just try to make it easy and clear how to get the output from Prodigy and spaCy, and let users pass it over to other tools from there.
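For the Azure ML side specifically, one pattern for the handoff is to register the data-to-spacy output folder as a versioned data asset and have training jobs resolve it by name and version. This is only a sketch, assuming the v2 `azure-ai-ml` Python SDK; the workspace placeholders and asset names are illustrative:

```python
# Sketch only: assumes the Azure ML Python SDK v2 (azure-ai-ml) and that
# `prodigy data-to-spacy ./corpus ...` has already produced ./corpus.
from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Data
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",     # placeholders, not real values
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Register the exported corpus folder as a versioned data asset.
corpus_asset = Data(
    name="gold-ner-corpus",   # hypothetical asset name
    version="1",
    type=AssetTypes.URI_FOLDER,
    path="./corpus",          # folder containing train.spacy / dev.spacy
    description="Gold Prodigy annotations exported with data-to-spacy",
)
ml_client.data.create_or_update(corpus_asset)

# Later, a training job can resolve exactly the same version:
tracked = ml_client.data.get(name="gold-ner-corpus", version="1")
print(tracked.path)  # URI the spacy train job can mount or download
```

Pinning the version in the training job's inputs is what makes the .spacy files reproducible across retraining runs.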