In our system we have reached the point where we have several Prodigy datasets stored in a MySQL database. These datasets can be converted to `.spacy` files and fed into the `spacy train` command, and we've attached MLflow logging and model logging/registration to that workflow.
My question is whether there are any recommendations or guidelines for managing the Prodigy -> spaCy process more effectively, especially for registering `.spacy` files as data assets in Azure ML. Is there any guidance on making the creation/collection of datasets as seamless as possible within a production system?
In my mind, the ideal scenario would look something like this: an annotator reaches a point where they're happy with the Prodigy model outputs. They collect the datasets into a gold version. This gold dataset is then processed with `data-to-spacy`, and the resulting asset is stored and versioned within Azure ML. These versioned and tracked `.spacy` files are then used as the standard inputs to train/log/register the resulting models.
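To make the question concrete, here is a rough sketch of the kind of pipeline I'm imagining, using the Azure ML CLI v2. The dataset, asset, and file names are placeholders for illustration, and I'm assuming an NER-style corpus:

```shell
# Export the reviewed "gold" Prodigy dataset to a corpus of .spacy files
# ("gold_ner_dataset" is a placeholder dataset name)
prodigy data-to-spacy ./corpus --ner gold_ner_dataset --eval-split 0.2

# Register and version the resulting corpus as a data asset in Azure ML
az ml data create \
    --name ner-gold-corpus \
    --version 1 \
    --type uri_folder \
    --path ./corpus

# Training jobs then consume the versioned corpus as standard
python -m spacy train ./corpus/config.cfg \
    --paths.train ./corpus/train.spacy \
    --paths.dev ./corpus/dev.spacy
```

The part I'm least sure about is the middle step: whether registering the `data-to-spacy` output folder as a `uri_folder` asset like this is the intended pattern, or whether there's a better-supported way to tie dataset versions to the resulting model registrations.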
I have looked around, and there seems to be very little online about orchestrating an entire Prodigy -> spaCy workflow within Azure ML (especially with respect to dataset management), so I would really appreciate any pointers.