Prodigy and DVC (Data Version Control)

Dear Support,

In my team we have started adopting DVC to track our ML data and models. I have read here that spaCy 3.0 will bring tighter integration between spaCy and DVC, so in the following I will assume you are familiar with DVC—if this is not the case, please tell me and I will try to clarify my question.

When I collect annotations with Prodigy, they all end up stored in the same prodigy.db.
My first idea was to dvc add prodigy.db. But whenever I collect new annotations in the future (which again land in prodigy.db), I will need to call dvc add prodigy.db again, and DVC will then store a new copy of the entire prodigy.db in its cache on every dvc add, which is inconvenient.

I then thought about using one database per set of annotations. But as far as I can tell, when I call prodigy train ner, the database location can only be specified in the prodigy.json file and cannot be passed as a command-line argument. That makes this solution incompatible with DVC.
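For reference, this is roughly how I understand the database location is configured in prodigy.json (the path here is just a placeholder for our setup):

```json
{
  "db": "sqlite",
  "db_settings": {
    "sqlite": {
      "name": "prodigy.db",
      "path": "/path/to/annotations"
    }
  }
}
```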

In conclusion: how would you suggest combining DVC with a prodigy.db that accumulates more and more annotation sets?

Thank you in advance for your kind help!

Hi! And yes, we've been using DVC for some of our stuff internally, and spaCy v3 will make it easy to generate a DVC pipeline from an end-to-end spaCy workflow and use DVC to manage your assets :slightly_smiling_face: It'd be cool to also be able to recommend a workflow/integration for Prodigy so if you end up experimenting with this, definitely keep us updated!

While it's always a good idea to keep a backup of the prodigy.db (obviously), it might make more sense to track your data at a more fine-grained level with DVC, e.g. per dataset. This also lets you get more out of the version control aspect and related features – for example, you could re-trigger training runs when the data has changed, and skip them if no new annotations were created.
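The skip-if-unchanged idea can be sketched in plain Python by hashing the exported JSONL file, which is essentially what DVC's pipelines do for you under the hood (the file names and the `mark_trained` bookkeeping here are just illustrative, not part of Prodigy or DVC):

```python
import hashlib
from pathlib import Path

def file_md5(path):
    """Return the MD5 hex digest of a file's contents (the same kind of
    content hash DVC records in its .dvc meta files)."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def needs_retraining(data_path, hash_path):
    """True if the exported data differs from what we last trained on."""
    current = file_md5(data_path)
    previous = Path(hash_path).read_text().strip() if Path(hash_path).exists() else None
    return current != previous

def mark_trained(data_path, hash_path):
    """Record the hash of the data a model was just trained on."""
    Path(hash_path).write_text(file_md5(data_path))
```

So a scheduled job could call `needs_retraining("annotations.jsonl", "last_hash.txt")` and only kick off training when it returns True.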

One way to implement this would be to have an automated process that calls into db-out (or the respective Python database API) and periodically exports your datasets to JSONL files – for example, into a remote storage bucket. You can then track all of those files with DVC, check the DVC meta files into a repo (e.g. alongside the data used for a specific project) and so on.
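As a minimal sketch of such an export step: the example dicts below are made up, and the Prodigy import shown in the comment is only indicative of where the real data would come from – DVC would then track the resulting file:

```python
import json
from pathlib import Path

def export_jsonl(examples, out_path):
    """Write annotation dicts to a JSONL file, one JSON object per line
    (the same line-per-example layout that db-out produces)."""
    lines = [json.dumps(eg, ensure_ascii=False) for eg in examples]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf8")

# Stand-in for what Prodigy's database API would return, e.g.:
# from prodigy.components.db import connect
# examples = connect().get_dataset("my_dataset")
examples = [
    {"text": "Apple is based in Cupertino",
     "spans": [{"start": 0, "end": 5, "label": "ORG"}], "answer": "accept"},
    {"text": "Nothing to see here", "spans": [], "answer": "reject"},
]
export_jsonl(examples, "my_dataset.jsonl")
```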

Dear Ines,

Thank you for your prompt reply. We are now trying to follow your suggestions and use DVC to track the JSONL files that are generated by prodigy db-out.

In a nutshell, our process consists of the following steps:

  1. use prodigy ner.manual to collect a new set of annotations, which are automatically stored in prodigy.db
  2. use prodigy db-out to export this new set of annotations as a JSONL file
  3. call dvc add new_annotations.jsonl to track the new version of the file
  4. whenever a model needs to be trained (e.g. with prodigy train ner), first call prodigy db-in to push the JSONL back into the db

This approach, however, rests on the fundamental assumption that db-in and db-out will never break backward compatibility and will always produce the same results. Can you guarantee that this is the case?

Thank you in advance!


Cool, thanks for sharing! :+1: Btw, do you need the extra db-in step, though, if you know that the state of the database always matches your file? It just seems like things could also go wrong here and you'll accumulate a lot of datasets (because every time you import, you'll have to create a new set).

If you're this serious about managing your workflows etc., you might also consider training with spaCy directly, which gives you much more control over the process and doesn't rely on the Prodigy database. (prodigy train is really just a wrapper around spaCy's training API.) You can use data-to-spacy to convert your Prodigy dataset to a spaCy training file – although, the exact output here may change as spaCy changes.

The db-out command is super lightweight and pretty much dumps the exact JSON stored in the database into a file. Datasets in Prodigy are append-only, so once an example is added, it doesn't change. So unless you delete and re-add a dataset under the same name, or modify the SQLite database manually, the output will always reflect whatever is in the dataset.