Prodigy and DVC (Data Version Control)

Dear Support,

In my team we have started adopting DVC to track our ML data and models. I have read here that spaCy 3.0 will bring tighter integration between spaCy and DVC, so in the following I will assume you are familiar with DVC; if that's not the case, please tell me and I will try to clarify my question.

When I collect annotations with Prodigy, they all end up stored in the same prodigy.db.
My first thought was to dvc add prodigy.db. But whenever I collect new annotations (which again go into prodigy.db), I have to call dvc add prodigy.db again, and DVC then stores a whole new copy of prodigy.db in its cache on every dvc add, which is inconvenient.

I then thought about using one database per set of annotations. But as far as I can tell, when I call prodigy train ner, the database location can only be specified in the prodigy.json file and cannot be passed as an argument to the command, which makes this approach hard to combine with DVC.

In conclusion: how would you suggest combining DVC with a prodigy.db that contains more and more annotation sets?

Thank you in advance for your kind help!

Hi! And yes, we've been using DVC for some of our stuff internally, and spaCy v3 will make it easy to generate a DVC pipeline from an end-to-end spaCy workflow and use DVC to manage your assets :slightly_smiling_face: It'd be cool to also be able to recommend a workflow/integration for Prodigy, so if you end up experimenting with this, definitely keep us updated!

While it's always a good idea to keep a backup of the prodigy.db (obviously), it might make more sense to track things at a more fine-grained level with DVC, e.g. by dataset. This also lets you get more out of the version control aspect and related features – for example, you could re-trigger training runs if the data has changed and skip them if no new annotations were created.

One way to implement this would be to have an automated process that calls into db-out (or the respective Python database API) and periodically exports your datasets to JSONL files – for example, into a remote storage bucket. You can then track all of those files with DVC, check meta files into a repo (e.g. the data used for a specific project) and so on.
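For example, a minimal export script using the Python database API could look something like this (just a sketch – the export folder is a placeholder and the API details may differ slightly between Prodigy versions, so double-check against the docs):

from pathlib import Path
import subprocess
import srsly
from prodigy.components.db import connect

db = connect()  # connects to the database configured in prodigy.json
Path("exports").mkdir(exist_ok=True)
for name in db.datasets:  # all dataset names in the database
    examples = db.get_dataset(name)  # list of annotation dicts, as produced by db-out
    out_file = f"exports/{name}.jsonl"
    srsly.write_jsonl(out_file, examples)
    subprocess.run(["dvc", "add", out_file], check=True)  # track the exported file with DVC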

Dear Ines,

Thank you for your prompt reply. We are now trying to follow your suggestions and use DVC to track the JSONL files that are generated by prodigy db-out.

In a nutshell, our process consists of the following steps:

  1. use prodigy ner.manual to collect a new set of annotations, which are automatically sent to prodigy.db
  2. use prodigy db-out to obtain the JSONL of this new set of annotations
  3. call dvc add new_annotations.jsonl to track the new version of the file
  4. whenever we need to train a model (e.g. with prodigy train ner), we first call prodigy db-in to push the JSONL back into the db

This approach, however, makes the fundamental assumption that db-in and db-out will never break backward compatibility and will always produce the same results. Can you guarantee that this is the case?

Thank you in advance!

—Francesco

Cool, thanks for sharing! :+1: Btw, do you need the extra db-in step, though, if you know that the state of the database always matches your file? It just seems like things could also go wrong here and you'll accumulate a lot of datasets (because every time you import, you'll have to create a new set).

If you're this serious about managing your workflows etc., you might also consider training with spaCy directly, which gives you much more control over the process and doesn't rely on the Prodigy database. (prodigy train is really just a wrapper around spaCy's training API.) You can use data-to-spacy to convert your Prodigy dataset to a spaCy training file – although the exact output here may change as spaCy changes.

The db-out command is super lightweight and pretty much dumps the exact JSON stored in the database into a file. Datasets in Prodigy are append-only, so once an example is added, it doesn't change. So unless you delete and re-add a dataset under the same name or modify the SQLite database manually, the output will always reflect whatever is in the dataset.
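For example, you can load the exported file back in and work with the records directly (the file name here is just a placeholder, and the simplified record in the comment omits Prodigy's internal hashes):

import srsly

# each record is a plain dict, roughly like:
# {"text": "Apple is looking at buying a U.K. startup",
#  "spans": [{"start": 0, "end": 5, "label": "ORG"}],
#  "answer": "accept"}
examples = list(srsly.read_jsonl("exports/ner_dataset.jsonl"))
accepted = [eg for eg in examples if eg.get("answer") == "accept"]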


Hi! I'm also trying to get a better feel for which option would work better over time:

  1. The db-out JSONL is stored in DVC. This requires calling data-to-spacy each time the data is used for training, resulting in a dependency on the Prodigy version.
  2. The data-to-spacy output is stored in DVC. This removes the dependency on the Prodigy version for training a spaCy model. But updating annotations in Prodigy would require a command that converts the DocBin back into Prodigy JSON, which might also need custom user data (answers, etc.) to be added to the DocBin by data-to-spacy.

Option 1 is ready to try right now. Option 2 requires additional changes. In both options I'm treating the Prodigy database as local data on my dev machine that can be recreated from the data stored in DVC. That might be a wrong idea :slight_smile:

What do you think?
Thanks!

The problem here is that you couldn't easily reconstruct the original annotation examples just from the combined data-to-spacy output. Keep in mind that data-to-spacy doesn't only reformat – it also merges annotations on the same text, even across different types (NER, text classification), so you end up with one example per unique text. So I do think treating data-to-spacy as a preprocessing step in DVC would be a better solution.

Alternatively and depending on the complexity of your annotations, you could also implement your own conversion script, which could run as a step of your DVC pipeline and would be independent of Prodigy. In spaCy v3, creating and converting training data programmatically is much easier, because it's all Doc objects. For example, if your annotations are text classification annotations created with the choice UI, your conversion could look like this:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # a blank pipeline is enough for tokenization here
docbin = DocBin()

for eg in examples:  # the annotation dicts, e.g. loaded from the db-out JSONL
    if eg["answer"] == "accept":
        labels = [option["id"] for option in eg["options"]]  # all options shown in the UI
        doc = nlp.make_doc(eg["text"])
        doc.cats = {label: label in eg["accept"] for label in labels}  # selected option IDs
        docbin.add(doc)
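You can then save the result with docbin.to_disk (e.g. into a corpus directory) and point spaCy v3's training at those files. And if you want DVC to re-run the conversion and training whenever the exported annotations change, a (very simplified) dvc.yaml could look something like this – all script and file names here are placeholders, and the conversion script is assumed to also split off a dev set:

stages:
  convert:
    cmd: python convert_annotations.py exports/textcat_dataset.jsonl corpus
    deps:
      - exports/textcat_dataset.jsonl
      - convert_annotations.py
    outs:
      - corpus
  train:
    cmd: python -m spacy train config.cfg --output training --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy
    deps:
      - corpus
      - config.cfg
    outs:
      - training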

Thank you!
I really overlooked that data-to-spacy merges annotations.
Also, I like the idea from your answer that the db-out output is not a black box, but just JSON that can be converted with the spaCy v3 API.