My team has started adopting DVC to track our ML data and models. I have read here that, starting with spaCy 3.0, there will be tighter integration between spaCy and DVC, so in the following I will assume you guys are familiar with DVC. If this is not the case, please tell me and I will try to clarify my question.
When I collect annotations with Prodigy, they all end up stored in the same prodigy.db file. Now I would run dvc add prodigy.db, but if in the future I collect new annotations (which are again stored in prodigy.db), then I will need to call dvc add prodigy.db again. DVC will then store a new version of the whole prodigy.db in the cache on every dvc add, which is inconvenient.
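To make the concern concrete, here is a small Python sketch of what I understand to be happening. It is not DVC's actual code: toy_cache_add is a hypothetical helper that mimics a content-addressed cache (each file is stored under the hash of its contents), and the byte strings just stand in for prodigy.db before and after a new annotation session.

```python
import hashlib


def toy_cache_add(cache: dict, content: bytes) -> str:
    """Store the whole content under its MD5 digest (toy stand-in for a DVC-like cache)."""
    digest = hashlib.md5(content).hexdigest()
    cache[digest] = content  # the full file is stored, keyed by its hash
    return digest


cache = {}

# Hypothetical contents of prodigy.db before and after a new annotation session.
db_v1 = b"annotations: set A\n"
db_v2 = db_v1 + b"annotations: set B\n"

h1 = toy_cache_add(cache, db_v1)  # first "dvc add"
h2 = toy_cache_add(cache, db_v2)  # second "dvc add" after new annotations

# Any change to the file gives a different hash, so the cache keeps
# two full copies even though v2 only appends to v1.
cached_bytes = sum(len(c) for c in cache.values())
print(len(cache), cached_bytes)
```

If this picture is right, the cache grows by roughly the full size of the database every time I add it again, even when only a few annotations are new.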
I then thought about using one db per set of annotations. But when I call prodigy train ner, I think it is only possible to specify the db location in the prodigy.json file, and not to pass it as an argument to the command, which makes this solution incompatible with DVC.
In conclusion: how would you suggest combining DVC with a prodigy.db that contains more and more annotation sets?
Thank you in advance for your kind help!