Hi @NIX411,
Your proposed workflow does make sense with a couple of caveats:
- re: step 3: Currently there is no separate built-in command for validating that there is no overlap between the datasets. As of the next Prodigy release, the improved `--eval-split` functionality will take care of this. For now, though, you'd have to validate it yourself. Please check this thread, which discusses this very issue.
- re: step 4: Whether downsampling the majority class (as you propose) is a good idea depends on how much data you have overall. If you don't have much, discarding examples might be considered wasteful. You might also want to look at other techniques, such as upsampling the minority class or a combination of the two. I'd recommend experimenting with a few techniques for dealing with unbalanced text datasets to see what works best in your scenario.
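
To make both points a bit more concrete, here is a minimal sketch in plain Python (not a Prodigy feature). It assumes you have exported your datasets to JSONL, e.g. with `prodigy db-out`, and that each example has a `"text"` key. The helper names (`read_jsonl`, `text_key`, `find_overlap`, `upsample`) are made up for illustration, and comparing normalised text is only one possible definition of "overlap", so adjust it to your data.

```python
import hashlib
import json
import random
from collections import defaultdict
from pathlib import Path


def read_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with Path(path).open(encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


def text_key(example):
    """Hash the whitespace-normalised, lowercased text of an example.

    Assumption: two examples count as the same if their text matches
    after this normalisation.
    """
    text = " ".join(example["text"].split()).lower()
    return hashlib.sha1(text.encode("utf8")).hexdigest()


def find_overlap(train, dev):
    """Return the dev examples whose text also occurs in train."""
    train_keys = {text_key(eg) for eg in train}
    return [eg for eg in dev if text_key(eg) in train_keys]


def upsample(examples, get_label, seed=0):
    """Naive random upsampling: duplicate randomly chosen examples of
    the smaller classes until every class matches the largest one.

    get_label is a caller-supplied function mapping an example to its
    class. Apply this to the training split only, never to the dev split.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for eg in examples:
        by_label[get_label(eg)].append(eg)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap (k may be 0).
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced
```

You would load both files with `read_jsonl`, drop whatever `find_overlap(train, dev)` returns from one of the two sets, and only then call something like `upsample(train, lambda eg: eg["label"])`. The `"label"` key there is an assumption about your task format. Doing the overlap check before resampling matters, because duplicating examples first and splitting afterwards would leak copies into the evaluation set.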
And about the bonus question: Once you are done with the annotation and have exported your data using `data-to-spacy`, you no longer depend on Prodigy for the development of your spaCy project. Doing the conversion the other way round, i.e. from spaCy, would make spaCy depend on Prodigy, which is something we really do not want to introduce.
Hope that helps!