Help with postprocessing annotated data for training multicategory text classification model

Glad it's working!

Once you get to a point where you want to be more specific about how you sample your training and test data (and you're not just running quick experiments to see if you're on the right track), you might want to do that in a separate step, yes.

Prodigy's default --eval-split setting on the train recipes will just hold back a given percentage of the (shuffled) training examples. That's also how the data-to-spacy recipe does it, if you define a split. The --eval-id on the train recipe lets you pass in the name of a Prodigy dataset that should be used for evaluation. So in theory, you could also use that to provide your own custom evaluation set.
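If you do want to build that custom evaluation set yourself, here's a rough sketch of what the separate step could look like. It's not how Prodigy does its split internally, just one way to get a reproducible split that keeps the label distribution similar in both sets. It assumes your examples are plain dicts with an "accept" list of selected labels (as the choice interface produces), and the function name split_examples and the sample data are made up for illustration:

```python
import random
from collections import defaultdict


def split_examples(examples, eval_fraction=0.2, seed=0):
    """Split examples into train and eval lists, one label group at a time."""
    # Group by the accepted labels (assumption: stored under "accept"),
    # so every label combination is represented in both sets.
    groups = defaultdict(list)
    for eg in examples:
        key = tuple(sorted(eg.get("accept", [])))
        groups[key].append(eg)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, evaluation = [], []
    for key in sorted(groups):
        group = groups[key]
        rng.shuffle(group)
        n_eval = round(len(group) * eval_fraction)
        evaluation.extend(group[:n_eval])
        train.extend(group[n_eval:])
    return train, evaluation


# Made-up sample data: 10 examples per label.
examples = [
    {"text": f"example {i}", "accept": ["SPORTS" if i % 2 else "POLITICS"]}
    for i in range(20)
]
train, evaluation = split_examples(examples, eval_fraction=0.2, seed=0)
```

You could then export your annotations with db-out, run something like this over the JSONL, and load the two resulting files into separate datasets with db-in, using one for training and the other as the evaluation dataset.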
