How does prodigy data-to-spacy --eval-split do the split?

What is the strategy for creating the eval split? Is any type of stratification used?

Cheers,
Ivan

Hi @ivan, the --eval-split parameter performs a straightforward cut of the dataset based on the fraction you pass (usually 0.2). If you want a more complex split, it may be better to do it as a preprocessing step and just pass in the .spacy files with the split you want.
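If stratification is what you are after, a preprocessing step along these lines might be a starting point. It is only a rough sketch in plain Python, not what Prodigy does internally. It assumes each example is a dict with a single `"label"` key (a hypothetical layout, so adjust the `key` function to your data), and converting the two resulting lists into .spacy files is left out.

```python
import random
from collections import defaultdict


def stratified_split(examples, eval_fraction=0.2, seed=0, key=lambda eg: eg["label"]):
    """Split examples into (train, dev) so each label keeps roughly the same share.

    `key` extracts the stratification label from an example. The default
    assumes a hypothetical {"text": ..., "label": ...} layout.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group the examples by label.
    by_label = defaultdict(list)
    for eg in examples:
        by_label[key(eg)].append(eg)

    train, dev = [], []
    # Iterate labels in sorted order so the result does not depend on input order of labels.
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        n_eval = round(len(group) * eval_fraction)
        dev.extend(group[:n_eval])
        train.extend(group[n_eval:])
    return train, dev


examples = [{"text": f"a{i}", "label": "A"} for i in range(80)]
examples += [{"text": f"b{i}", "label": "B"} for i in range(20)]
train, dev = stratified_split(examples, eval_fraction=0.2)
```

With very small label groups the rounding can leave a label with zero evaluation examples, so it is worth checking the per-label counts afterwards.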


Makes sense. I have been a bit lazy, relying on the random split for every new labelling campaign, so I will need to get that under control. It is also slightly complicated to keep a constant validation set when the training set keeps changing with new annotations: I need a way of recording what is already in the validation set versus what I might want to add to it from the new annotations.
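One way to keep the validation set stable without maintaining a separate record is to assign each example to train or eval from a hash of its text, so the assignment never changes as new annotations arrive. The sketch below is one possible approach under that assumption, not something Prodigy provides. The helper names `is_eval` and `split_by_hash` and the `"text"` key are made up for illustration.

```python
import hashlib


def is_eval(text, eval_fraction=0.2):
    """Deterministically decide whether a text belongs to the eval set.

    The text is hashed to a number in [0, 1). The same text always lands in
    the same bucket, regardless of what else is in the dataset.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # first 32 bits scaled to [0, 1)
    return bucket < eval_fraction


def split_by_hash(examples, eval_fraction=0.2):
    """Split a list of {"text": ...} dicts (hypothetical layout) into (train, dev)."""
    train, dev = [], []
    for eg in examples:
        (dev if is_eval(eg["text"], eval_fraction) else train).append(eg)
    return train, dev


first_batch = [{"text": f"example {i}"} for i in range(200)]
second_batch = [{"text": f"new example {i}"} for i in range(100)]

train_1, dev_1 = split_by_hash(first_batch)
train_2, dev_2 = split_by_hash(first_batch + second_batch)
```

Every example that was in `dev_1` is still in `dev_2`, and nothing from `train_1` moves into eval. The trade-offs are that the eval share is only approximately `eval_fraction`, and the eval set still grows as new annotations hash into it. If it has to stay frozen, the hash rule can be applied to the first batch only and all later examples sent to training.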