What is the strategy for creating the eval split? Is any type of stratification used?
Cheers,
Ivan
Hi @ivan, the --eval-split parameter performs a straightforward cut of the dataset based on the percentage you pass (usually 0.2). No stratification is applied. If you want a more complex split, it may be better to do it as a preprocessing step and just pass in the .spacy files with the split you want.
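As a rough illustration of what such a preprocessing step could look like, here is a minimal sketch of a stratified split in plain Python. It assumes each example is a dict with a "text" and a single "label" key, which is only an assumption for the sketch; the function name `stratified_split` is hypothetical and not part of Prodigy or spaCy.

```python
import random
from collections import defaultdict


def stratified_split(examples, eval_fraction=0.2, seed=0, key=lambda ex: ex["label"]):
    """Split examples into (train, eval), keeping label proportions roughly equal.

    Assumes every example exposes one stratification key (here: ex["label"]).
    """
    # Group the examples by their stratification key.
    by_label = defaultdict(list)
    for ex in examples:
        by_label[key(ex)].append(ex)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, eval_ = [], []
    for label in sorted(by_label):  # sorted so the iteration order is deterministic
        group = by_label[label]
        rng.shuffle(group)
        # Take the same fraction from every label group.
        n_eval = round(len(group) * eval_fraction)
        eval_.extend(group[:n_eval])
        train.extend(group[n_eval:])
    return train, eval_
```

The two resulting lists can then be exported to separate .spacy files with whatever conversion you already use, and passed in as the training and evaluation data instead of relying on --eval-split.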
Makes sense. I have been a bit lazy, relying on the random split for every new labelling campaign, so I will need to get that under control. It is also slightly complicated to keep a constant validation set when the training set is constantly changing with new annotations: I need a way of recording what is already in the validation set versus what I might want to add to it based on the new annotations.
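One possible way to keep the validation set stable as annotations grow is to decide membership from a hash of each example's text instead of a random shuffle. This is a minimal sketch of that idea, not something Prodigy does for you: the helper names `in_eval_set` and `split_by_hash` are hypothetical, and it assumes each example is a dict with a "text" key.

```python
import hashlib


def in_eval_set(text, eval_fraction=0.2):
    """Return True if this text belongs to the eval set.

    The decision depends only on the text itself, so it never changes
    between runs or when other examples are added to the dataset.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < eval_fraction


def split_by_hash(examples, eval_fraction=0.2):
    """Split examples into (train, eval) using the hash of ex["text"]."""
    train, eval_ = [], []
    for ex in examples:
        target = eval_ if in_eval_set(ex["text"], eval_fraction) else train
        target.append(ex)
    return train, eval_
```

With this approach, an example that landed in the validation set after one labelling campaign stays there after the next, and new annotations are assigned on their own without any bookkeeping file. The trade-off is that the split is only approximately the requested fraction and is not stratified by label.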