Validation within Prodigy (Cross Validation)

Hello Prodigy-Team,

I am currently training a binary classification model. I used 20% of the data for validation and 80% for training (eval-split 0.2). The evaluation results already look quite good. Now I'd like to ask whether cross-validation is possible within Prodigy, so that Prodigy doesn't always evaluate on the same data when I set the eval-split to 20%?

  1. To get an even better model, I first looked at which eval-split gives me the best model. Do you think this approach makes sense?

  2. After determining the best split, I then want to experiment with the parameters "batch-size" and "n-iter" to further improve the model. Once I have the best model, I want to export it and test it in a Python environment on new datasets (data that Prodigy hasn't seen, to check whether the model is overfitted). Does this make sense in your eyes?

  3. When testing the exported model on the new data, I have to set a score threshold so that the program knows at what score a record should be considered relevant. How do I determine such a threshold? And how does Prodigy set this threshold during validation?

Thanks in advance!!!
Best regards
Nadine

Hi @NadineB,

We don't have cross-validation as a default recipe, as we usually find it's less useful than keeping a stable evaluation set. You can always split up the data yourself if you need to run it.
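As a rough sketch of what "splitting the data yourself" with a stable evaluation set could look like: the snippet below assigns each example to train or evaluation by hashing its text, so the same example always lands on the same side, even after more annotations are added. It assumes the examples are dicts with a `"text"` key; the helper names `is_eval_example` and `split_examples` and the sample data are made up for illustration and are not part of Prodigy.

```python
import hashlib


def is_eval_example(text: str, eval_fraction: float = 0.2) -> bool:
    """Deterministically decide whether a text belongs to the evaluation set."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return bucket < eval_fraction


def split_examples(examples, eval_fraction=0.2):
    """Split a list of example dicts into (train, evaluation) lists."""
    train, evaluation = [], []
    for eg in examples:
        if is_eval_example(eg["text"], eval_fraction):
            evaluation.append(eg)
        else:
            train.append(eg)
    return train, evaluation


# Hypothetical examples standing in for exported annotations.
examples = [{"text": f"example text {i}", "answer": "accept"} for i in range(1000)]
train, evaluation = split_examples(examples)
print(len(train), "train examples,", len(evaluation), "evaluation examples")
```

Because the assignment depends only on the text, the split is only approximately 80/20, but it never changes between runs.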

I do see a problem with searching for the best eval-split. If you're looking at the different splits, you're changing both the training data and the evaluation data. So you could just as easily be searching for which split happens to have the easiest examples in its evaluation set.

It can make sense to tune the batch size and number of iterations. But you should take care when doing this on a small dataset. There's a lot of random variation, so you might not arrive at a reliable improvement; you might just happen to improve on the few examples you're evaluating against.
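To get a feel for how much of that variation comes from the size of the evaluation set alone, here is a back-of-the-envelope sketch using the binomial standard error of an accuracy estimate, sqrt(p * (1 - p) / n). It treats the evaluation examples as independent and ignores the extra randomness from initialisation and data order, so the real noise is likely larger. The accuracy of 0.85 and the set sizes are made-up numbers for illustration.

```python
import math


def accuracy_standard_error(accuracy: float, n_eval: int) -> float:
    """Binomial standard error of an accuracy measured on n_eval examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_eval)


# Hypothetical accuracy and evaluation-set sizes.
for n_eval in (100, 400, 1600):
    se = accuracy_standard_error(0.85, n_eval)
    # 1.96 standard errors gives an approximate 95% interval.
    print(f"n_eval={n_eval}: accuracy 0.85 +/- {1.96 * se:.3f}")
```

If two settings differ by less than this margin, the evaluation set is probably too small to tell them apart.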

Typically you would adjust the threshold based on whether you care more about false positives, or false negatives. If you care about them equally, a threshold of 0.5 seems fine.
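One way to make that trade-off concrete is to sweep a few thresholds over the scores on your held-out data and look at precision and recall for each. This is only a sketch: the scores and gold labels below are invented, and `precision_recall` is a hypothetical helper, not a Prodigy or spaCy function.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    # Convention used here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up model scores and gold labels (1 = relevant, 0 = not relevant).
scores = [0.95, 0.80, 0.65, 0.55, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

A higher threshold trades recall for precision (fewer false positives, more false negatives), and a lower one does the opposite. Pick the threshold on held-out data, not on the data the model was trained on.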

Thanks for the response!

Can you explain to me how I can split up the data myself? A quick example of how it works would be fantastic! :slight_smile:

Makes sense! But Prodigy also doesn't know whether I care more about false positives or false negatives. So does Prodigy use a threshold of 0.5 when evaluating on the validation set?

Thanks in advance!

And I've got one more question, @honnibal! After running the cross-validation and finding the best number of epochs and the best batch size, I want to train a model on the whole dataset with those parameter values. So here I have to use 0% as the eval-split so that the model uses all the data for training. But since there is no evaluation data, how can Prodigy choose the "best model"? Or does Prodigy just take the model from the last epoch?

Thanks in advance !!

You might find the data splitting functions in scikit-learn helpful: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html . They also have a lot of other utilities that might help your experiments.
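For a quick idea of what a k-fold split does, here is a minimal sketch written with only the standard library, so it runs without extra dependencies. scikit-learn's `KFold` (for example with `n_splits=5, shuffle=True`) produces the same kind of partition and is the more convenient choice in practice. The function name `k_fold_splits` and the sample examples are made up for illustration.

```python
import random


def k_fold_splits(examples, n_splits=5, seed=0):
    """Yield (train, evaluation) lists; each example is held out exactly once."""
    indices = list(range(len(examples)))
    # Shuffle with a fixed seed so the folds are reproducible.
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for held_out in folds:
        held_out_set = set(held_out)
        train = [examples[i] for i in indices if i not in held_out_set]
        evaluation = [examples[i] for i in held_out]
        yield train, evaluation


# Hypothetical annotated examples.
examples = [
    {"text": f"example text {i}", "answer": "accept" if i % 2 else "reject"}
    for i in range(50)
]

splits = list(k_fold_splits(examples))
for fold, (train, evaluation) in enumerate(splits):
    print(f"fold {fold}: {len(train)} train, {len(evaluation)} evaluation")
```

Each fold's train and evaluation lists can then be saved separately and used for one training run, and the scores averaged across folds.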

Yes, that's correct: the evaluation uses a threshold of 0.5.

Some people do this process of retraining on the whole dataset, so there are definitely people who'll advocate for that workflow. I'm in the other camp: I think it's really not a good idea, for the reason you mentioned. Without development data there's no way to choose between different models. You're also really vulnerable to something going wrong. Neural networks are a bit random, especially on small datasets: sometimes you get an unlucky initialisation or data order, and the model doesn't converge to a good solution. With no development data, you're running blind. You could get unlucky and your accuracy could have cratered on your last training, the one you're about to ship to production, and you'd never know.

So I say: just don't do that.