How to declare and use a validation set in ner.train

I am trying to train a model as follows:

!python -m prodigy train ./tmp_model --ner food_annotations --base-model en_core_web_lg --eval-split 0.10 --config config.cfg

I would like a 60/30/10 split for train/validation/evaluation: train the model on 60% of the data, validate on 30% to check how well it generalizes, and then evaluate the final model on the remaining 10%. Is there any way to define a validation set in this recipe? If not, is there another recipe I can use for this, and how would I do that? I have looked through the documentation and other support questions but haven't found an answer. I may not have phrased my search well enough to find the right earlier question, so I thought I would ask.

Thank you for your help!

Hi @saad.moosa 🙂

You can hold out part of the data via the --eval-split setting. With --eval-split 0.10, Prodigy should automatically set aside 10% of the examples for evaluation and train on the rest.

But if you also want to split off a separate portion for testing/validation and are serious about training, you will probably need separate datasets. You can export the data, split it whichever way you like, and re-import the parts as new datasets.
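Here is a minimal sketch of the export-and-split step, assuming the annotations were first exported to a JSONL file (for example with `prodigy db-out food_annotations > food_annotations.jsonl`). The file names, the `split_examples` helper, and the 60/30/10 ratios are illustrative assumptions, not something Prodigy provides:

```python
import json
import random
from pathlib import Path


def split_examples(examples, fractions=(0.6, 0.3, 0.1), seed=0):
    """Shuffle examples and cut them into train/validation/test lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(examples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_valid = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder absorbs any rounding
    return train, valid, test


def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    with Path(path).open(encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


def write_jsonl(path, examples):
    """Write one JSON object per line."""
    with Path(path).open("w", encoding="utf8") as f:
        for eg in examples:
            f.write(json.dumps(eg) + "\n")


if __name__ == "__main__":
    source = Path("food_annotations.jsonl")  # hypothetical export file
    if source.exists():
        train, valid, test = split_examples(read_jsonl(source))
        write_jsonl("food_train.jsonl", train)
        write_jsonl("food_valid.jsonl", valid)
        write_jsonl("food_test.jsonl", test)
```

Each output file can then be loaded into its own dataset with `prodigy db-in`, e.g. `prodigy db-in food_train food_train.jsonl`. As far as I know, `prodigy train` also accepts a dedicated evaluation dataset in the form `--ner food_train,eval:food_valid`, which would replace `--eval-split`; it is worth checking the docs for your Prodigy version. The third dataset stays untouched until the final evaluation.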


I still have a question: can we do cross-validation with Prodigy (especially for span categorization)?

Hi @myeghaneh, spaCy v3 (which Prodigy calls under the hood) doesn't have a built-in cross-validation scheme. Ideally, you'd want a true random sample of your data to test on.
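If you still want to try it by hand, the folds can be built outside of Prodigy. This is only a rough sketch of a manual k-fold split over exported examples; `k_fold_splits` is a hypothetical helper, not a Prodigy or spaCy function, and each fold would still need its own training run:

```python
import random


def k_fold_splits(examples, k=5, seed=0):
    """Yield (train, held_out) pairs, one pair per fold."""
    if k < 2 or k > len(examples):
        raise ValueError("k must be between 2 and the number of examples")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled examples into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [eg for j, fold in enumerate(folds) if j != i for eg in fold]
        yield train, held_out
```

Every example lands in exactly one held-out fold. You could write each pair to disk, train once per fold, and average the scores yourself.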
