Handling train / dev / test in Prodigy


My understanding is that when I call data-to-spacy, Prodigy puts a random sample of the data into the dev set, so if I run the recipe twice, I might end up validating against a different dev set.
During development, while I'm still figuring out how best to prepare my data, that's fine, but I'd like to have a static test set to validate my "final" model.

Right now, my idea is to just keep a separate dataset and export it with -es 0, so I have one set of data that I will only use for final validation.

Does that make sense, or is there a better way to do this with Prodigy that I'm missing?
Or maybe I'm wrong about how data-to-spacy works?

Any pointers would be much appreciated!


Your understanding is correct, and your analysis & solution sound very reasonable to me as well. The automatic (random) split is fine for quick experimentation, but for larger experiments I'd always recommend keeping a hold-out dataset, as you said.

You can indeed make sure that everything ends up in the training dataset by setting -es 0, or you can specify another dataset as the dev set with eval:other_dataset, so you'd have something like --ner train_dataset,eval:dev_dataset. Alternatively, you can just export with data-to-spacy and then adjust the train and dev paths however you like.
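To make the two options concrete, here's a sketch of what the commands might look like. The dataset names (ner_train, ner_dev) and the output directory are placeholders for your own setup:

```shell
# Option 1: put everything into the training corpus (no random split),
# keeping your hold-out dataset entirely outside of data-to-spacy
prodigy data-to-spacy ./corpus --ner ner_train --eval-split 0

# Option 2: use a dedicated Prodigy dataset as the dev set
# via the eval: prefix, instead of a random split
prodigy data-to-spacy ./corpus --ner ner_train,eval:ner_dev
```

With option 1 you'd then evaluate your final model against the exported hold-out data yourself; with option 2 the dev corpus written to ./corpus is always the same dataset, so repeated runs validate against identical data.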

Let us know if you run into any issues with that!

One quick additional note: the above applies to the new train and data-to-spacy commands in v1.11 (currently available as a nightly), which let you train multiple components with separate training and evaluation datasets. In the latest stable version, you can provide the training and eval datasets via separate arguments instead.

Thanks everyone, that was very helpful, as always! :slight_smile: