data-to-spacy eval-split doesn't seem to have any effect

I'm using the data-to-spacy recipe to convert some datasets into a format for training with spaCy. The --eval-split argument doesn't seem to work, though. No matter what I specify, I'm getting an even 50/50 split. Is there some trick to getting a different split?

Here's the command I'm running and its output.

prodigy data-to-spacy \
--ner SupplierCatalog_10000-aa-0_0_1,SupplierCatalog_10000-ab-0_0_1,SupplierCatalog_10000-ac-0_0_1,SupplierCatalog_10000-ad-0_0_1,SupplierCatalog_10000-ae-0_0_1 \
--eval-split 0.2 \
SupplierCatalog_10000-0_0_1-train.json \
SupplierCatalog_10000-0_0_1-eval.json
ℹ Starting with language en
Created and merged data for 833 total examples

Type   Total   Merged
----   -----   ------
NER      850      833

Using 417 train / 416 eval (split 50%)
✔ Saved 416 examples to SupplierCatalog_10000-0_0_1-eval.json
✔ Saved 417 examples to SupplierCatalog_10000-0_0_1-train.json

Thanks for the report – looks like this is a bug and the value isn't passed through correctly :woman_facepalming: I've already fixed it and it'll be included in the next release. In the meantime, here's a quick patch:

Find the location of your Prodigy installation (prodigy stats will show you the path), open recipes/train.py and find the following line in the data_to_spacy function (should be line 327):

json_data, eval_data, split = train_eval_split(json_data)

Then change that to:

json_data, eval_data, split = train_eval_split(json_data, eval_split)
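
To illustrate why the missing argument gives a 50/50 split, here's a minimal sketch. It is not Prodigy's actual implementation: the function body, the `seed` parameter and the default of 0.5 are assumptions, chosen only because a 0.5 default is consistent with the "split 50%" in the output above. When the caller leaves out `eval_split`, the function silently falls back to its default, whatever was passed on the command line.

```python
import random


def train_eval_split(examples, eval_split=0.5, seed=0):
    # Hypothetical stand-in for the helper called in recipes/train.py.
    # The default of 0.5 is an assumption based on the reported output.
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_eval = int(len(examples) * eval_split)
    eval_data = examples[:n_eval]
    train_data = examples[n_eval:]
    return train_data, eval_data, eval_split


json_data = list(range(833))  # placeholder for the 833 merged examples

# Before the patch: eval_split is not passed, so the default is used.
train_default, eval_default, split_default = train_eval_split(json_data)
print(len(train_default), len(eval_default), split_default)

# After the patch: the value from --eval-split is passed through.
train_patched, eval_patched, split_patched = train_eval_split(json_data, 0.2)
print(len(train_patched), len(eval_patched), split_patched)
```

With 833 examples, the first call gives 417 train / 416 eval, and the second gives 667 train / 166 eval.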

That works. Thanks, @ines!