Thinking about this some more, I think you've definitely hit an interesting edge case here. We should adjust the way we do the per-component eval split and first group examples together by input hash. It's perfectly fine to have multiple annotations on the same input – in fact, this is a common workflow if you annotate one label at a time and then want to group them all together in the same corpus. But if we do the split on the whole dataset, you can easily end up with an input hash that's present in both the training and evaluation data, which isn't ideal. (The debug data command currently only checks for identical texts, so it doesn't take different annotations into account. That means there might be cases where an example is in the training data with some label annotations and in the dev data with other annotations. This isn't great, but it's also not as bad as having an exact duplicate.)
The good news is, unless you have a lot of examples that appear in both the training and dev data, it's likely not going to make a huge difference during training. In the meantime, you can also filter out the duplicates yourself by loading the .spacy file from disk and creating a new DocBin without the examples that are also in the training data: https://spacy.io/api/docbin#from_disk
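Here's a rough sketch of what that could look like, assuming your corpus lives in ./corpus/train.spacy and ./corpus/dev.spacy (adjust the paths and the language for your setup). It matches duplicates by the raw text, which is the same thing debug data checks:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # only the vocab is needed here – swap in your language

# Load both corpora – DocBin.from_disk returns the DocBin itself
train_db = DocBin().from_disk("./corpus/train.spacy")
dev_db = DocBin().from_disk("./corpus/dev.spacy")

# Collect the raw texts of all training examples
train_texts = {doc.text for doc in train_db.get_docs(nlp.vocab)}

# Build a new dev corpus, skipping any doc whose text also occurs in training
filtered_db = DocBin(store_user_data=True)
n_dropped = 0
for doc in dev_db.get_docs(nlp.vocab):
    if doc.text in train_texts:
        n_dropped += 1
        continue
    filtered_db.add(doc)

filtered_db.to_disk("./corpus/dev_filtered.spacy")
print(f"Dropped {n_dropped} dev example(s) that also appear in the training data")
```

You can then point your config's dev corpus path at the filtered file. If you want to be stricter or looser about what counts as a duplicate, just adjust the condition in the loop.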