Thinking about this some more, I think you've definitely hit an interesting edge case here. We should adjust the way we do the per-component eval split and first group examples together by input hash. It's perfectly fine to have multiple annotations on the same input – in fact, this is a common workflow if you annotate one label at a time and then want to group them all together in the same corpus. But if we do the split on the whole dataset, you can easily end up with an input hash that's present in both the training and evaluation data, which isn't ideal. (The debug data command currently only checks for identical texts, so it doesn't take different annotations into account. That means there might be cases where an example is in the training data with some label annotations and in the dev data with other annotations. This isn't great, but it's also not as bad as having an exact duplicate.)
The good news is, unless you have a lot of examples that appear in both the training and dev data, it's likely not going to make a huge difference during training. In the meantime, you can also filter out the duplicates yourself by loading the .spacy file from disk and creating a new DocBin without the examples that are also in the training data: https://spacy.io/api/docbin#from_disk
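Here's a rough sketch of what that could look like, assuming your corpus lives in ./corpus/train.spacy and ./corpus/dev.spacy (adjust the paths and the language for your setup). It matches duplicates by the raw text, which is the same thing debug data checks:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # only the vocab is needed here – swap in your language

# Load both corpora – DocBin.from_disk returns the DocBin itself
train_db = DocBin().from_disk("./corpus/train.spacy")
dev_db = DocBin().from_disk("./corpus/dev.spacy")

# Collect the raw texts of all training examples
train_texts = {doc.text for doc in train_db.get_docs(nlp.vocab)}

# Build a new dev corpus, skipping any doc whose text also occurs in training
filtered_db = DocBin(store_user_data=True)
n_dropped = 0
for doc in dev_db.get_docs(nlp.vocab):
    if doc.text in train_texts:
        n_dropped += 1
        continue
    filtered_db.add(doc)

filtered_db.to_disk("./corpus/dev_filtered.spacy")
print(f"Dropped {n_dropped} dev example(s) that also appear in the training data")
```

You can then point your config's dev corpus path at the filtered file. If you want to be stricter or looser about what counts as a duplicate, just adjust the condition in the loop.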