data-to-spacy training examples also in evaluation data

Hi, I ran into an issue when using the data-to-spacy recipe.

After data-to-spacy, when I run spacy debug data I get the warning: ⚠ 411 training examples also in evaluation data.

Can this be caused by duplicate _input_hash in the Prodigy database?

I've already made sure not to have duplicate _task_hash in the database.

And could data-to-spacy be made to check for this and deduplicate examples before export?

I'm using Prodigy 1.11.6 and spaCy 3.2.0.

Hi! This is definitely very strange – can you share some more details on how you collected the data in Prodigy and how you configured data-to-spacy? Do you let Prodigy split the data into training/evaluation sets, or do you use separate datasets? Are you merging annotations for different tasks?

The data-to-spacy command will use the _input_hash to merge all annotations on the same input, so in theory, it should be impossible to end up with duplicates :thinking: The only possible explanation I could think of is that there might be a problem with how annotations from different dataset types are merged... Prodigy uses the same eval split for all the different components (so you end up with 20% of NER and 20% of textcat examples, and it's not possible to end up with no examples of a given component in the eval data). So maybe example A ends up in the training data for component 1 and in the evaluation data for component 2. We'll definitely investigate!
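Just to make that scenario concrete, here's a toy sketch – not Prodigy's actual implementation, and all the names are made up for illustration. If each component's examples are shuffled and split independently, the same text can land in the training portion of one component and the evaluation portion of another:

import random

# Toy illustration: the same 10 texts, annotated once for NER and once for textcat
texts = [f"Document number {i}" for i in range(10)]
ner_examples = [{"text": text, "component": "ner"} for text in texts]
textcat_examples = [{"text": text, "component": "textcat"} for text in texts]

def split(examples, eval_split=0.2):
    examples = list(examples)
    random.shuffle(examples)
    n_eval = int(len(examples) * eval_split)
    return examples[n_eval:], examples[:n_eval]

# Splitting each component independently...
ner_train, ner_eval = split(ner_examples)
textcat_train, textcat_eval = split(textcat_examples)

# ...means the same text can end up in the training data of one component and
# the evaluation data of the other (this overlap is usually non-empty)
overlap = {eg["text"] for eg in ner_train} & {eg["text"] for eg in textcat_eval}
print(overlap)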

This is the command I use:

prodigy data-to-spacy corpus/ --ner dataset1,dataset2 --lang "en" --eval-split 0.2 --verbose

So it's only NER and I let Prodigy split training/evaluation. When I collected the annotations in Prodigy I used ner.manual and ner.correct.

The warning also appears when I only use one of the datasets. Both seem to be affected. I collected them with different versions of Prodigy. Could it be related to using annotations collected with old versions?

Ah, that's strange then :thinking: How old was the previous version of Prodigy that you used? It's definitely unlikely that older versions would have produced different input hashes for the same document, since the hashing hasn't changed and is just based on the text for NER. But you could check for this pretty easily by iterating over your examples and looking for annotations with identical texts (and potentially different input hashes). If you do find cases of this, then that'd explain what's going on (although it'd still be pretty mysterious how this could have happened).

Prodigy v1.8.5 and newer versions were used.

I've attempted the check that you described:

import jsonlines
from itertools import compress

input_path = "prodigy_data.jsonl"

# Collect all texts and their input hashes from the exported annotations
texts = []
hashes = []
with jsonlines.open(input_path) as reader:
    for obj in reader:
        texts.append(obj["text"])
        hashes.append(obj["_input_hash"])

# For each example, check whether any later example has the same text
# but a different _input_hash
for _ in range(len(texts)):
    text_i = texts.pop(0)
    hash_i = hashes.pop(0)
    duplicate_texts = [text == text_i for text in texts]
    if any(duplicate_texts):
        hashes_i = list(compress(hashes, duplicate_texts))
        assert all(x == hash_i for x in hashes_i)

Running it doesn't give any assertion errors.

Can you think of any common mistakes I might have made when handling the data in Prodigy or when using data-to-spacy?

Unfortunately I still can't resolve the issue. I tried data-to-spacy without an evaluation split and then did the splitting with my own script. But still, debug data reports training examples also in evaluation data.

Is it possible that the problem is related to the issue described in the "Duplicate annotations in output" thread?

Thinking about this some more, I think you've definitely hit an interesting edge case here and we should adjust the way we do the per-component eval split and first group examples together by input hash. It's perfectly fine to have multiple annotations on the same input – in fact, this is a common workflow if you annotate one label at a time and then want to group them all together in the same corpus. But if we do the split on the whole dataset, you can easily end up with an input hash present in both the training and evaluation data, which isn't ideal. (The debug data command currently only checks for identical texts, so it doesn't take the different annotations into account. So there might be cases where an example is in the training data with some label annotations and in the dev data with other annotations. This isn't great, but also not as bad as having an exact duplicate.)
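For reference, here's a rough sketch of what grouping by input hash before splitting could look like, assuming the examples are task dicts with an _input_hash key as exported by db-out (the helper name is just for illustration, not an actual Prodigy function):

import random
from collections import defaultdict

def split_by_input_hash(examples, eval_split=0.2, seed=0):
    # Group annotation tasks by _input_hash first, then split the groups,
    # so all annotations on the same input end up on the same side
    groups = defaultdict(list)
    for eg in examples:
        groups[eg["_input_hash"]].append(eg)
    input_hashes = list(groups)
    random.Random(seed).shuffle(input_hashes)
    n_eval = int(len(input_hashes) * eval_split)
    eval_hashes = input_hashes[:n_eval]
    train_hashes = input_hashes[n_eval:]
    train = [eg for h in train_hashes for eg in groups[h]]
    evaluation = [eg for h in eval_hashes for eg in groups[h]]
    return train, evaluation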

The good news is that unless you have lots of examples in both the training and dev data, it's likely not going to make a huge difference during training. You can also filter out the duplicates yourself in the meantime by loading the .spacy file from disk and creating a new DocBin without the examples that are also in the training data: https://spacy.io/api/docbin#from_disk
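In case it's useful, here's a minimal sketch of that, assuming data-to-spacy wrote train.spacy and dev.spacy into your corpus/ directory (adjust the paths if yours differ – the dev_filtered.spacy output name is just made up):

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Assumed paths: adjust if data-to-spacy wrote its output somewhere else
train_docs = list(DocBin().from_disk("corpus/train.spacy").get_docs(nlp.vocab))
dev_docs = list(DocBin().from_disk("corpus/dev.spacy").get_docs(nlp.vocab))

# Keep only dev docs whose text doesn't also appear in the training data
train_texts = {doc.text for doc in train_docs}
filtered = [doc for doc in dev_docs if doc.text not in train_texts]
print(f"Removed {len(dev_docs) - len(filtered)} duplicate(s) from the dev set")

DocBin(docs=filtered).to_disk("corpus/dev_filtered.spacy")

The default DocBin attributes include the entity annotations, so the NER labels are preserved in the filtered file.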


Thanks for taking another look at the issue. Your explanation makes things a lot clearer :slight_smile:

I'll go with the approach of filtering out the duplicates with a new DocBin.

Hi Paul,

I just wanted to let you know that we've started looking into these issues in more detail. Your report was very helpful and we're working on some fixes on our end to go into the next release.

I tried data-to-spacy without an evaluation split and then did the splitting with my own script.

This is surprising to me. I would assume that in this case, all annotations for the same text are merged into one example. I'll need to investigate this a bit more as well.

We'll keep you updated on our progress!
