data-to-spacy training examples also in evaluation data

Hi, I ran into an issue when using the data-to-spacy recipe.

After running data-to-spacy, spacy debug data gives me this warning: ⚠ 411 training examples also in evaluation data.
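
For reference, this is roughly how I invoke it (file names assumed here to be the defaults that data-to-spacy writes to the output directory):

python -m spacy debug data corpus/config.cfg --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy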

Can this be caused by duplicate _input_hash values in the Prodigy database?

I've already made sure there are no duplicate _task_hash values in the database.
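
To rule that out, something like this can be used to look for repeated hashes (a sketch assuming the default Prodigy database connection and a dataset named dataset1):

from collections import Counter
from prodigy.components.db import connect

# Connect to the Prodigy database and load the annotations of one dataset
db = connect()
examples = db.get_dataset("dataset1")  # adjust to your dataset name

# Count how often each input and task hash occurs
input_counts = Counter(eg["_input_hash"] for eg in examples)
task_counts = Counter(eg["_task_hash"] for eg in examples)

print("duplicate _input_hash:", [h for h, n in input_counts.items() if n > 1])
print("duplicate _task_hash:", [h for h, n in task_counts.items() if n > 1])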

And could data-to-spacy be made to check for this and deduplicate examples before export?

I'm using Prodigy 1.11.6 and spaCy 3.2.0.

Hi! This is definitely very strange – can you share some more details on how you collected the data in Prodigy and how you configured data-to-spacy? Do you let Prodigy split the data into training/evaluation sets, or do you use separate datasets? Are you merging annotations for different tasks?

The data-to-spacy command will use the _input_hash to merge all annotations on the same input, so in theory, it should be impossible to end up with duplicates :thinking: The only possible explanation I can think of is that there's maybe a problem with how annotations from different dataset types are merged... Prodigy uses the same eval split for all the different components (so you end up with 20% of the NER examples and 20% of the textcat examples, and you can't end up with no examples for a given component in the eval data). So maybe example A ends up in the training data for component 1 and in the evaluation data for component 2. We'll definitely investigate!
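
In the meantime, one way to see which texts ended up on both sides is to compare the exported corpora directly – a quick sketch, assuming the default train.spacy and dev.spacy files in your output directory:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Load the serialized training and evaluation corpora
train_docs = DocBin().from_disk("corpus/train.spacy").get_docs(nlp.vocab)
dev_docs = DocBin().from_disk("corpus/dev.spacy").get_docs(nlp.vocab)

# Texts that appear in both sets are the ones debug data warns about
overlap = {doc.text for doc in train_docs} & {doc.text for doc in dev_docs}
print(f"{len(overlap)} texts appear in both train and dev")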

This is the command I use:

prodigy data-to-spacy corpus/ --ner dataset1,dataset2 --lang "en" --eval-split 0.2 --verbose

So it's only NER, and I let Prodigy do the training/evaluation split. When I collected the annotations in Prodigy I used ner.manual and ner.correct.

The warning also appears when I use only one of the datasets, so both seem to be affected. I collected them with different versions of Prodigy – could annotations collected with older versions be the cause?

Ah, that's strange then :thinking: How old was the previous version of Prodigy you used? It's very unlikely that older versions produced different input hashes for the same document, since the hashing hasn't changed and for NER it's just based on the text. But you could check this pretty easily by iterating over your examples and looking for annotations with identical texts (and potentially different input hashes). If you do find cases of this, that would explain what's going on (although it'd still be pretty mysterious how it could have happened).
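
You could also re-hash your examples with the current version and compare against the stored values – a sketch, assuming you've exported your dataset to a JSONL file with db-out:

import jsonlines
from prodigy import set_hashes

with jsonlines.open("dataset1.jsonl") as reader:  # exported via db-out
    for eg in reader:
        # Recompute the hashes from scratch and compare to the stored ones
        rehashed = set_hashes(dict(eg), overwrite=True)
        if rehashed["_input_hash"] != eg["_input_hash"]:
            print("Hash mismatch:", eg["text"][:50])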

I used Prodigy v1.8.5 and newer versions.

I've attempted the check that you described:

import jsonlines
from collections import defaultdict

input_path = "prodigy_data.jsonl"

# Collect all input hashes that occur for each unique text
hashes_by_text = defaultdict(set)
with jsonlines.open(input_path) as reader:
    for obj in reader:
        hashes_by_text[obj["text"]].add(obj["_input_hash"])

# Identical texts should always map to exactly one _input_hash
for text, hashes in hashes_by_text.items():
    assert len(hashes) == 1, f"Conflicting input hashes for: {text[:50]!r}"

Running it doesn't raise any assertion errors.

Can you think of any common mistakes that I might've made when handling the data in Prodigy or when using data-to-spacy?