ValueError: A Token can only be part of one entity [...]

Hey everyone! I was following the "Training a new Entity Type" YouTube tutorial and suddenly got this error:

ValueError: [E103] Trying to set conflicting doc.ents: '(94, 98, 'CONDITION')' and '(94, 98, 'CONDITION')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

I can't really tell what's causing this :frowning:

The sentence that triggered the error was:
"Resolvi usar novamente a carnitina, depois de ler que ele resolve mtos problemas."

Maybe I got the error because the text was in a different language?

Thank you for your help and Greetings from Berlin City! :smiley:

Hi! The language is definitely not the problem here. What the error message is trying to tell you is that you somehow ended up with two entity annotations that overlap - or, in this case, are identical.

I'm a bit confused how this could have happened – normally, Prodigy should only ever show you the same text once, so there's not really a way to generate exact duplicates, because you should never be asked the same thing twice.

When did this error occur? During annotation with ner.teach, or during training with ner.batch-train? Are you using the latest version of Prodigy? And could you run the db-out command to export your dataset and try to find the sentence (e.g. in your editor)? Is it in there twice, or only once?
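If searching by hand in an editor gets tedious, a small script could do the check. This is only a sketch under some assumptions: it expects the export to be JSONL with one record per line, a "text" key, and an optional "spans" list of dicts with "start", "end" and "label". The function name find_duplicates is made up for illustration and is not part of Prodigy.

```python
import json
from collections import Counter


def find_duplicates(lines):
    """Report texts that occur more than once and spans repeated within a record.

    `lines` is any iterable of JSONL strings, e.g. an open file handle.
    """
    text_counts = Counter()
    repeated_spans = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        text = record.get("text", "")
        text_counts[text] += 1
        # Count identical (start, end, label) triples within this record.
        span_counts = Counter(
            (s["start"], s["end"], s.get("label"))
            for s in record.get("spans") or []
        )
        for span, n in span_counts.items():
            if n > 1:
                repeated_spans.append((text, span))
    repeated_texts = [t for t, n in text_counts.items() if n > 1]
    return repeated_texts, repeated_spans
```

You could pass it an open file handle for whatever file you wrote the db-out output to. The first return value lists texts that appear in more than one record, the second lists (text, span) pairs where the same span was recorded more than once on a single record.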

Edit: Can you check if you're using spaCy v2.2? Prodigy isn't officially compatible with the latest version yet, which introduces backwards-incompatible stricter handling of overlapping entities. If you're installing from the Prodigy wheel, it should auto-install the compatible spaCy version. Also see here:

@ines thanks for the tip! I noticed this after upgrading to spaCy v2.2.

Hi ines,

I have the same problem, and in my case it definitely occurred during annotation with Brat. I extracted the text spans whose annotations overlap, and they look like this:

train_data[0][0][1391:1448]
Out[125]: 'carcinoma renal papilar de células claras y eosinofílicas'

train_data[0][0][1391:1414]
Out[126]: 'carcinoma renal papilar'

The expression and its subexpression were both annotated.

How should I deal with that?

If you want to use the data to train a named entity recognition model, you'd have to pick one of the spans and possibly adjust your annotation scheme so it's something the model can learn from most effectively. In the example you posted, the first one looks more like a whole subclause, right? This wouldn't really be a good fit anyways, and likely something a model would struggle to learn because it's pretty far from what's typically considered a named entity (e.g. a proper noun).

It's probably a good approach to prefer shorter spans – you should be able to do this programmatically by just iterating over your annotations, finding duplicates with overlapping start/end indices and filtering for the shortest span. You could also use Prodigy to stream in both versions of the text and manually select which one you prefer / which one makes most sense.
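As a rough illustration of the programmatic route, here is a minimal sketch. It assumes each text's annotations are available as a list of (start, end, label) tuples with character offsets and an exclusive end. The function name prefer_shortest and the label "LABEL" are placeholders, not anything from spaCy or Prodigy.

```python
def prefer_shortest(entities):
    """Drop overlapping or duplicate spans, keeping the shortest one each time.

    `entities` is a list of (start, end, label) tuples with character
    offsets, where `end` is exclusive.
    """
    kept = []
    # Visit shorter spans first so they win over longer spans containing them.
    by_length = sorted(entities, key=lambda e: (e[1] - e[0], e[0]))
    for start, end, label in by_length:
        overlaps = any(
            start < k_end and k_start < end for k_start, k_end, _ in kept
        )
        if not overlaps:
            kept.append((start, end, label))
    # Return the surviving spans in document order.
    return sorted(kept, key=lambda e: (e[0], e[1]))


# Offsets taken from the example above; "LABEL" is a placeholder label.
entities = [(1391, 1448, "LABEL"), (1391, 1414, "LABEL")]
print(prefer_shortest(entities))  # [(1391, 1414, 'LABEL')]
```

Going shortest-first is a simple greedy heuristic: an exact duplicate is dropped because it overlaps the copy already kept, and a long span is dropped whenever a shorter one inside it was kept. If you decide longer spans suit your annotation scheme better, reversing the length in the sort key flips the preference.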