Overlapping Entities

MalikRumi · August 11, 2023, 9:32pm

I wanted to be clear about tokenization and collocations (I prefer this term because I have found it to be more specific in an industry where people use "spans", "ngrams", "phrases", etc. to refer to any run of words that happen to be next to each other in an arbitrary sentence).

When you say entities can't overlap, do you mean "... when they are next to each other in the same sentence"? I assume this is the meaning, because otherwise you couldn't have "New England", "New York", and "New Mexico" in the same document, corpus, or dataset, right?

I'm working with a large dataset of US Government documents. Here is an example of the entity issues I have to deal with:

"The President of the United States, Joseph R. Biden" is 9 words, with at least the following 4 candidate sub-entities:

President
President of the United States
United States
Joseph R. Biden

I assume I can customize the tokenizer to accept all four of these in my entities collection if I capture them separately, and the only difficulty would be if I tried to make them all at once from the "parent" sequence we started with.

Furthermore, I can then use the EntityRuler to link them to the proper disambiguations in the knowledge base, where 1, 2, & 4 would all point to "President Biden" - which is itself accepted as an entity.

Is this correct? Feel free to expound upon any faulty assumptions you see in my question.

Thanks

koaning · August 14, 2023, 12:44pm

The tokenizer splits the text into tokens, and spaCy then uses these tokens to detect sequences that might be of interest. In the case of named entity recognition (NER), the detected spans may not overlap, but in the case of spancat they are allowed to overlap. Have you seen this announcement blog post?
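To make the no-overlap rule concrete, here is a minimal sketch, assuming spaCy v3 and a blank English pipeline (no trained model). The `TITLE` label is made up for illustration. It shows that `doc.ents` rejects two spans that share a token, and that spaCy's `filter_spans` helper can reduce overlapping candidates to a non-overlapping set by preferring the longest span.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

# Blank pipeline: tokenizer only, nothing to download.
nlp = spacy.blank("en")
doc = nlp("The President of the United States, Joseph R. Biden")

# Two candidate spans that share the token "President".
# "TITLE" is a hypothetical label used only for this example.
shorter = Span(doc, 1, 2, label="TITLE")  # "President"
longer = Span(doc, 1, 6, label="TITLE")   # "President of the United States"

# doc.ents allows each token to belong to at most one entity,
# so assigning overlapping spans raises a ValueError.
try:
    doc.ents = [shorter, longer]
    overlap_rejected = False
except ValueError:
    overlap_rejected = True

# filter_spans drops overlaps, keeping the longest span.
doc.ents = filter_spans([shorter, longer])
print(overlap_rejected, [ent.text for ent in doc.ents])
```

Note that the restriction is about spans sharing tokens in the same text, so "New York" and "New Mexico" appearing in different places in a document are not affected.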

If you're interested in detecting spans that may overlap, you may want to use the SpanRuler instead of the EntityRuler.
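As a rough sketch of that suggestion, assuming spaCy 3.3.1 or later (where the `span_ruler` component is available) and a blank English pipeline: the labels and the `president_biden` ID below are hypothetical choices for this example, not values from any real knowledge base. The SpanRuler writes its matches to `doc.spans["ruler"]` by default, and span groups may overlap.

```python
import spacy

# Blank pipeline: tokenizer only, nothing to download.
nlp = spacy.blank("en")

# By default the SpanRuler stores matches in doc.spans["ruler"].
ruler = nlp.add_pipe("span_ruler")

# Labels and the "president_biden" id are illustrative assumptions.
# The optional "id" lets several surface forms share one identifier.
ruler.add_patterns([
    {"label": "TITLE", "pattern": "President", "id": "president_biden"},
    {"label": "TITLE", "pattern": "President of the United States",
     "id": "president_biden"},
    {"label": "GPE", "pattern": "United States"},
    {"label": "PERSON", "pattern": "Joseph R. Biden", "id": "president_biden"},
])

doc = nlp("The President of the United States, Joseph R. Biden")

# Overlapping spans are all kept in the span group.
for span in doc.spans["ruler"]:
    print(span.text, span.label_, span.id_)
```

The shared `id` is only a lightweight way to group surface forms. Resolving them against an actual knowledge base would be a separate entity linking step.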

Does this help?

MalikRumi · August 20, 2023, 10:08am

No, I had not seen this blog post before now. It has taken me nearly an hour to write this response because the link you gave me set me off on a deep dive I hadn't planned on at all. Thanks.
