Overlapping Entities

I want to be clear about tokenization and collocations. (I prefer the term "collocations" because I find it more specific in an industry where people use "spans", "ngrams", "phrases", etc. to refer to any run of words that happen to be next to each other in an arbitrary sentence.)

When you say entities can't overlap, do you mean "... when they are next to each other in the same sentence"? I assume this is the meaning, because otherwise you couldn't have "New England", "New York", and "New Mexico" in the same document, corpus, or dataset, right?

I'm working with a large dataset of US Government documents. Here is an example of the entity issues I have to deal with:

"The President of the United States, Joseph R. Biden" is 9 words, with at least the following 4 candidate sub-entities:

President
President of the United States
United States
Joseph R. Biden

I assume I can customize the tokenizer to accept all four of these in my entities collection if I capture them separately, and the only difficulty would be if I tried to make them all at once from the "parent" sequence we started with.

Furthermore, I can then use the EntityRuler to link them to the proper disambiguations in the knowledge base, where 1, 2, & 4 would all point to "President Biden" - which is itself accepted as an entity.

Is this correct? Feel free to expound upon any faulty assumptions you see in my question.

Thanks

The tokenizer splits the text into tokens, and spaCy then uses these tokens to detect sequences that might be of interest. In named entity recognition (NER), the detected spans may not overlap: each token can belong to at most one entity. With spancat, the detected spans are allowed to overlap. Have you seen this announcement blog post?
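
To make the difference concrete, here is a minimal sketch using your example sentence. It assumes a blank English pipeline (tokenizer only) and builds the spans by hand; the helper `find_span`, the span group name "candidates", and the label "TITLE" are just names I made up for illustration. Assigning overlapping spans to `doc.ents` should raise a `ValueError`, while a span group in `doc.spans` accepts them.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained components
doc = nlp("The President of the United States, Joseph R. Biden")


def find_span(doc, phrase, label):
    """Hypothetical helper: first occurrence of `phrase` as a labelled Span."""
    start = doc.text.index(phrase)
    span = doc.char_span(start, start + len(phrase), label=label)
    if span is None:
        raise ValueError(f"{phrase!r} does not align with token boundaries")
    return span


title = find_span(doc, "President of the United States", "TITLE")
country = find_span(doc, "United States", "GPE")
person = find_span(doc, "Joseph R. Biden", "PERSON")

# doc.ents: a token may belong to at most one entity, and
# "United States" sits inside the TITLE span, so this is rejected.
try:
    doc.ents = [title, country, person]
    ents_accepted = True
except ValueError:
    ents_accepted = False

# doc.spans: a span group may hold overlapping spans.
doc.spans["candidates"] = [title, country, person]
overlapping = [(span.text, span.label_) for span in doc.spans["candidates"]]
```

Note that "overlap" here means sharing tokens within the same Doc. "New York" and "New Mexico" appearing in different places in a text do not overlap in this sense.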

If you're interested in detecting spans that may overlap, you may want to use the SpanRuler instead of the EntityRuler.
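
A rough sketch of what that could look like for your four candidates, assuming a spaCy v3 release recent enough to ship the `span_ruler` component and its default span group key "ruler". The labels are placeholders I picked, not something spaCy prescribes.

```python
import spacy

nlp = spacy.blank("en")

# By default the SpanRuler stores its matches in doc.spans["ruler"]
# and does not filter out overlapping matches.
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns(
    [
        {"label": "TITLE", "pattern": "President"},
        {"label": "TITLE", "pattern": "President of the United States"},
        {"label": "GPE", "pattern": "United States"},
        {"label": "PERSON", "pattern": "Joseph R. Biden"},
    ]
)

doc = nlp("The President of the United States, Joseph R. Biden")
matches = {(span.text, span.label_) for span in doc.spans["ruler"]}
for text, label in sorted(matches):
    print(label, "|", text)
```

All four patterns can match the same sentence because span groups, unlike `doc.ents`, allow nested and overlapping spans.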

Does this help?

No, I had not seen this blog post before now. It has taken me nearly an hour to write this response because the link you gave me set me off on a deep dive I hadn't planned on at all 😜. Thanks.
