I want to be clear about tokenization and collocations (I prefer this term because I have found it to be more specific in an industry where people use "spans", "ngrams", "phrases", etc. to refer to any run of words that happen to be next to each other in an arbitrary sentence).
When you say entities can't overlap, do you mean "... when they are next to each other in the same sentence"? I assume this is the meaning, because otherwise you couldn't have "New England", "New York", and "New Mexico" in the same document, corpus, or dataset, right?
I'm working with a large dataset of US Government documents. Here is an example of the entity issues I have to deal with:
"The President of the United States, Joseph R. Biden" is 9 words, with at least the following 4 candidate sub entities:
1. President
2. President of the United States
3. United States
4. Joseph R. Biden
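To make the overlap concrete, here is a minimal sketch of how I picture these candidates as token ranges. It is plain Python, not any spaCy API: the whitespace tokenization and the helper names `find_span` and `overlaps` are my own assumptions for illustration.

```python
# Whitespace tokenization is a simplifying assumption; the comma is
# pre-separated by hand so it becomes its own token.
TOKENS = "The President of the United States , Joseph R. Biden".split()

CANDIDATES = [
    "President",
    "President of the United States",
    "United States",
    "Joseph R. Biden",
]


def find_span(tokens, phrase):
    """Return (start, end) token offsets of the first match, end exclusive."""
    words = phrase.split()
    for i in range(len(tokens) - len(words) + 1):
        if tokens[i:i + len(words)] == words:
            return (i, i + len(words))
    raise ValueError(f"{phrase!r} not found")


def overlaps(a, b):
    """Two half-open spans overlap if they share at least one token position."""
    return a[0] < b[1] and b[0] < a[1]


SPANS = {c: find_span(TOKENS, c) for c in CANDIDATES}

# Every pair of candidates that shares at least one token.
OVERLAPPING = sorted(
    (a, b)
    for i, a in enumerate(CANDIDATES)
    for b in CANDIDATES[i + 1:]
    if overlaps(SPANS[a], SPANS[b])
)

for pair in OVERLAPPING:
    print(pair)
```

Under these assumptions, "President of the United States" shares tokens with both "President" and "United States", while "Joseph R. Biden" shares tokens with none of the others.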
I assume I can customize the tokenizer to accept all four of these in my entities collection if I capture them separately, and the only difficulty would be if I tried to make them all at once from the "parent" sequence we started with.
Furthermore, I can then use the EntityRuler to link them to the proper disambiguations in the knowledge base, where items 1, 2, and 4 would all point to "President Biden", which is itself accepted as an entity.
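In plain Python terms, the mapping I have in mind looks roughly like the sketch below. The `ALIASES` table and the `resolve` helper are hypothetical names of mine, not part of spaCy or its knowledge base API.

```python
# Hypothetical alias table: surface form -> canonical knowledge-base entry.
ALIASES = {
    "President": "President Biden",
    "President of the United States": "President Biden",
    "Joseph R. Biden": "President Biden",
}


def resolve(mention):
    """Return the canonical entry for a mention, or the mention itself if it has no alias."""
    return ALIASES.get(mention, mention)


for mention in ["President", "United States", "Joseph R. Biden"]:
    print(mention, "->", resolve(mention))
```

Here "United States" has no alias, so it resolves to itself.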
Is this correct? Feel free to expound upon any faulty assumptions you see in my question.
Thanks