Entity Linker related questions

Hi,
I'm trying to use Entity Linking to distinguish between entities with the same name. As a base I've used the "Emerson" example so well explained by Sofie (great video lesson).

The difficulty for me comes from the specific structure of the sentences I have. Here are some examples:

Starship Enterprise 77 5766
Starship Enterprise TOG 77 NN  5766
Starship Enterprise 55783 ML 09
Starship Enterprise ML 09  55799 
Starship Enterprise TOG 1977 NN  5766
Starship Enterprise 55783 ML 2009

Starship Enterprise - this is the vessel name
77, 1977, 09, 2009 - these are the vessel's built year
55783, 5766 - these are the vessel's deadweight tonnage

I've added "Starship Enterprise" as an entity in the Entity Ruler and put the Entity Ruler before the NER in the pipeline, so the entity is properly detected by the model.
As you can see, apart from "Starship Enterprise", which is the entity I want to disambiguate, the sentences contain only numbers and abbreviations.
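For reference, the ruler setup is essentially this sketch (assuming spaCy v3; here I use a blank pipeline for brevity, while in a trained pipeline you'd pass before="ner" to add_pipe):

```python
import spacy

# Minimal sketch: an EntityRuler so "Starship Enterprise" is always
# tagged as an entity. The label "VESSEL" is just an illustrative choice.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "VESSEL", "pattern": "Starship Enterprise"}])

doc = nlp("Starship Enterprise TOG 77 NN 5766")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Starship Enterprise', 'VESSEL')]
```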

Questions:

  1. Is this achievable using the Entity Linker, given the structure of the sentences I have?
  2. Is there a good approach when the text examples are imbalanced, with lots of numbers and few words?
  3. Should I pre-process the numbers and replace them with tokens that are more meaningful to the model?
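To make question 3 concrete, by pre-processing I mean something like this sketch (the placeholder token names and the digit patterns are just illustrative guesses at my data):

```python
import re

def normalize_numbers(text: str) -> str:
    """Hypothetical pre-processing: replace raw numbers with typed
    placeholder tokens so the model sees categories, not digits."""
    tokens = []
    for tok in text.split():
        if re.fullmatch(r"(19|20)\d{2}", tok):   # e.g. 1977, 2009
            tokens.append("YEAR_BUILT")
        elif re.fullmatch(r"\d{2}", tok):        # e.g. 77, 09
            tokens.append("YEAR_BUILT")
        elif re.fullmatch(r"\d{4,5}", tok):      # e.g. 5766, 55783
            tokens.append("DEADWEIGHT")
        else:
            tokens.append(tok)
    return " ".join(tokens)

print(normalize_numbers("Starship Enterprise TOG 1977 NN 5766"))
# Starship Enterprise TOG YEAR_BUILT NN DEADWEIGHT
```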

Thanks!


Hi!

Great to hear the video tutorial's been useful to you!

What isn't clear to me from your explanation is the following:

  • Is there additional context to your sentences, or do they really only contain tokens like "Starship Enterprise ML 09 55799"?

  • What do the entities look like in the knowledge base that you want to link to? Is there just one entity "Starship Enterprise", or are there different ones according to the built year / tonnage?

The current implementation of the entity linker relies heavily on the words in the sentence and compares those to the reference description you've given each entity in your knowledge base.

Should I pre-process the numbers and replace them with tokens that are more meaningful to the model?

Well, it depends: are the numbers meaningful by themselves? If the built year is important for deciding on the final entity ID you would assign, then it would be better to just leave them in.

Is there a good approach when the text examples are imbalanced, with lots of numbers and few words?

I don't think this matters much. If the numbers are not meaningful, you'd hope that the model learns to ignore them.

But as always, the proof is in the pudding :wink:

Hi Sofie!
Thanks for the really quick answer.

Let me answer your questions:

Is there additional context to your sentences, or do they really only contain tokens like "Starship Enterprise ML 09 55799"?

The sentences have different lengths, but most often only those two numbers determine the vessel name. Any other entities would be location, commission, or charterer, which can be the same for different vessels, so they cannot be used to disambiguate the vessel name. Theoretically there can be the same vessel name at the same location with the same charterer and the same commission. That's why I gave the examples without the "noisy" content.

What do the entities look like in the knowledge base that you want to link to? Is there just one entity "Starship Enterprise", or are there different ones according to the built year / tonnage?

The initial knowledge base list for "Starship Enterprise" looks like this:

[
    {'qid': 'Q7528556', 'name': 'Starship Enterprise', 'desc': 'Vessel with name Starship Enterprise IMO 7528556'},
    {'qid': 'Q9122825', 'name': 'Starship Enterprise', 'desc': 'Vessel with name Starship Enterprise IMO 9122825'}
]

Later, when the "Starship Enterprise" name is added to the knowledge base, there is only one alias "Starship Enterprise" with two QIDs.

kb.add_alias(alias=name, entities=qids, probabilities=probs)

probs are calculated as 1 divided by the number of QIDs.
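Put together, the knowledge base construction looks roughly like this sketch (assuming spaCy >= 3.5 for InMemoryLookupKB; on older v3 releases the class was spacy.kb.KnowledgeBase, and the dummy vectors stand in for real description encodings):

```python
import spacy
from spacy.kb import InMemoryLookupKB  # spacy.kb.KnowledgeBase before v3.5

nlp = spacy.blank("en")
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=3)

qids = ["Q7528556", "Q9122825"]
for qid in qids:
    # Dummy 3-d vectors for illustration; in practice each entity
    # vector is an encoding of that entity's description.
    kb.add_entity(entity=qid, freq=1, entity_vector=[0.0, 0.0, 0.0])

# One alias with a uniform prior: 1 divided by the number of QIDs.
probs = [1.0 / len(qids)] * len(qids)
kb.add_alias(alias="Starship Enterprise", entities=qids, probabilities=probs)

print(kb.get_size_entities(), kb.get_size_aliases())
```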

Since more tokens mean more context and better decisions by the model, which would be better: short sentences with the noisy tokens removed, keeping only the tokens meaningful for the entity, or longer sentences with more tokens and a chance of confusion?
To me the first option seems better.

Have great day!

I think if you have sufficiently large data to train on, the "noisy" tokens shouldn't matter, because the model should learn that they are not relevant for deciding on the final ID.

Then, if the phrase "Starship Enterprise TOG 1977 NN 5766" always leads to the same QID, then regardless of whether you use the long sentence or the cleaned variant, the EL model should in time learn to link (the sentence containing) that phrase to the right QID, as long as each QID's description is unique and meaningful.