Entity Linking evaluation

Hi all,

I was wondering how I could evaluate my EL model. I saw in Sofie's talk on slide 18 that she uses the accuracy measure.

Accuracy is defined as (true positives + true negatives)/total population.
But what are TP, TN, FP, and FN? I see the following cases:

  • TP: entities that were linked to their correct identifier in the KB
  • TN: entities that cannot be linked (ID='NIL'), while the right ID is not in the KB
  • FP:
    1. Link a found entity to a wrong ID in the KB, even though the right ID is in the KB
    2. Link a found entity to a wrong ID in the KB, while the right ID is not in the KB
  • FN: Cannot link a found Entity (ID='NIL'), even though the right ID is in the KB
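For concreteness, here is a minimal sketch of how accuracy and Precision/Recall/F1 would be computed from these four counts (the counts themselves are made up for illustration):

```python
# Hypothetical counts for the four cases listed above.
tp, tn, fp, fn = 80, 5, 10, 5

total = tp + tn + fp + fn
accuracy = (tp + tn) / total                        # (TP + TN) / total population
precision = tp / (tp + fp)                          # of all links made, how many were right
recall = tp / (tp + fn)                             # of all linkable mentions, how many were linked
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```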

My questions are:

  1. Am I missing something or can I calculate the accuracy using the counts of these cases?
  2. Or would the TP, TN, FP, and FN be defined in another way? E.g. I am not sure about FP cases 1. and 2.: do they both contribute to FP, or should 2. be left out?
  3. Would it also make sense to look at Precision/Recall/F1-measure using these counts?
  4. How did you calculate the accuracy on slide 18?

Any comments on this appreciated :slight_smile: Thanks


Hi! Thanks for the interest in our NEL work :slight_smile: I think this may be slightly out-of-scope for this forum, but I know some users have been using Prodigy to annotate NEL data, so I'll try to clarify (inline):


I would word this differently, because "the KB" is ambiguous and I'm not 100% sure which one you mean. You have your original knowledge base, like Wikidata, but you also have a pruned version of it on disk, which is the actual KnowledgeBase object. It is pruned because there are very many infrequent aliases that would otherwise blow up performance. So when I use the term "KB", it refers to the pruned version that the algorithm has access to.

So, if the entity can not be linked (prediction="NIL") because the right ID is not in the KB, but there is one annotated in the gold data, that is in fact a FN. If it's "NIL" because the entity was not in Wikidata and thus the gold is also "NIL", then it's a true TN.

This is what is being measured by the "oracle KB" on the slide that obtains 84.2%. If we assume that we can always pick the correct candidate from the list of candidates from the KB, we would still only obtain 84.2% accuracy because the KB is missing some aliases and because the candidate generator doesn't always provide the correct one in its final list.

To summarize: a TN is an entity in the text that does not have a proper ID in Wikidata, and is also not disambiguated to one by the NEL approach. Imagine a news story about a woman in a traffic accident: her full name would be tagged as "PERSON" by NER, but she would likely not have an entry in Wikidata.

When we evaluate the NEL algorithm, we don't make a distinction as to whether or not the ID was in the KB. We just check whether the final result matches the gold annotation. So yes, if we predict a wrong KB ID for any reason, it's a FP. In fact, if we predict "Q342" when it should have been "Q666", that's both a FP ("Q342" is wrong) and a FN ("Q666" is missing). If we predict "Q342" when it should have been "NIL", that's just one FP.
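The counting rules described above can be sketched like this (the gold/predicted ID pairs are hypothetical, "NIL" meaning no link):

```python
def count_outcomes(pairs):
    """Count TP/TN/FP/FN over (gold, predicted) ID pairs, 'NIL' meaning no link.

    A wrong non-NIL prediction counts as both a FP (for the predicted ID)
    and a FN (for the missed gold ID), as described above.
    """
    tp = tn = fp = fn = 0
    for gold, pred in pairs:
        if pred == "NIL" and gold == "NIL":
            tn += 1          # correctly left unlinked
        elif pred == "NIL":
            fn += 1          # missed a gold ID
        elif gold == "NIL":
            fp += 1          # linked a mention that has no gold ID
        elif pred == gold:
            tp += 1          # correct link
        else:
            fp += 1          # wrong ID predicted...
            fn += 1          # ...and the gold ID was missed
    return tp, tn, fp, fn

pairs = [("Q342", "Q342"), ("Q666", "Q342"), ("NIL", "NIL"), ("Q7", "NIL")]
print(count_outcomes(pairs))  # (1, 1, 1, 2)
```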

Right: there is a gold annotation, so the ID exists in Wikidata, but the prediction is "NIL". This may have various reasons (the ID is not in the KB, the ID was not produced by the candidate generator, or the sentence encoder wasn't sure and didn't make a decision).

I'm very confused as to why I wrote "accuracy" on these slides. The numbers reported are F-scores - I'm certain of that and just double-checked the run logs. I'm usually very picky about evaluations and how they're named, so I'm pretty surprised by this, but that's how it is. Apologies for the confusion!

You can find the original code used to calculate the metrics here: https://github.com/explosion/projects/blob/master/nel-wikipedia/entity_linker_evaluation.py. Do note that this assumes there are no gold "NIL"s in the data, which makes the logic a bit simpler.
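With no gold "NIL"s, a micro-averaged F-score reduces to something like this sketch (this is my own illustration, not the actual code from that repo):

```python
def micro_prf(pairs):
    """Micro-averaged P/R/F1 over (gold, predicted) ID pairs,
    assuming every gold mention has a real ID (no gold NILs)."""
    correct = sum(1 for gold, pred in pairs if pred == gold)
    predicted = sum(1 for _, pred in pairs if pred != "NIL")
    gold_total = len(pairs)  # every mention has a gold ID by assumption
    p = correct / predicted if predicted else 0.0
    r = correct / gold_total if gold_total else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

pairs = [("Q1", "Q1"), ("Q2", "Q3"), ("Q4", "NIL"), ("Q5", "Q5")]
p, r, f = micro_prf(pairs)
# correct=2, predicted=3, gold=4
```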

Let me know if anything else is unclear!


Thank you very much @SofieVL, your detailed answer helps me a lot.

Indeed I use Prodigy to annotate my EL data, and I was also wondering if this is the right forum to ask. If you want, I can repost the question on Stack Overflow, as it might help others in the future.

That is also the one I meant. Thanks.

Thank you for the clarification!

Thanks, this is great!


No that's fine, we'll leave the discussion here for others to find :slight_smile:

BTW, one more thing I forgot to mention: spaCy v3 will have a built-in scoring mechanism for entity links: https://nightly.spacy.io/api/scorer#score_links, and the code is a bit more cleanly implemented than what I linked before: https://github.com/explosion/spaCy/blob/develop/spacy/scorer.py#L469
