Hi! Thanks for the interest in our NEL work! I think this may be slightly out of scope for this forum, but I know some users have been using Prodigy to annotate NEL data, so I'll try to clarify (inline):
I would word this differently, because "the KB" is ambiguous and I'm not 100% sure which one you mean. You have your original knowledge base, like Wikidata, but you also have the actual KnowledgeBase object on disk. The latter is a pruned version of the former, because there are very many infrequent aliases that would otherwise blow up performance. So when I use the term "KB", it refers to this pruned version that the algorithm has access to.
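To make that distinction concrete, here's a minimal sketch of what building such a pruned KnowledgeBase looks like (method names as in spaCy v2.x; newer spaCy versions rename the in-memory class to InMemoryLookupKB with a slightly different interface, and the IDs, frequencies and vectors below are made-up illustrations):

```python
import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.blank("en")
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)

# Add the entities that survived pruning, with their embedding vectors.
kb.add_entity(entity="Q342", freq=100, entity_vector=[0.0] * 64)
kb.add_entity(entity="Q666", freq=20, entity_vector=[0.0] * 64)

# Only frequent aliases are kept; rare ones are dropped during pruning
# to keep candidate generation fast.
kb.add_alias(alias="Example Name", entities=["Q342", "Q666"],
             probabilities=[0.8, 0.2])

# The candidate generator can only ever propose what's in this list.
print([c.entity_ for c in kb.get_candidates("Example Name")])
```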
So if the entity cannot be linked (prediction = "NIL") because the right ID is not in the KB, but there is one annotated in the gold data, that is in fact an FN. If it's "NIL" because the entity is not in Wikidata and thus the gold is also "NIL", then it's a true TN.
This is what is measured by the "oracle KB" result on the slide, which obtains 84.2%: even if we assume we can always pick the correct candidate from the list the KB provides, we would still only get 84.2%, because the KB is missing some aliases and because the candidate generator doesn't always include the correct entity in its final list.
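In other words, the oracle number is the ceiling imposed by candidate generation: the share of mentions for which the correct answer is reachable at all. A small sketch of that computation, on purely hypothetical data (the `candidates` dict stands in for a real candidate generator):

```python
# A mention is "recoverable" if its gold ID is among the KB's
# candidates, or if the gold is NIL (the oracle then predicts NIL).
gold = {"mention A": "Q342", "mention B": "Q666", "mention C": "NIL"}
candidates = {
    "mention A": ["Q342", "Q100"],
    "mention B": ["Q101"],  # the correct ID was pruned from the KB
    "mention C": [],
}

recoverable = sum(
    1 for mention, gold_id in gold.items()
    if gold_id == "NIL" or gold_id in candidates[mention]
)
print(f"oracle ceiling: {recoverable / len(gold):.1%}")  # 66.7%
```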
To summarize: a TN is an entity in the text that does not have a proper ID in Wikidata, and is also not disambiguated to one by the NEL approach. Imagine a news story about a woman in a traffic accident: her full name would be NER'd as "PERSON", but she would likely not have an entry in Wikidata.
When we evaluate the NEL algorithm, we don't make a distinction as to whether or not the ID was in the KB; we just check whether the final result matches the gold annotation. So yes, if we predict a wrong KB ID for any reason, it's an FP. In fact, if we predict "Q342" when it should have been "Q666", that counts as both an FP ("Q342" is wrong) and an FN ("Q666" is missing). If we predict "Q342" when it should have been "NIL", that's just one FP.
Right: there is a gold annotation, so the ID exists in Wikidata, but the prediction is "NIL". This can happen for various reasons: the ID is not in the KB, the candidate generator didn't produce it, or the sentence encoder wasn't sure and didn't make a decision.
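Putting the counting rules from above together in one place, here's a sketch on hypothetical (gold, predicted) ID pairs:

```python
from collections import Counter

pairs = [
    ("Q342", "Q342"),  # correct link            -> TP
    ("Q666", "Q342"),  # wrong ID                -> FP and FN
    ("NIL",  "Q342"),  # linked but shouldn't be -> FP
    ("Q666", "NIL"),   # missed link             -> FN
    ("NIL",  "NIL"),   # correctly left unlinked -> TN
]

counts = Counter()
for gold, pred in pairs:
    if gold == pred:
        counts["TP" if gold != "NIL" else "TN"] += 1
    else:
        if pred != "NIL":
            counts["FP"] += 1  # the predicted ID is wrong
        if gold != "NIL":
            counts["FN"] += 1  # the gold ID was missed
print(counts)  # Counter({'FP': 2, 'FN': 2, 'TP': 1, 'TN': 1})
```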
I'm very confused as to why I wrote "accuracy" on these slides. The numbers reported are F-scores; I'm certain of that and just double-checked the run logs. I'm usually very picky about evaluations and how they're named, so I'm pretty surprised by this, but that's how it is. Apologies for the confusion!
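For completeness, the F-score follows from the TP/FP/FN counts in the usual way; note that TNs don't enter into it at all, which is the main way it differs from accuracy (counts below are just the ones from the sketch above):

```python
# F-score from confusion counts: TNs play no role here, unlike in
# accuracy, which is (TP + TN) / total.
tp, fp, fn = 1, 2, 2
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```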