📺 Video: Training a custom entity linking model with spaCy & Prodigy

In this new video, @SofieVL shows how to use spaCy and Prodigy to train a custom entity linking model from scratch, disambiguating different mentions of the person "Emerson" to unique identifiers in a knowledge base. It uses a custom Prodigy recipe to create the training data, and all code and data used in the video are published on GitHub. Resolving and disambiguating named entities is something I've seen come up on this forum in the past, so I'm sure many of you will find this video helpful 🙂

You can follow along in the notebook here:

And here's the Prodigy recipe used:


Hi @SofieVL, thanks for the great demo. I am looking forward to using Prodigy for our entity linking/disambiguation component. In this example, there is one choice block per sentence, shown on the UI one at a time, and conveniently every sentence contains exactly one entity, "Emerson" in this case. I am looking into annotating sentences or snippets that will typically contain 1-5 entities (rarely 10-20). Is there a way to include a choice block for each entity? In that case, every span would need an id property where we store the normalized IDs. Is this actually possible with Prodigy? Many thanks in advance.


Hi @dicle, happy to hear the video was useful to you!

In general I wouldn't recommend trying to annotate multiple entity links per sentence at the same time. If there are multiple occurrences of the same mention/alias (e.g. "Emerson" occurring twice in the same sentence), then you could highlight both spans and annotate them in one go, as they'll typically refer to the same person, except on rare occasions.

If you have multiple different mentions per sentence though, I would make sure that you have one task per entity. If you don't shuffle the stream, you'll still get them presented one by one, while you have the same sentence fresh in your mind. That should make things easier for the annotator, and you can work with the auto-accept function etc.

If you keep the same input hash per entity/task you're annotating, it should be straightforward to put the annotations back together afterwards.
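
For example, a rough sketch of that stream logic (assuming your incoming examples already carry the entity spans) could look like this:

from prodigy import set_hashes

def one_task_per_entity(stream):
    # split each incoming example into one task per entity span
    for eg in stream:
        for span in eg.get("spans", []):
            task = {"text": eg["text"], "spans": [span]}
            # hash on the text only, so all tasks from the same sentence
            # share an _input_hash and can be regrouped afterwards
            yield set_hashes(task, input_keys=("text",), overwrite=True)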


Hi @SofieVL, Ok, many thanks! I have another question as well. We have normalized ID suggestions from a rule-based normalizer. We would like to load these into Prodigy and have the user either accept them or correct them by choosing another alternative from the options. Is that possible for entity linking? I know it is possible to load suggestions for the NER step, but what about entity linking?

Yes, that should certainly be possible. I'm not sure whether you've worked with Prodigy before, but basically you can script the annotation recipe, just like I did in the NEL video you linked. At the point where I define the candidate options to show on the annotation interface, you should be able to pull in your rule-based suggestions.
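
As a rough sketch - the function and argument names here are just placeholders for however your normalizer passes its suggestion along, and the NIL options simply mirror the style of the video recipe - that part could look something like:

def add_options(task, candidates, rule_based_id=None):
    # candidates: list of (kb_id, description) tuples for this mention
    task["options"] = [
        {"id": kb_id, "text": kb_id + ": " + desc} for kb_id, desc in candidates
    ]
    task["options"].append({"id": "NIL_otherLink", "text": "Link not in options"})
    task["options"].append({"id": "NIL_ambiguous", "text": "Need more context"})
    # pre-select the rule-based suggestion, so the annotator only has to
    # accept it (or pick a different option instead)
    if rule_based_id is not None:
        task["accept"] = [rule_based_id]
    return task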

Hi, thanks. Yes, indeed - I hadn't thought of it the way you described. That would work. Thanks!


Hi, I'm in the same situation where I have multiple entities per sentence. It works fine in that I have the same input hash; however, the stream keeps getting shuffled. How do I stop the stream from getting shuffled?

I'm also having trouble training the entity linking model when there is more than one entity per sentence -- because my task is still one entity per task, I end up with multiple tasks each containing one span, but the NER model recognizes more than one span, causing training to fail with this error:

RuntimeError: [E188] Could not match the gold entity links to entities in the doc - make sure the gold EL data refers to valid results of the named entity recognizer in the nlp pipeline.

Hi! That's definitely annoying, because you want to be able to annotate the entities from the same sentence in sequence and not have the input shuffled. I wonder where this happens though. Are you using a custom recipe? Is there a random.shuffle statement in there?

If you can share the recipe I'd be happy to help look into this further!

This error is typically thrown when the gold-standard "links" do not align with the entities in your data. Could you share the code showing how you're defining the entities to annotate in Prodigy, and then how you train on the resulting annotations? You probably need to ensure that you use the same NER in both steps.
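
As a quick sanity check - just a sketch, the path is a placeholder for your own pipeline - you could compare the predicted entity offsets against the offsets in your gold links:

import spacy

# load the same pipeline you use when training the entity linker
nlp = spacy.load("/path/to/your/pipeline")

def check_alignment(text, links):
    doc = nlp(text)
    predicted = {(ent.start_char, ent.end_char) for ent in doc.ents}
    missing = [offsets for offsets in links if offsets not in predicted]
    if missing:
        print("Gold links with no matching predicted entity:", missing)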

Hi Sofie,

I know the reason for this error -- I am using the exact same NER, but sometimes the NER model finds more than one entity per data sample, while the pipeline assumes only one entity per data sample. I just need to write a pipeline that does not assume one entity per data sample.

Can you please provide some guidance on how to do this in spaCy v2? (I spent some time trying to do it in spaCy v3 because it seemed clearer there, but unfortunately Prodigy doesn't work with spaCy v3 yet.)

As for the custom recipe, mine was very similar to your Emerson recipe here: https://github.com/explosion/projects/blob/master/nel-emerson/scripts/el_recipe.py -- I have not added any random.shuffle in my recipe and don't find any in yours either. However, it does indeed shuffle my data.

Hi Sofie, never mind, I have decided to go with spaCy v3 again. No need to address my previous questions. I will keep asking if I run into more issues.


Ok, sure!

Hi @SofieVL ! I'm currently trying to adapt your video approach to my problem, which is disambiguating medical terms to their respective codes in a medical KB. I decided to build a demo for this: I trained a French spaCy model to recognize the terms, using texts annotated with the PhraseMatcher, and stored the model to disk.

Next I followed your steps for building a KB for my problem. But in the Prodigy annotation step, it seems you use a built-in NER model. My question is: can I use my updated spaCy NER model as input, or do I have to start over and train a model directly in Prodigy?

I'd much appreciate any feedback you can give me! Thank you!

Hi Rafael,

You should definitely be able to use your custom NER model. In the Prodigy annotation recipe, an NLP model with a pretrained NER is loaded - this can be any pretrained pipeline you like. Specifically, the EntityRecognizer you see in the Prodigy recipe code will work with the "ner" component of the pipeline you load.
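
Concretely, the relevant lines of the recipe boil down to something like this (the path is just a placeholder for wherever you saved your French model):

import spacy
from prodigy.models.ner import EntityRecognizer

# your own pipeline with its trained "ner" component
nlp = spacy.load("/path/to/your/french_medical_model")
model = EntityRecognizer(nlp)  # wraps the "ner" component of that pipeline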

Does that answer your question?

Hi @SofieVL,
Thank you for the great video! I'm also trying to apply the entity linking approach from the video to my uni project. Sometimes I have several entities in a sentence, for example:

"The Pakistani Supreme Court has abolished the death penalty for Asia Bibi, a Christian accused of insulting the Prophet Muhammad."

So I want to get the links for "The Pakistani Supreme Court", "Asia Bibi" and also "Prophet Muhammad". I wonder whether there is a way to put several spans and QIDs into the annotation. As I see it, the suggested format is a dictionary - gold_dict = {"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}} - and not a list, as in the training data for an entity recognizer - gold_dict = {"entities": ["U-PERS", "O", "O", "B-LOC", "L-LOC"]}.
Does that mean I can only have one link per sentence? My first intuition was to do it like this:
gold_dict = {'links': [
    {(65, 88): {'ORG-Supreme-Court-of-Pakistan': 1.0}},
    {(218, 227): {'PER-Asia-Bibi': 1.0}},
    {(262, 271): {'PER-Prophet-Muhammad': 1.0}}]}

But of course it doesn't work, because the format is different: it should be a dict, not a list of dicts. So the question is, what do I do if there are several entities in a sentence?
Thank you for your answer in advance!

Hi!

You're right - it needs to be a dict, and the dict takes the entity offsets as keys. So for multiple entities in one sentence, you can do:

gold_dict = {'links':
    {
        (65, 88): {'ORG-Supreme-Court-of-Pakistan': 1.0},
        (218, 227): {'PER-Asia-Bibi': 1.0},
        (262, 271): {'PER-Prophet-Muhammad': 1.0}
    }
}
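
One thing worth double-checking is that the character offsets really point at the mentions you intend - e.g. with a quick print like this:

def check_offsets(text, gold_dict):
    # text is the document string the offsets were computed against;
    # each slice should print the surface mention, e.g. "Asia Bibi"
    for (start, end), candidates in gold_dict["links"].items():
        print(repr(text[start:end]), "->", candidates)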

Hi @SofieVL !

Thanks again for this excellent work -- I have the NEL recipe up and running in Prodigy and have happily begun annotating. Like @mumud123 above, I have an issue with multiple entities per doc, and I understand from the comments that the EntityLinker pipeline component assumes one entity and linkage per data sample in training.

I have multiple entities per sentence (with the same label, but potentially linked to different KB candidates) and was trying to iterate over each ent in doc.ents for annotation purposes and later training. Put another way, I am hoping to do multiple annotations per sentence, showing each entity mention in sequence to the annotator. I have confirmed that the NER component identifies the correct labels.

I tried moving the doc parsing up in the _add_options block so that I could iterate over each ent in doc.ents rather than each mention in task["spans"], but I still wasn't able to yield each ent in sequence from my target sentence. I think my confusion lies in the distinctions between tasks, texts, and the stream, so I will happily dig through the docs if you could point me there -- or if the solution is obvious to you, I would also appreciate some pointers!

Thanks again,
Adam

Hi @adamkgoldfarb ,

Since I recently went through this problem, maybe this will help you:

The entities are all still in the stream; they're just shuffled into a random order. So what I did was sort the stream by _input_hash - my line of code is this, right after _add_options:

stream = sorted(list(stream), key=lambda obj: obj['_input_hash'])

(I had tried adding shuffle=False, which I thought would do the same thing, but it did not work for me - that's what I was talking about above; maybe the team can look into this?)

This causes a bit of a delay on startup - but I have used this with streams in the hundreds of thousands and it is not too bad. It also doesn't sort the entities within a sentence in order, but that was fine for me.

Then later, when you are training, you need to collect all the entities that share an _input_hash and add them to the same example's spans.
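
Roughly, that merging step looks something like this for me (assuming you've exported the annotations to a JSONL file, e.g. with prodigy db-out):

import json
from collections import defaultdict

merged = defaultdict(lambda: {"text": None, "spans": []})
with open("annotations.jsonl") as f:  # your exported annotations
    for line in f:
        eg = json.loads(line)
        if eg.get("answer") != "accept":
            continue
        merged[eg["_input_hash"]]["text"] = eg["text"]
        merged[eg["_input_hash"]]["spans"].extend(eg.get("spans", []))

# one merged example per sentence, each with all of its annotated spans
examples = list(merged.values())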

Hope that helps!


Thanks @mumud123 ! I tried that out -- the startup time is LOOONG, and I worry I might run up against memory constraints for other users if I load the full corpus into memory, but I will continue to play with it.

I'm wondering if we're talking about the same issue though -- in one sentence, we have multiple different entities with the same label. For example (not the exact ents, but this gives you an idea):

"New York (NY), New Jersey (NJ) and CT are the states in the tri-state area; South Jersey [false positive] is part of Philadelphia (PA), so is not considered part of the tri-state area. Central New Jersey [false positive] does not exist."

I want to link each bolded (state) entity to its canonical entity in the KB or otherwise flag it as NIL, but that would only work if we can cycle through each entity in the sentence and present a different list of options for each. I think that would mean a separate task for each entity, but I'm not clear on how to modify the stream to create that. So far my recipe is:

  1. Highlighting all entity mentions but
  2. Only surfacing options for the first highlighted entity and then
  3. Going to the next sentence completely, skipping the other entities in the sentence

From typing this all out, I feel like I need to go read the task docs-- will follow up!

Thanks again for your thoughts, @mumud123 !

@SofieVL if you have any other suggestions, I'm all ears!

Thanks,

Adam