📺 Video: Training a custom entity linking model with spaCy & Prodigy

In this new video, @SofieVL shows how to use spaCy and Prodigy to train a custom entity linking model from scratch, disambiguating different mentions of the person "Emerson" to unique identifiers in a knowledge base. It uses a custom Prodigy recipe to create the training data, and all code and data used in the video are published on GitHub. Resolving and disambiguating named entities is something I've seen come up on this forum in the past, so I'm sure many of you will find this video helpful 🙂

You can follow along in the notebook here:

And here's the Prodigy recipe used:


Hi @SofieVL, thanks for the great demo. I am looking forward to using Prodigy for our entity linking/disambiguation component. In this example, there is one choice block per sentence, shown on the UI one at a time, and conveniently every sentence contains exactly one entity, "Emerson" in this case. I am looking into annotating sentences or snippets that will typically contain 1-5 entities (rarely 10-20). Is there a way to include a choice block for each entity? In that case, every span would need an id property where we store the normalized IDs. Is this actually possible with Prodigy? Many thanks in advance.


Hi @dicle, happy to hear the video was useful to you!

In general I wouldn't recommend trying to annotate multiple entity links per sentence at the same time. If there are multiple occurrences of the same mention/alias (e.g. "Emerson" occurring twice in the same sentence), then you could highlight both spans and annotate them in one go, as they'll typically refer to the same person, except on rare occasions.

If you have multiple different mentions per sentence though, I would make sure that you have one task per entity. If you don't shuffle the stream, you'll still get them presented one by one, while you have the same sentence fresh in your mind. That should make things easier for the annotator, and you can work with the auto-accept function etc.

If you keep the same input hash per entity/task you're annotating, it should be straightforward to put the annotations back together afterwards.
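
For example, a rough sketch of that stream logic (assuming your incoming examples already carry the entity spans) could look like this:

from prodigy import set_hashes

def one_task_per_entity(stream):
    # split each incoming example into one task per entity span
    for eg in stream:
        for span in eg.get("spans", []):
            task = {"text": eg["text"], "spans": [span]}
            # hash on the text only, so all tasks from the same sentence
            # share an _input_hash and can be regrouped afterwards
            yield set_hashes(task, input_keys=("text",), overwrite=True)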


Hi @SofieVL, Ok, many thanks! I have another question as well. We have normalized ID suggestions from a rule-based normalizer. We would like to load these into Prodigy and have the user either accept them or correct them by choosing another alternative from the options. Is that possible for entity linking? I know it is possible to load suggestions for the NER step, but what about entity linking?

Yes, that should certainly be possible. I'm not sure whether you've worked with Prodigy before, but basically you can script the annotation recipe, just like I did in the NEL video you linked. At the point where I define the candidate options to show on the annotation interface, you should be able to pull in your rule-based suggestions.
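
As a rough sketch - the function and argument names here are just placeholders for however your normalizer passes its suggestion along, and the NIL options simply mirror the style of the video recipe - that part could look something like:

def add_options(task, candidates, rule_based_id=None):
    # candidates: list of (kb_id, description) tuples for this mention
    task["options"] = [
        {"id": kb_id, "text": kb_id + ": " + desc} for kb_id, desc in candidates
    ]
    task["options"].append({"id": "NIL_otherLink", "text": "Link not in options"})
    task["options"].append({"id": "NIL_ambiguous", "text": "Need more context"})
    # pre-select the rule-based suggestion, so the annotator only has to
    # accept it (or pick a different option instead)
    if rule_based_id is not None:
        task["accept"] = [rule_based_id]
    return task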

Hi, thanks. Yes, indeed - I hadn't thought of it the way you described. That would work. Thanks!


Hi, I'm in the same situation where I have multiple entities per sentence. It works fine in that I have the same input hash; however, the stream keeps getting shuffled. How do I stop the stream from getting shuffled?

I'm also having trouble training the entity linking model when there is more than one entity per sentence -- because my task is still one entity per task, I end up with multiple tasks each containing one span, but the NER model recognizes more than one span, causing training to fail with this error:

RuntimeError: [E188] Could not match the gold entity links to entities in the doc - make sure the gold EL data refers to valid results of the named entity recognizer in the nlp pipeline.

Hi! That's definitely annoying, because you want to be able to annotate the entities from the same sentence in sequence and not have the input shuffled. I wonder where this happens though. Are you using a custom recipe? Is there a random.shuffle statement in there?

If you can share the recipe I'd be happy to help look into this further!

This error is typically thrown when the gold-standard "links" do not align with the entities in your data. Could you share the code showing how you're defining the entities to annotate in Prodigy, and then how you train on the resulting annotations? You probably need to ensure that you use the same NER in both steps.
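
As a quick sanity check - just a sketch, the path is a placeholder for your own pipeline - you could compare the predicted entity offsets against the offsets in your gold links:

import spacy

# load the same pipeline you use when training the entity linker
nlp = spacy.load("/path/to/your/pipeline")

def check_alignment(text, links):
    doc = nlp(text)
    predicted = {(ent.start_char, ent.end_char) for ent in doc.ents}
    missing = [offsets for offsets in links if offsets not in predicted]
    if missing:
        print("Gold links with no matching predicted entity:", missing)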

Hi Sofie,

I know the reason for this error -- I am using the exact same NER, but sometimes the NER model finds more than one entity per data sample, while the pipeline assumes only one entity per data sample. I just need to write a pipeline that does not assume one entity per data sample.

Can you please provide some guidance on how to do this in spaCy v2? (I spent some time trying to do it in spaCy v3 because it seemed clearer there, but unfortunately Prodigy doesn't work with spaCy v3 yet.)

As for the custom recipe, mine was very similar to your Emerson recipe here: https://github.com/explosion/projects/blob/master/nel-emerson/scripts/el_recipe.py -- I have not added any random.shuffle in my recipe and don't find any in yours either. However, it does indeed shuffle my data.

Hi Sofie, never mind, I have decided to go with spaCy v3 again. No need to address my previous questions. I will keep asking if I run into more issues.


Ok, sure!

Hi @SofieVL ! I'm currently trying to adapt your video approach to my problem, which is disambiguating medical terms to their respective codes in a medical KB. I decided to build a demo for this: I trained a French spaCy model to recognize the terms, using texts annotated with the PhraseMatcher, and stored the model to disk.

Next I followed your steps for building a KB for my problem. But in the Prodigy annotation step, it seems you use a built-in NER model. My question is: can I use my updated spaCy NER model as input, or do I have to start over and train a model directly in Prodigy?

I'd much appreciate any feedback you can give me! Thank you!

Hi Rafael,

You should definitely be able to use your custom NER model. In the Prodigy annotation recipe, an NLP model with a pretrained NER is loaded - this can be any pretrained pipeline you like. Specifically, the EntityRecognizer you see in the Prodigy recipe code will work with the "ner" component of the pipeline you load.
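
Concretely, the relevant lines of the recipe boil down to something like this (the path is just a placeholder for wherever you saved your French model):

import spacy
from prodigy.models.ner import EntityRecognizer

# your own pipeline with its trained "ner" component
nlp = spacy.load("/path/to/your/french_medical_model")
model = EntityRecognizer(nlp)  # wraps the "ner" component of that pipeline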

Does that answer your question?

Hi @SofieVL,
Thank you for the great video! I'm also trying to apply the entity linking approach from the video to my uni project. Sometimes I have several entities in a sentence, for example:

"The Pakistani Supreme Court has abolished the death penalty for Asia Bibi, a Christian accused of insulting the Prophet Muhammad."

So I want to get the links for "The Pakistani Supreme Court", "Asia Bibi" and also "Prophet Muhammad". I wonder whether there is a way to put several spans and QIDs into the annotation. As I see it, the suggested format is a dictionary - gold_dict = {"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}} - and not a list, as in the training data for an entity recognizer - gold_dict = {"entities": ["U-PERS", "O", "O", "B-LOC", "L-LOC"]}.
Does that mean I can only have one link per sentence? My first intuition was to do it like this:
gold_dict = {'links': [
    {(65, 88): {'ORG-Supreme-Court-of-Pakistan': 1.0}},
    {(218, 227): {'PER-Asia-Bibi': 1.0}},
    {(262, 271): {'PER-Prophet-Muhammad': 1.0}}]}

But of course it doesn't work, because the format is different: it should be a dict, not a list of dicts. So the question is, what do I do if there are several entities in a sentence?
Thank you for your answer in advance!

Hi!

You're right - it needs to be a dict, and the dict takes the entity offsets as keys. So for multiple entities in one sentence, you can do:

gold_dict = {'links':
    {
        (65, 88): {'ORG-Supreme-Court-of-Pakistan': 1.0},
        (218, 227): {'PER-Asia-Bibi': 1.0},
        (262, 271): {'PER-Prophet-Muhammad': 1.0}
    }
}
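
One thing worth double-checking is that the character offsets really point at the mentions you intend - e.g. with a quick print like this:

def check_offsets(text, gold_dict):
    # text is the document string the offsets were computed against;
    # each slice should print the surface mention, e.g. "Asia Bibi"
    for (start, end), candidates in gold_dict["links"].items():
        print(repr(text[start:end]), "->", candidates)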

Hi @SofieVL !

Thanks again for this excellent work -- I have the NEL recipe up and running in Prodigy and have happily begun annotating. Like @mumud123 above, I have an issue with multiple entities per doc, and I understand from the comments that the EntityLinker pipeline component assumes one entity and linkage per data sample in training.

I have multiple entities per sentence (with the same label, but potentially linked to different KB candidates) and was trying to iterate over each ent in doc.ents for annotation purposes and later training. Put another way, I am hoping to do multiple annotations per sentence, showing each entity mention in sequence to the annotator. I have confirmed that the NER component identifies the correct labels.

I tried moving the doc parsing up in the _add_options block so that I could iterate over each ent in doc.ents rather than each mention in task["spans"], but I still wasn't able to yield each ent in sequence from my target sentence. I think my confusion lies in the distinctions between tasks, texts, and the stream, so I will happily dig through the docs if you could point me there -- or if the solution is obvious to you, I would also appreciate some pointers!

Thanks again,
Adam

Hi @adamkgoldfarb ,

Since I recently went through this problem, maybe this will help you:

The entities are all still in the stream; they're just shuffled into a random order. So what I did was sort the stream by _input_hash - my line of code is this, right after _add_options:

stream = sorted(list(stream), key=lambda obj: obj['_input_hash'])

(I had tried adding shuffle=False, which I thought would do the same thing, but it did not work for me - that's what I was talking about above; maybe the team can look into this?)

This causes a bit of a delay on startup - but I have used this with streams in the hundreds of thousands and it is not too bad. It also doesn't sort the entities within a sentence in order, but that was fine for me.

Then later, when you are training, you need to collect all the entities that share an _input_hash and add them to the same example's spans.
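
Roughly, that merging step looks something like this for me (assuming you've exported the annotations to a JSONL file, e.g. with prodigy db-out):

import json
from collections import defaultdict

merged = defaultdict(lambda: {"text": None, "spans": []})
with open("annotations.jsonl") as f:  # your exported annotations
    for line in f:
        eg = json.loads(line)
        if eg.get("answer") != "accept":
            continue
        merged[eg["_input_hash"]]["text"] = eg["text"]
        merged[eg["_input_hash"]]["spans"].extend(eg.get("spans", []))

# one merged example per sentence, each with all of its annotated spans
examples = list(merged.values())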

Hope that helps!


Thanks @mumud123 ! I tried that out -- the startup time is LOOONG, and I worry I might run up against memory constraints for other users if I load the full corpus into memory, but I will continue to play with it.

I'm wondering if we're talking about the same issue though -- in one sentence, we have multiple different entities with the same label. For example (not the exact ents, but this gives you an idea):

"New York (NY), New Jersey (NJ) and CT are the states in the tri-state area; South Jersey [false positive] is part of Philadelphia (PA), so is not considered part of the tri-state area. Central New Jersey [false positive] does not exist."

I want to link each bolded (state) entity to its canonical entity in the KB or otherwise flag it as NIL, but that would only work if we can cycle through each entity in the sentence and present a different list of options for each. I think that would mean a separate task for each entity, but I'm not clear on how to modify the stream to create that. So far my recipe is:

  1. Highlighting all entity mentions but
  2. Only surfacing options for the first highlighted entity and then
  3. Going to the next sentence completely, skipping the other entities in the sentence

From typing this all out, I feel like I need to go read the task docs-- will follow up!

Thanks again for your thoughts, @mumud123 !

@SofieVL if you have any other suggestions, I'm all ears!

Thanks,

Adam