📺 Video: Training a custom entity linking model with spaCy & Prodigy

I think it is the same problem, actually -- I'm definitely able to set a different KB entity for each bolded text. The EntityRecognizer model generates a separate task for each entity, so you should be able to set different options for each bolded mention -- the tasks are just shuffled for me.

if it helps, here is my code:

import spacy
from spacy.kb import KnowledgeBase
from prodigy import set_hashes
from prodigy.components.loaders import JSONL
from prodigy.models.ner import EntityRecognizer

nlp = spacy.load(nlp_dir)
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=300)
kb.from_disk(kb_loc)
model = EntityRecognizer(nlp)
# ...
stream = JSONL(source)
stream = [set_hashes(eg) for eg in stream]
# the EntityRecognizer yields (score, example) tuples -- one task per entity
stream = (eg for score, eg in model(stream))
stream = _add_options(stream, kb, id_dict, entities_dt)
# restore document order by sorting on the input hash
stream = sorted(stream, key=lambda obj: obj["_input_hash"])

Oh, also make sure that your source is not already annotated -- in that case it overrides the EntityRecognizer. A pre-annotated source has multiple entities in the same spans list of one JSON object, so when you set hashes on it you get a single task id for the whole sentence -- whereas the EntityRecognizer only puts one entity in the spans list, even when there are multiple entities.
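To make the difference concrete, here are simplified, hypothetical tasks (the text and offsets are made up for illustration):

# pre-annotated source: one task, all entities in a single spans list,
# so set_hashes produces one task id for the whole sentence
pre_annotated = {
    "text": "Emerson passed to Emerson.",
    "spans": [{"start": 0, "end": 7}, {"start": 18, "end": 25}],
}

# EntityRecognizer output: one task per entity, same text
task_1 = {"text": "Emerson passed to Emerson.", "spans": [{"start": 0, "end": 7}]}
task_2 = {"text": "Emerson passed to Emerson.", "spans": [{"start": 18, "end": 25}]}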


WE'RE IN BUSINESS.

@mumud123: thanks so much for chiming in. What worked: switching around filter_duplicates! It was filtering on the input hash, and I needed to filter on the task hash.
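For anyone following along, the change in el_recipe.py boils down to:

# before: keeps only one task per input text, collapsing multiple entities
stream = filter_duplicates(stream, by_input=True, by_task=False)
# after: keeps one task per (text, span) combination
stream = filter_duplicates(stream, by_input=False, by_task=True)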

Your insight into the EntityRecognizer was also helpful. It wasn't clear to me that it iterated through and surfaced each ent (I thought it was just the NER from the loaded nlp model), and I was halfway down the road of just using the upstream model's labeled spans because I didn't see them all highlighted at once -- and because I would only link one entity and then move on to the next sentence, it wasn't clear that the model was picking up all the ents and serving them to Prodigy through the recipe. Finally, I started out not using a sentencizer, and that probably confused things more at the beginning...

The sorting mechanism you note would likely work, but I wasn't able to sit more than five minutes waiting for the corpus stream to exhaust itself into the sorted list you shared. I can handle jumping around the sentence on successive linkings.
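(A compromise that avoids exhausting the whole stream would be to sort within fixed-size windows. A minimal sketch -- sorted_windows is a hypothetical helper, and it assumes tasks for the same sentence arrive close together:)

from itertools import islice

def sorted_windows(stream, size=100):
    # sort within fixed-size windows instead of the whole corpus,
    # so Prodigy can start serving tasks immediately
    stream = iter(stream)
    while True:
        window = list(islice(stream, size))
        if not window:
            break
        yield from sorted(window, key=lambda eg: eg["_input_hash"])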

Thanks again, @mumud123 !


Great to hear you got it working, and thanks @mumud123 for helping out!

I'll have a look into the shuffling behaviour; I agree it's a bit unintuitive.


@SofieVL Sorry for the noisy notifications on a Friday! Credit to you all for building the community, and thanks again to @mumud123 for the suggestions!


@SofieVL I am encountering the same problem as @mumud123: I want to train an Entity Linking system with separate annotations of multiple entities within the same text, but in the nlp pipeline I get the famous:

RuntimeError: [E188] Could not match the gold entity links to entities in the doc - make sure the gold EL data refers to valid results of the named entity recognizer
I am using a custom-trained NER, trained in spaCy v2, that is used in both the annotation and training steps. I would like to keep using spaCy v2 because I also use Prodigy for the annotation.

Do you have any tips on how I could write a pipeline that doesn't assume one entity per data sample, as @mumud123 stated?

Thanks in advance for your answer!

@jbbosman - you can use Prodigy nightly with spaCy v3. I also started with v2 but switched to v3; it took some getting used to, but I think it was worthwhile.


Thank you for the quick reply! I've asked my company and they will apply for the nightly release.

May I ask how you approached writing a pipeline that does not assume one entity per data sample, as you wrote in one of your answers above?

Hi all,

So, for the combination of spaCy v2 and Prodigy v1.10: you can try out some sample code here: https://github.com/explosion/projects/tree/master/nel-emerson

First, create the KB with the method create_kb from https://github.com/explosion/projects/blob/master/nel-emerson/scripts/el_tutorial.py. Then, run Prodigy like this:

prodigy entity_linker.manual your_out_db prodigy/emerson_input_text.txt output/my_nlp/ output/my_kb input/entities.csv -F scripts/el_recipe.py

You'll see that the sentences are fed into Prodigy in the same order as they are sourced from emerson_input_text.txt.

Now, importantly, there was indeed an issue with this line from el_recipe.py (as pointed out by @adamkgoldfarb):

    stream = filter_duplicates(stream, by_input=True, by_task=False)

If you have multiple entities per input sentence/doc, this should be:

    stream = filter_duplicates(stream, by_input=False, by_task=True)

I'll adjust the code online accordingly.
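The distinction matters because the input hash is computed from the incoming text only, while the task hash also covers the annotations (spans, options, etc.). A minimal sketch, assuming Prodigy's default hashing keys (the text and offsets are illustrative):

from prodigy import set_hashes

eg1 = set_hashes({"text": "Emerson passed to Emerson.", "spans": [{"start": 0, "end": 7}]})
eg2 = set_hashes({"text": "Emerson passed to Emerson.", "spans": [{"start": 18, "end": 25}]})

assert eg1["_input_hash"] == eg2["_input_hash"]  # same input text
assert eg1["_task_hash"] != eg2["_task_hash"]    # different spans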

To test this, you could try changing the fourth sentence in the emerson_input_text.txt from

Emerson scored one goal during the tournament, in Brazil's 7–0 win in its opening group match against Venezuela.

to

Emerson scored one goal during the tournament, in Brazil's 7–0 win in its opening group match against Emerson.

If you run Prodigy again (specify a different output db to start from scratch), you'll see that the sentences are still in order, and the fourth sentence is presented twice, once for each "Emerson" example.

Hope this helps. If anyone still experiences trouble with v2 and Prodigy 1.10, please provide the exact recipe, KB creation code, etc., so that I can try to replicate (here or in a new issue -- feel free to ping me)!

With respect to spaCy v3 and Prodigy v1.11:

(I have tried adding shuffle = False, which I thought would do the same thing, but it did not work for me, which was what I was talking about above, and maybe the team can look into this?)

I haven't been able to replicate this yet. When I start from similar code as in https://github.com/explosion/projects/blob/v3/tutorials/nel_emerson/scripts/el_recipe.py, it does present the examples in order.

One thing I can imagine, though, is that there's a difference between using an entity_ruler and actually running a proper NER. When the latter is in the pipeline, I'm guessing that perhaps Prodigy is internally sorting the entities on confidence value.

Again, happy to look into this further if you can provide more details on the recipe and nlp pipeline so I can reproduce it!

@SofieVL Let me know if this belongs as an issue in the el_recipe repo or the discussion board:

I recall from that repo that the Entity Linker and KB require a model with static vectors.

I've created a KB with a model that uses roberta-base as the transformer drop-in for tok2vec and implements user hooks to replace the .vector methods, storing 768-d vectors for entity descriptions. The NER component has better recall using trf instead of static vectors, so I'm thinking of using trf upstream to capture a larger proportion of entities for later linking, upping the chance that we find a good match.
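In case the setup is unclear, the user-hooks part looks roughly like this (trf_doc_vector and the pooling strategy are just illustrative; the exact tensor shapes depend on the pipeline):

import numpy as np

def trf_doc_vector(doc):
    # illustrative: mean-pool the transformer's wordpiece states into a
    # single 768-d vector (assumes tensors[0] is (1, n_wordpieces, 768))
    return np.asarray(doc._.trf_data.tensors[0]).mean(axis=(0, 1))

doc = nlp("Some entity description")
doc.user_hooks["vector"] = trf_doc_vector
vector = doc.vector  # now 768-d, derived from the transformer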

Before I go too far down the road of trying to train an EL model with trf upstream, are there details of the Entity Linker that you know of that would cause this not to work? Does the EL model rely on the static vectors in a way that's not immediately obvious? If it calls a pipeline with user hooks reimplementing the .vector method on docs, will that solve any compatibility issues?

I'm definitely going to try training a large vectors EL model as well, but am curious about the trf model, so wondering if you know of any dragons I should be aware of!

Thanks as always,
Adam

Hi Adam,

The projects repo doesn't have an issue tracker, but yes, if you have general EL-usage questions, those are probably better asked on the spaCy discussion board.

Anyway, to answer your question: I think your approach is sound and should work. The entity descriptions are indeed the most important bit there, and it sounds like you've managed those. Then the entity_linker's model refers to a Tok2Vec implementation that hopefully should just work with a Transformer. I'd be interested in hearing the results of that!
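Concretely, that would mean swapping the tok2vec block of the entity_linker model in the training config for a listener. An untested sketch, assuming a transformer component named "transformer" is in the pipeline:

[components.entity_linker.model]
@architectures = "spacy.EntityLinker.v1"
nO = null

[components.entity_linker.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers": "reduce_mean.v1"}
upstream = "transformer"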


Hello @SofieVL!

Thank you for your reply! I brought my data into the format you suggested, and right now I am facing another kind of problem with the Example object and the training loop. Here is an example of the link offsets that I have:

gold_dict = {'links': {(137, 143): {'ORG-REGNUM-News-Agency': 1.0},
                       (368, 376): {'PRO-Samaa-TV': 1.0},
                       (89, 98): {'PER-Asia-Bibi': 1.0},
                       (260, 270): {'PER-Asia-Bibi': 1.0},
                       (413, 422): {'PER-Asia-Bibi': 1.0},
                       (505, 514): {'PER-Asia-Bibi': 1.0},
                       (390, 396): {'PER-Shehryar-Khan-Afridi': 1.0},
                       (650, 665): {'ORG-Supreme-Court-of-Pakistan': 1.0},
                       (530, 555): {'ORG-Supreme-Court-of-Pakistan': 1.0},
                       (313, 328): {'ORG-Supreme-Court-of-Pakistan': 1.0},
                       (63, 76): {'ORG-Ministry-of-the-interior-Pakistan': 1.0},
                       (167, 176): {'GPE-Pakistan': 1.0},
                       (177, 191): {'PER-Shehryar-Khan-Afridi': 1.0}}}

So I am giving the Example object this gold_dict, but for some reason it doesn't recognize the links. When I print the Example, I see that the 'links' dict remains empty. For this reason I get an error that states: "[E981] The offsets of the annotations for links could not be aligned to token boundaries."
What am I doing wrong here?

As for token boundaries, I am sure that the entity links correspond to proper token boundaries, because I previously trained an NER model with the same token boundaries and it worked perfectly.

Thanks again,
Aleksandra

Hi Aleksandra,

Sorry, I should have been more clear (and we should update the docs, too). When providing gold links for training, the gold data should also include the entities that the links refer to. Additionally, gold sentences are required as well. Unfortunately this makes the annotation a bit verbose, but it hopefully shouldn't be too big a problem to create this programmatically. Something like this:

from spacy.training import Example

doc = nlp("Russ Cochran his reprints include EC Comics.")
gold_dict = {"entities": [(0, 12, "PERSON")],
             "links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}},
             "sent_starts": [1, -1, -1, -1, -1, -1, -1, -1]}
example = Example.from_dict(doc, gold_dict)

Instead of "gold" sentences, you can also run a sentencizer orso on the "gold" reference doc instead:

example.reference = nlp.get_pipe("sentencizer")(example.reference)
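A quick sanity check that the boundaries were actually set (illustrative):

print([token.is_sent_start for token in example.reference])
# should now print explicit True/False values instead of None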

Do you have some example code of how to train the entity linker in spaCy v3 if you already have an annotated dataset?

You can check out the example NEL project here: https://github.com/explosion/projects/tree/v3/tutorials/nel_emerson, and more specifically the config file that was used for training in v3: https://github.com/explosion/projects/blob/v3/tutorials/nel_emerson/configs/nel.cfg

The project.yml file documents all the steps needed. More information on spaCy projects can be found here: https://spacy.io/usage/projects
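With spaCy v3 installed, running the whole project should boil down to something like this (assuming the workflows defined in its project.yml):

python -m spacy project clone tutorials/nel_emerson
cd nel_emerson
python -m spacy project assets
python -m spacy project run all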

Hope that helps!

Yes, exactly -- I was using an NER model, not an entity_ruler.

Hi @SofieVL,
Thank you a lot for your response! Right now I am at the point where the entity offsets and links look fine in the Example object; the only remaining problem is the sentences. Unfortunately, nlp_ru.get_pipe("sentencizer")(example.reference) doesn't work as it is supposed to: instead of getting "sent_starts": [1, -1, -1, -1, -1, -1, -1, -1], I get the given text back. Therefore, while training I get a ValueError which states that I have problems with a Parser in the pipeline:

/usr/local/lib/python3.7/dist-packages/spacy/language.py in update(self, examples, _, drop, sgd, losses, component_cfg, exclude)
   1110             if name in exclude or not hasattr(proc, "update"):
   1111                 continue
-> 1112             proc.update(examples, sgd=None, losses=losses, **component_cfg[name])
   1113         if sgd not in (None, False):
   1114             for name, proc in self.pipeline:

/usr/local/lib/python3.7/dist-packages/spacy/pipeline/transition_parser.pyx in spacy.pipeline.transition_parser.Parser.update()

/usr/local/lib/python3.7/dist-packages/spacy/pipeline/transition_parser.pyx in spacy.pipeline.transition_parser.Parser._init_gold_batch()

/usr/local/lib/python3.7/dist-packages/spacy/pipeline/_parser_internals/transition_system.pyx in spacy.pipeline._parser_internals.transition_system.TransitionSystem.get_oracle_sequence_from_state()

/usr/local/lib/python3.7/dist-packages/spacy/pipeline/_parser_internals/ner.pyx in spacy.pipeline._parser_internals.ner.BiluoPushDown.set_costs()

This happens although I disabled all other pipes in the pipeline except "entity_linker" and "sentencizer". Do I get this error because sentence boundary detection doesn't work properly? Is there a way to avoid this error?

P.S. There are several sentences in my doc object, but they all are recognized as one sentence. I can see it when I print my Example object: 'SENT_START': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Thank you for your help!

Best,
Aleksandra

Hi Aleksandra,

That shouldn't happen. Could you provide a minimal code snippet that I can use to reproduce this?

@SofieVL Yes, sure.
Here is the info about the spaCy version that I use:

================ Info about spaCy ===================

spaCy version    3.0.6                         
Location         /usr/local/lib/python3.7/dist-packages/spacy
Platform         Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
Python version   3.7.10                        
Pipelines        en_core_web_lg (3.0.0), ru_core_news_lg (3.0.0)

Here is one of the elements from the train dataset:

[(ru-136
  ru
  2018-11-03
  https://minval.az/news/123836756
  Адвокат обвиненной в богохульстве христианки сбежал из Пакистана
  
  Адвокат христианки Азии Биби, оправданной Верховным судом Пакистана по обвинению в богохульстве, покинул страну из-за страха за свою жизнь, сообщает »Би-би-си». Он отметил, что уехал для того, чтобы иметь возможность и дальше защищать интересы Азии Биби. Также сообщается, что пакистанские власти запретили женщине покидать страну, чтобы положить конец массовым протестам из-за ее освобождения. Ранее Верховный суд Пакистана оправдал Азию Биби, которая была обвинена в богохульстве и приговорена к смертной казни. Она провела в заключении восемь лет. Это решение вызвало многотысячные протесты в Пакистане.,
  {'entities': [(139, 148, 'PER'),
    (364, 373, 'PER'),
    (554, 563, 'PER'),
    (270, 278, 'ORG'),
    (521, 544, 'ORG'),
    (162, 187, 'ORG'),
    (109, 118, 'LOC'),
    (716, 725, 'LOC')],
   'links': {(109, 118): {'GPE-Pakistan': 1.0},
    (139, 148): {'PER-Asia-Bibi': 1.0},
    (162, 187): {'ORG-Supreme-Court-of-Pakistan': 1.0},
    (270, 278): {'ORG-BBC-Ltd': 1.0},
    (364, 373): {'PER-Asia-Bibi': 1.0},
    (521, 544): {'ORG-Supreme-Court-of-Pakistan': 1.0},
    (554, 563): {'PER-Asia-Bibi': 1.0},
    (716, 725): {'GPE-Pakistan': 1.0}}})]

Here is the info about the pipeline:

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7f970671e050>), ('morphologizer', <spacy.pipeline.morphologizer.Morphologizer object at 0x7f9706612950>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7f9706760d70>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x7f9706760ec0>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7f97065ae370>), ('lemmatizer', <spacy.lang.ru.lemmatizer.RussianLemmatizer object at 0x7f9706590af0>), ('sentencizer', <spacy.pipeline.sentencizer.Sentencizer object at 0x7f9706845460>), ('entity_linker', <spacy.pipeline.entity_linker.EntityLinker object at 0x7f96efc51fb0>)]

And here is my training loop:

other_pipes = [pipe for pipe in nlp_ru.pipe_names if pipe != "entity_linker" and "sentencizer"]

with nlp_ru.disable_pipes(*other_pipes):   # train only the entity_linker
    optimizer = nlp_ru.create_optimizer()
    for itn in range(100):
        random.shuffle(train_dataset)
        for raw_text, entity_offsets in train_dataset:
            example = Example.from_dict(raw_text, entity_offsets)
            print(example)
            example.reference = nlp_ru.get_pipe("sentencizer")(example.reference)
            nlp.update([example], sgd=optimizer)

Is this information sufficient, or do you also need some info about the knowledge base?

Best regards,
Aleksandra