That's already very helpful, thanks Aleksandra. The training loop you've posted certainly seems to disable the correct pipes. Is this the exact code you ran when you got that error message from the parser? Because after nlp_ru.disable_pipes(*other_pipes), the parser shouldn't actually be in the pipeline anymore. You could double check this by printing out nlp_ru.pipeline before and after that with statement.
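For instance, a quick sanity check could look something like this (a sketch, assuming nlp_ru is your loaded pipeline):

print(nlp_ru.pipe_names)  # all components, before disabling
other_pipes = [pipe for pipe in nlp_ru.pipe_names if pipe not in ("entity_linker", "sentencizer")]
with nlp_ru.disable_pipes(*other_pipes):
    print(nlp_ru.pipe_names)  # only the components that are still enabled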
You don't have to provide all the details of your knowledge base, but can you show a bit of how your pipeline was constructed? Was it just the pretrained model from spaCy, with a few additional pipes added? Which components did you (re)define?
Thank you again for answering! Yes, this is exactly the code that I have been using throughout. As you suggested, I printed nlp_ru.pipeline before and after that with statement. The line that disables the other pipes works correctly: before the training loop, all components are still in the pipeline, whereas when I print it inside the training loop, only [('sentencizer', <spacy.pipeline.sentencizer.Sentencizer object at 0x7fab3efaf7d0>), ('entity_linker', <spacy.pipeline.entity_linker.EntityLinker object at 0x7fab1fab56b0>)] remain. Therefore I really don't understand where the error comes from when the parser component is disabled.
As for pipeline construction:
Yes, I only use spaCy's pretrained model for Russian and add two additional pipes to it, namely a sentencizer and an entity_linker:
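Roughly like this (the model name here is just an example, and the knowledge base setup is omitted):

import spacy

nlp_ru = spacy.load("ru_core_news_sm")  # pretrained Russian pipeline
nlp_ru.add_pipe("sentencizer", first=True)
entity_linker = nlp_ru.add_pipe("entity_linker", last=True)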
Aleksandra: Inspecting your original error message once more, it looks like you're getting this error from the ner, not the parser. Both use the transition_parser internally, but your error seems to be related to ner.pyx specifically.
Still, from your code and from your output, it looks like the ner component is disabled as well.
One more idea that comes to mind: is the training data somehow generated with the ner component? Do you use the ner anywhere at all? To be absolutely certain, you could try the following:
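For example, something along these lines (a sketch; it removes the ner component from the pipeline entirely instead of just disabling it):

if "ner" in nlp_ru.pipe_names:
    nlp_ru.remove_pipe("ner")
print(nlp_ru.pipe_names)  # ner should no longer be listed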
Answering your question: I haven't used the ner component of the pipeline at all. I already had annotated data, so I didn't need to annotate anything myself.
However, I am dealing with another problem right now. When I use my old training loop, namely this one:
other_pipes = [pipe for pipe in nlp_ru.pipe_names if pipe != "entity_linker" and pipe != "sentencizer"]
with nlp_ru.disable_pipes(*other_pipes):  # train only the entity_linker
    print(nlp_ru.pipeline)
    optimizer = entity_linker.create_optimizer()
    for itn in range(10):
        random.shuffle(train_dataset)
        for raw_text, entity_offsets in train_dataset:
            print(raw_text)
            print(entity_offsets)
            example = Example.from_dict(raw_text, entity_offsets)
            print(example)
            example.reference = nlp_ru.get_pipe("sentencizer")(example.reference)
            entity_linker.update([example], sgd=optimizer)
I am getting the following error message: KeyError: "Parameter 'E' for model 'hashembed' has not been allocated yet."
I found this discussion thread and tried not disabling the "tok2vec" component, as suggested there, but unfortunately got the same error message.
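Concretely, I kept tok2vec in the enabled set, roughly like this:

other_pipes = [pipe for pipe in nlp_ru.pipe_names
               if pipe not in ("entity_linker", "sentencizer", "tok2vec")]
with nlp_ru.disable_pipes(*other_pipes):
    ...  # same training loop as above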
I also tried to simplify the training loop and used this instead:
other_pipes = [pipe for pipe in nlp_ru.pipe_names if pipe != "entity_linker" and pipe != "sentencizer"]
with nlp_ru.disable_pipes(*other_pipes):
    print(nlp_ru.pipeline)
    optimizer = nlp_ru.create_optimizer()
    examples = []
    random.shuffle(train_dataset)
    for text, annots in train_dataset:
        try:
            examples.append(Example.from_dict(text, annots))
        except:
            pass
    losses = nlp_ru.update(examples, sgd=optimizer)
With this loop I get the error about sentence boundaries again: ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: 'nlp.add_pipe('sentencizer')'. Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting 'doc[i].is_sent_start'.
This happens even though the "sentencizer" component was added to the pipeline earlier, and when I print the pipeline inside the training loop I can see that it is active.
Maybe the problem is that I am creating the optimizer incorrectly?
Hi Aleksandra, the "Sentence boundaries unset" message may show up when your training data doesn't contain sentence boundaries (as detailed in a post a bit higher up in this thread), but it's difficult to say for sure without seeing the code that assembles the pipeline and some actual sample data.
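If that's the case, one way to work around it is to make sure the reference docs carry sentence boundaries before calling update, e.g. by running the sentencizer over them (a sketch, reusing the examples list from your loop):

sentencizer = nlp_ru.get_pipe("sentencizer")
for example in examples:
    example.reference = sentencizer(example.reference)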
KeyError: "Parameter 'E' for model 'hashembed' has not been allocated yet."
This type of error usually means that a component in the pipeline wasn't initialized properly. Typically, you wouldn't call create_optimizer() directly; instead you'd call initialize() or resume_training(). Again, it's difficult to judge exactly what is going on without a more extended code snippet or the full stack trace, but perhaps your entity_linker hasn't been initialized?
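For fine-tuning on top of a pretrained pipeline, the usual pattern looks roughly like this (a sketch):

# resume training from the pretrained weights; this also returns an optimizer
optimizer = nlp_ru.resume_training()
losses = nlp_ru.update(examples, sgd=optimizer)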
To be honest, it's getting kind of confusing to follow up on this because of the various different issues you're highlighting. I think it would be helpful to try and isolate one problem at a time, share a reproducible code snippet, and then I can help debug what is going on. It'll usually just be a small detail somewhere, as custom training loops can go wrong in subtle ways. This is also why we recommend the training config system in spaCy 3, as it takes care of a lot of these details for you behind the scenes.
Considering that this whole discussion is not really Prodigy-related anymore, could you consider moving it to the spaCy discussion board and opening a focused topic per issue that you run into? That will help us put it in the right context, especially if there is some time in between posts (which is fine, of course). A full reproducible code snippet and some sample data will definitely speed up our debugging efforts.