That's already very helpful, thanks Aleksandra. The training loop you've posted certainly seems to disable the correct pipes. Is this the exact code you ran when you got that error message from the parser? Because after nlp_ru.disable_pipes(*other_pipes), the parser shouldn't actually be in the pipeline anymore. You could double check this by printing out nlp_ru.pipeline before and after that with statement.
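For instance, a quick sanity check could look something like this (a sketch, assuming nlp_ru is your loaded pipeline):

print(nlp_ru.pipe_names)  # all components, before disabling
other_pipes = [pipe for pipe in nlp_ru.pipe_names if pipe not in ("entity_linker", "sentencizer")]
with nlp_ru.disable_pipes(*other_pipes):
    print(nlp_ru.pipe_names)  # only the components that are still enabled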
You don't have to provide all the details of your knowledge base, but can you show a bit of how your pipeline was constructed? Was it just the pretrained model from spaCy, with a few additional pipes added? Which components did you (re)define?
Thank you again for answering! Yes, this is exactly the code that I have been using throughout. As you suggested, I printed nlp_ru.pipeline before and after that with statement. The line that disables the other pipes works correctly: before the training loop, all components are still in the pipeline, whereas when I print it inside the training loop, only [('sentencizer', <spacy.pipeline.sentencizer.Sentencizer object at 0x7fab3efaf7d0>), ('entity_linker', <spacy.pipeline.entity_linker.EntityLinker object at 0x7fab1fab56b0>)] remain. Therefore I really don't understand where the error comes from when the parser component is disabled.
As for pipeline construction:
Yes, I only use spaCy's pretrained model for Russian and add two additional pipes to it, namely a sentencizer and an entity_linker:
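Roughly like this (the model name here is just an example, and the knowledge base setup is omitted):

import spacy

nlp_ru = spacy.load("ru_core_news_sm")  # pretrained Russian pipeline
nlp_ru.add_pipe("sentencizer", first=True)
entity_linker = nlp_ru.add_pipe("entity_linker", last=True)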
Aleksandra: Inspecting your original error message once more, it looks like you're getting this error from the ner, not the parser. Both use the transition_parser internally, but your error seems to be related to ner.pyx specifically.
Still, from your code and from your output, it looks like the ner component is disabled as well.
One more idea that comes to mind: is the training data somehow generated with the ner component? Do you use the ner anywhere at all? To be absolutely certain, you could try the following:
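For example, something along these lines (a sketch; it removes the ner component from the pipeline entirely instead of just disabling it):

if "ner" in nlp_ru.pipe_names:
    nlp_ru.remove_pipe("ner")
print(nlp_ru.pipe_names)  # ner should no longer be listed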
Answering your question: I haven't used the ner component of the pipeline at all. I already had annotated data, so I didn't need to annotate anything myself.
However, I am dealing with another problem right now. When I use my old training loop, namely this one:
other_pipes = [pipe for pipe in nlp_ru.pipe_names if pipe != "entity_linker" and pipe != "sentencizer"]
with nlp_ru.disable_pipes(*other_pipes):  # train only the entity_linker
    print(nlp_ru.pipeline)
    optimizer = entity_linker.create_optimizer()
    for itn in range(10):
        random.shuffle(train_dataset)
        for raw_text, entity_offsets in train_dataset:
            print(raw_text)
            print(entity_offsets)
            example = Example.from_dict(raw_text, entity_offsets)
            print(example)
            example.reference = nlp_ru.get_pipe("sentencizer")(example.reference)
            entity_linker.update([example], sgd=optimizer)
I am getting the following error message: KeyError: "Parameter 'E' for model 'hashembed' has not been allocated yet."
I found this discussion thread and tried not disabling the "tok2vec" component, as suggested there, but unfortunately got the same error message.
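Concretely, I kept tok2vec in the enabled set, roughly like this:

other_pipes = [pipe for pipe in nlp_ru.pipe_names
               if pipe not in ("entity_linker", "sentencizer", "tok2vec")]
with nlp_ru.disable_pipes(*other_pipes):
    ...  # same training loop as above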
I also tried to simplify the training loop and used this instead:
other_pipes = [pipe for pipe in nlp_ru.pipe_names if pipe != "entity_linker" and pipe != "sentencizer"]
with nlp_ru.disable_pipes(*other_pipes):
    print(nlp_ru.pipeline)
    optimizer = nlp_ru.create_optimizer()
    examples = []
    random.shuffle(train_dataset)
    for text, annots in train_dataset:
        try:
            examples.append(Example.from_dict(text, annots))
        except:
            pass
    losses = nlp_ru.update(examples, sgd=optimizer)
With this loop I get the error about sentence boundaries again: ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: 'nlp.add_pipe('sentencizer')'. Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting 'doc[i].is_sent_start'.
This happens even though the "sentencizer" component was added to the pipeline earlier, and when I print the pipeline inside the training loop I can see that it is active.
Maybe the problem is that I am creating the optimizer incorrectly?
Hi Aleksandra, the "Sentence boundaries unset" message may show up when your training data doesn't contain sentence boundaries (as detailed in a post a bit higher up in this thread), but it's difficult to say for sure without seeing the code that assembles the pipeline and some actual sample data.
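If that's the case, one way to work around it is to make sure the reference docs carry sentence boundaries before calling update, e.g. by running the sentencizer over them (a sketch, reusing the examples list from your loop):

sentencizer = nlp_ru.get_pipe("sentencizer")
for example in examples:
    example.reference = sentencizer(example.reference)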
KeyError: "Parameter 'E' for model 'hashembed' has not been allocated yet."
This type of error usually means that a component in the pipeline wasn't initialized properly. Typically, you wouldn't call create_optimizer() directly; instead you'd call initialize() or resume_training(). Again, it's difficult to judge exactly what is going on without a more extended code snippet or the full stack trace, but perhaps your entity_linker hasn't been initialized?
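For fine-tuning on top of a pretrained pipeline, the usual pattern looks roughly like this (a sketch):

# resume training from the pretrained weights; this also returns an optimizer
optimizer = nlp_ru.resume_training()
losses = nlp_ru.update(examples, sgd=optimizer)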
To be honest, it's getting kind of confusing to follow up on this because of the various different issues you're highlighting. I think it would be helpful to try and isolate one problem at a time, share a reproducible code snippet, and then I can help debug what is going on. It'll usually just be a small detail somewhere, as custom training loops can go wrong in subtle ways. This is also why we recommend the training config system in spaCy 3, as it takes care of a lot of these details for you behind the scenes.
Considering that this whole discussion is not really Prodigy-related anymore, could you consider moving it to the spaCy discussion board and opening a focused topic per issue that you run into? That will help us put it in the right context, especially if there is some time in between posts (which is fine, of course). A full reproducible code snippet and some sample data will definitely speed up our debugging efforts.