I would like to test the dep.teach recipe available in Prodigy (the use case here is training a model that infers relationships between entities). I’m starting with a custom-trained NER model, in this case for Spanish.
When trying to start an annotation session, I get the following error:
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.
Running my_ner_model.pipe_names produced the following: ['ner'].
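For reference, the fix suggested by the error message can be sketched roughly as follows. This is only a minimal sketch: a blank Spanish pipeline stands in for the custom NER model, the sample text is made up, and the version check is an assumption added because the create_pipe call quoted in the error is the spaCy v2 form, while spaCy v3 takes the component name as a string.

```python
import spacy

# Stand-in for the custom model; in practice this would be something like
# spacy.load("/path/to/my_ner_model") (hypothetical path).
nlp = spacy.blank("es")

if "sentencizer" not in nlp.pipe_names:
    if spacy.__version__.startswith("2."):
        # spaCy v2: build the component first, then add it (as in the E030 message).
        nlp.add_pipe(nlp.create_pipe("sentencizer"), first=True)
    else:
        # spaCy v3 and later: add the component by its registered name.
        nlp.add_pipe("sentencizer", first=True)

# With sentence boundaries set, doc.sents no longer raises E030.
doc = nlp("Hola mundo. Esto es una prueba.")
sentences = [sent.text for sent in doc.sents]
print(nlp.pipe_names)
print(sentences)
```

The sentencizer only sets rule-based sentence boundaries from punctuation. It does not provide the dependency labels that a parser would.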
I revised my model by adding both the sentencizer and the parser pipeline components, but I’m now getting a cryptic Segmentation fault (core dumped) error. Adding memory to my VM did not solve the issue.
Is there a step/requirement here that I’m missing? Could you point me to any relevant logs that might help with sorting this out?
The dep.teach recipe is still a bit experimental, but it should work quite well for fine-tuning the accuracy of a pre-trained parser on a new dataset, for domain adaptation.
If you’re starting from scratch, things are a fair bit harder. Annotating trees from scratch is still a lot of work, and we don’t really have a better approach than the free solutions, which you can find here: https://universaldependencies.org/tools.html
We’re going to be doing dependency annotations ourselves as well, so we’ll be working on better solutions. But for now you should annotate at least 500-1000 sentences manually, train an initial model, and then try out the dep.teach recipe to progressively improve different labels.
Thanks for your quick reply, @honnibal! A follow-up question:
The Spanish language model currently available (es_core_news_md) has a pre-trained parser. Is this enough to get started with domain-specific training?
It should be. Give it a try, and see how you go. This will only work if you’re using the same annotation scheme as the pre-trained model, though. If you’re trying to learn custom relations, you’ll need to start from scratch.