ner.make-gold with en_vectors_web_lg model

Hello,

I can’t get the “ner.make-gold” command to work with the en_vectors_web_lg model or with a custom model trained on top of en_vectors_web_lg.

Other models like en_core_web_lg work perfectly, and en_vectors_web_lg itself works fine with other commands such as ner.teach.

Here is the error I got:

    Traceback (most recent call last):
      File "/home/debian/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/home/debian/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/home/debian/anaconda3/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
        controller = recipe(*args, use_plac=True)
      File "cython_src/prodigy/core.pyx", line 178, in prodigy.core.recipe.recipe_decorator.recipe_proxy
      File "cython_src/prodigy/core.pyx", line 55, in prodigy.core.Controller.__init__
      File "/home/debian/anaconda3/lib/python3.6/site-packages/toolz/itertoolz.py", line 368, in first
        return next(iter(seq))
      File "cython_src/prodigy/core.pyx", line 84, in iter_tasks
      File "/home/debian/anaconda3/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 209, in make_tasks
        for doc, eg in nlp.pipe(texts, as_tuples=True):
      File "/home/debian/anaconda3/lib/python3.6/site-packages/spacy/language.py", line 554, in pipe
        for doc, context in izip(docs, contexts):
      File "/home/debian/anaconda3/lib/python3.6/site-packages/spacy/language.py", line 578, in pipe
        for doc in docs:
      File "nn_parser.pyx", line 367, in pipe
      File "cytoolz/itertoolz.pyx", line 1047, in cytoolz.itertoolz.partition_all.__next__
      File "/home/debian/anaconda3/lib/python3.6/site-packages/spacy/language.py", line 557, in <genexpr>
        docs = (self.make_doc(text) for text in texts)
      File "/home/debian/anaconda3/lib/python3.6/site-packages/spacy/language.py", line 550, in <genexpr>
        texts = (tc[0] for tc in text_context1)
      File "/home/debian/anaconda3/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 208, in <genexpr>
        texts = ((eg['text'], eg) for eg in stream)
      File "cython_src/prodigy/components/preprocess.pyx", line 106, in add_tokens
      File "cython_src/prodigy/components/preprocess.pyx", line 40, in split_sentences
      File "doc.pyx", line 528, in __get__
    ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.

Thank you for your help!

Hi! If I read this correctly, the problem here is that you’re using a vectors-only model which doesn’t have an NER component. Prodigy and spaCy should definitely fail more gracefully here!

The main purpose of ner.make-gold is to check and correct the model’s existing predictions – but the vectors model doesn’t have a trained NER component to make any predictions with. Other models do, which is why they work without problems – and the terms.teach recipe only looks at the vectors and vocab, which the vectors model does provide.
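
A quick way to see the difference is to compare the pipeline components each model loads with – something like this (the exact component names can vary by model version):

import spacy

print(spacy.load('en_core_web_lg').pipe_names)     # e.g. ['tagger', 'parser', 'ner']
print(spacy.load('en_vectors_web_lg').pipe_names)  # [] – vectors and vocab only, no trained components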

If you do want to use the large vectors, you should be able to add blank components and then save out the model:

import spacy

nlp = spacy.load('en_vectors_web_lg')
ner = nlp.create_pipe('ner')          # blank NER component
sbd = nlp.create_pipe('sentencizer')  # sentence boundaries, just in case
nlp.add_pipe(sbd)
nlp.add_pipe(ner)
nlp.begin_training()                  # initialize the weights of the new components
nlp.to_disk('/path/to/model')

You can then use the model path in Prodigy. Keep in mind that this will still make working with the ner.make-gold recipe difficult, because the model won’t know anything yet. So you either want to start with ner.manual and label everything from scratch, or use ner.teach with --patterns to get over the cold-start problem.
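
If you go the --patterns route, the patterns file is just JSONL with a label and a match pattern per line. Here’s a rough sketch of how you could write one from Python – the label and terms are made-up examples, not anything specific to your data:

import json

# made-up example patterns, one JSON object per line
patterns = [
    {"label": "PRODUCT", "pattern": [{"lower": "iphone"}]},  # token-based pattern
    {"label": "PRODUCT", "pattern": "Google Pixel"},         # exact string match
]
with open("patterns.jsonl", "w", encoding="utf8") as f:
    for entry in patterns:
        f.write(json.dumps(entry) + "\n")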

Thank you for your answer, Ines.

I had already trained my vectors model to recognize 3 NER labels, and it worked fine following the training tutorial in spaCy’s documentation, which adds the ner component to the pipeline.

Adding the sentencizer fixed the issue.
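
In case it helps anyone else, it boiled down to something like this – the paths are just placeholders for my custom model:

import spacy

# load the custom model trained on top of en_vectors_web_lg (placeholder path)
nlp = spacy.load('/path/to/custom-ner-model')
sbd = nlp.create_pipe('sentencizer')
nlp.add_pipe(sbd, first=True)  # set sentence boundaries before the other components run
nlp.to_disk('/path/to/custom-ner-model-with-sbd')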

Thank you again for the quick support.