How to modify labels/entities in default models (en, en_core_web_lg, etc) and retrain

Dear all,

I’ve just started playing around with spaCy/Prodigy. The documentation is nice, but there doesn’t seem to be an obvious way to modify the default labels/entities. I wish to start with one of the default English models and fine-tune it for the kind of texts I will be processing. Specifically, I need to keep some of the default labels (DATE, GPE, etc.) and add others (COMMODITY, AGENT, etc.).

  1. How could I do this?

I also have a lot of text where these new labels/entities appear, so I need to highlight them using Prodigy.

  2. What would the workflow look like?

EDIT 1:

I tried to follow the example in the “French NER” thread to initialize a model with no labels for English, but it’s not working.

This is what I did:

python3 -m spacy init-model en ./lang/en_vectors_comm --vectors ~/Downloads/crawl-300d-2M.vec
Reading vectors from /home/user/Downloads/crawl-300d-2M.vec
Open loc
1999995it [03:14, 10274.76it/s]
Creating model…
0it [00:00, ?it/s]

Sucessfully compiled vocab
1999715 entries, 1999995 vectors

Unfortunately, it fails when trying to create the gold standard:

prodigy ner.make-gold my_dataset ./lang/en_vectors_comm/ ~/Shared/my_corpus.jsonl --label ~/Documents/my_labels
Using 7 labels from /home/user/Documents/my_labels


  ✨  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

12:31:28 - Exception when serving /get_questions
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/waitress/channel.py", line 338, in service
    task.service()
  File "/usr/local/lib/python3.6/dist-packages/waitress/task.py", line 169, in service
    self.execute()
  File "/usr/local/lib/python3.6/dist-packages/waitress/task.py", line 399, in execute
    app_iter = self.channel.server.application(env, start_response)
  File "/usr/local/lib/python3.6/dist-packages/hug/api.py", line 423, in api_auto_instantiate
    return module.__hug_wsgi__(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/falcon/api.py", line 244, in __call__
    responder(req, resp, **params)
  File "/usr/local/lib/python3.6/dist-packages/hug/interface.py", line 793, in __call__
    raise exception
  File "/usr/local/lib/python3.6/dist-packages/hug/interface.py", line 766, in __call__
    self.render_content(self.call_function(input_parameters), context, request, response, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/hug/interface.py", line 703, in call_function
    return self.interface(**parameters)
  File "/usr/local/lib/python3.6/dist-packages/hug/interface.py", line 100, in __call__
    return __hug_internal_self._function(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/prodigy/app.py", line 105, in get_questions
    tasks = controller.get_questions()
  File "cython_src/prodigy/core.pyx", line 109, in prodigy.core.Controller.get_questions
  File "cython_src/prodigy/components/feeds.pyx", line 56, in prodigy.components.feeds.SharedFeed.get_questions
  File "cython_src/prodigy/components/feeds.pyx", line 61, in prodigy.components.feeds.SharedFeed.get_next_batch
  File "cython_src/prodigy/components/feeds.pyx", line 130, in prodigy.components.feeds.SessionFeed.get_session_stream
  File "/home/user/.local/lib/python3.6/site-packages/toolz/itertoolz.py", line 368, in first
    return next(iter(seq))
  File "/usr/local/lib/python3.6/dist-packages/prodigy/recipes/ner.py", line 209, in make_tasks
    for doc, eg in nlp.pipe(texts, as_tuples=True):
  File "/home/user/.local/lib/python3.6/site-packages/spacy/language.py", line 548, in pipe
    for doc, context in izip(docs, contexts):
  File "/home/user/.local/lib/python3.6/site-packages/spacy/language.py", line 572, in pipe
    for doc in docs:
  File "/home/user/.local/lib/python3.6/site-packages/spacy/language.py", line 551, in <genexpr>
    docs = (self.make_doc(text) for text in texts)
  File "/home/user/.local/lib/python3.6/site-packages/spacy/language.py", line 544, in <genexpr>
    texts = (tc[0] for tc in text_context1)
  File "/usr/local/lib/python3.6/dist-packages/prodigy/recipes/ner.py", line 208, in <genexpr>
    texts = ((eg['text'], eg) for eg in stream)
  File "cython_src/prodigy/components/preprocess.pyx", line 118, in add_tokens
  File "cython_src/prodigy/components/preprocess.pyx", line 42, in split_sentences
  File "doc.pyx", line 535, in __get__
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.

Dear founders, Ines and Matthew, can you please help me out here?

Still on Christmas holidays, so we're not always online – happy holidays btw! :gift:

There are basically two possible paths here:

  1. Add more entity types to the existing model, e.g. start with a pre-trained model and update it with examples of the new entity types and some examples of the existing entity types (to prevent the model from "forgetting" them). In Prodigy, you could use a recipe like ner.make-gold to correct the model's predictions and add your new entity types manually. (There's a short sketch of this right after the list.)
  2. Start off with a blank model / a blank entity recognizer and train it from scratch with examples of all entity types you're interested in. In Prodigy, you could start with ner.manual and label everything from scratch, or use ner.teach or ner.match with patterns that describe the entities, to make it easier to get over the cold start problem and label faster by accepting/rejecting. By the way, you could also use ner.make-gold here with the labels you want to keep (faster, because the model will highlight them and you only need to correct the entities), add your new labels, and then train a new model from scratch.
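
To make option 1 a bit more concrete, here's a minimal sketch using spaCy v2's training API. The example texts, offsets and output path are made up for illustration – in practice, you'd use the annotations you've collected with Prodigy:

import random
import spacy

# Made-up training examples that mix the new COMMODITY label with
# examples of existing labels (DATE, GPE) to prevent "forgetting"
TRAIN_DATA = [
    ("They shipped crude oil in March 2018.",
     {"entities": [(13, 22, "COMMODITY"), (26, 36, "DATE")]}),
    ("Wheat prices rose in Australia.",
     {"entities": [(0, 5, "COMMODITY"), (21, 30, "GPE")]}),
]

nlp = spacy.load("en_core_web_lg")
ner = nlp.get_pipe("ner")
ner.add_label("COMMODITY")  # register the new entity type

# Disable the other components so only the NER weights are updated
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()
    for _ in range(10):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.35)

nlp.to_disk("./my_updated_model")  # hypothetical output path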

You might have to try both approaches to find out what works best for your use case. If your new categories overlap with categories the model previously predicted, or if you want to train a lot of new stuff in general, it's often not worth it to mess with the pre-trained models. You might have to change pretty much all the weights to teach it the new definitions, and you might end up with all kinds of confusing side-effects due to the existing weights. So it's often easier to start with a blank model and fresh annotations.
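
If you do go the blank-model route, the setup itself is only a few lines – a sketch, reusing the labels from your post:

import spacy

# Blank English pipeline with a fresh, empty entity recognizer
nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for label in ("DATE", "GPE", "COMMODITY", "AGENT"):
    ner.add_label(label)
optimizer = nlp.begin_training()
# ... then train with nlp.update() on your annotations, as in the sketch above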

Have you tried the solution suggested in the error message? As the message says, the model needs to be able to split sentences, but it currently doesn't set sentence boundaries (because it has no parser and no other component for sentence boundary detection).

The recipe you're running will split the text into sentences (unless you're running it with --unsegmented), but the model you're loading in can't do this, so spaCy complains. To fix this, you can add the sentencizer, a pipeline component that does simple rule-based sentence segmentation. Just make sure you add it before you save out the model:

sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)

Hi Ines,

Merry Christmas and Happy Holidays!

Thanks for the quick answer. I didn’t realize that I had to modify the recipe to break down the text into sentences. I’m still learning spaCy/Prodigy. Would you recommend creating a new recipe based on make-gold?

Sorry if this was confusing – but no, you don’t have to modify the recipe! By default, Prodigy will split text into sentences wherever possible, but you can set the --unsegmented argument when calling ner.make-gold on the command line to turn this off. (If you do this, you need to make sure that your texts aren’t too long, though – otherwise, you might run into performance issues.)
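
For example, with the command from your post:

prodigy ner.make-gold my_dataset ./lang/en_vectors_comm/ ~/Shared/my_corpus.jsonl --label ~/Documents/my_labels --unsegmented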

Alternatively, you can also add the sentencizer component to your model so that the model is able to set sentence boundaries (see my code snippet above). You can do this in the code you use to create and export the model, just before calling nlp.to_disk.
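
In context, that export step might look like this – a sketch, assuming the vectors model you created with init-model above:

import spacy

nlp = spacy.load("./lang/en_vectors_comm")    # the model created with init-model
nlp.add_pipe(nlp.create_pipe("sentencizer"))  # enables sentence boundary detection
nlp.to_disk("./lang/en_vectors_comm")         # save it out again before annotating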

Hello, I have added the "sentencizer" but I still get this error. What should I do?

This is my code:
import string

from nltk.corpus import stopwords
from spacy.lang.en import English
from tqdm import tqdm

STOP_WORDS = stopwords.words('english')
nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))

def normalize(text):
    # Process the text; returns a list where each element is a sentence string
    text = text.lower().strip()
    doc = nlp(text)  # doc now has the pipeline's attributes and methods
    filtered_sentences = []
    for sentence in tqdm(doc.sents):
        filtered_tokens = []
        # Process each token in the sentence: lowercase it, drop punctuation
        # and stop words, and replace ',' with '.'
        for i, w in enumerate(sentence):
            s = w.string.strip()
            # string.punctuation contains all punctuation characters
            if len(s) == 0 or s in string.punctuation and i < len(doc) - 1:
                continue
            if s not in STOP_WORDS:
                s = s.replace(',', '.')
                filtered_tokens.append(s)
        filtered_sentences.append(' '.join(filtered_tokens))
    return filtered_sentences

spaCy version issue

Changing spaCy from 2.1.3 to 2.1.0 worked.
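
In case it helps anyone else hitting this, the downgrade would be something like:

pip install spacy==2.1.0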