Pre-trained model vs training a model from scratch?


My team is using Prodigy to train a Named Entity Recognizer, and we have a couple of questions about using pre-trained models for NER. Just for reference, we have a few classes which are subclasses of the entity labels in the spaCy pre-trained models (e.g. country, city and state) and a few which are not included in any of the base models. We were wondering:

  1. What should we consider when deciding to train a new model vs using a pre-trained model? We’re aware that the training corpus used to train the base model should generalize well to the text we’re currently performing classification on, but how close of a match does it need to be? And what other considerations may come into play?

  2. How does Prodigy put a pre-trained model into the loop (what’s going on under the hood)?

  3. If we load a pre-trained model and train a NER with additional entities/entities which are not included in the base model, how does the pre-trained model come into play?





Hi Jen,

As a bit of background, the key reason training on top of the pre-trained models is useful is that we can’t distribute the original training data to you. So, if you can easily get the OntoNotes 5 corpus, that will avoid this decision. However, a commercial license to OntoNotes 5 requires commercial LDC membership, which costs $25,000. Because of this, it’s often worth exploring approximations.

Imagine the limit case, where you had one example to add to the pre-trained model. We can add this example by making a single weight update. However, we still might not get this example right. What to do? If we just iterate over it until we do get it right, we’re solving for a new objective – “get this one example right”.

A better solution is often to parse a bunch of your text with the pre-trained model, and update with that as well. The intention is to avoid changing the prior behavior of the model.

If the pre-trained model only has weak opinions about the new examples being added, it’s probably possible to just update the pre-trained model, and find a solution that accommodates them without changing the behavior too much. However, if the model is confidently wrong about them, the new updates will require more extensive changes, so it’ll be important to at least represent the previous behavior, so that the new objective makes sense.

Unfortunately for you, splitting an entity type into subtypes is likely to require a lot of changes to the weights. The model will have to learn to reclassify examples it was confident about.

In summary I think you’ll probably want to retrain. However, I think you can probably use the pre-trained model to help you quickly boot-strap the annotation data to do that.

This depends on the recipe. You can read the source of the recipe functions; they’re provided in your Prodigy installation. As a quick summary:

  • ner.manual: Doesn’t use the model, only the tokenizer.

  • ner.make-gold: Predicts the most likely entities on the text, and displays them for you to correct.

  • ner.teach: Uses beam-search to find a beam of k-best NER analyses of the whole text. The suggestions are used to ask you binary questions. Each batch of answers is then used to update the model. The objective of the update is to assign high scores to analyses consistent with your binary feedback, and low scores to analyses inconsistent with your feedback.


Hi Matt,

Thanks for the speedy response! I have a couple more questions:

Sounds good~ Just to clarify though, what did you mean by using the pre-trained model to boot-strap the annotation data?

Also, would it make sense to load the pre-trained word vectors in the en_vectors_web_lg model and retrain the rest of the NER?

Possibly a silly question, but does this mean that if we want to use a pre-trained entity recognizer, we're limited to (or at least constrained by) selecting from the entities the model was trained on?

I’m slightly unclear about what happens when a pre-trained model is used to recognize a new entity – is it similar to what happens with a spaCy model when adding a new entity type (i.e. the final output vector is expanded by the number of additional entities, and the argmax is taken over the resulting probability distribution)? What if we’re only interested in entities which the pre-trained NER was not trained on?





Let’s say you want to split GPE into COUNTRY, CITY, MISC. You can at least pre-tag a bunch of text with the initial model, and only annotate the examples it’s labelled as GPE, initially. You could do this in the textcat interface. You probably also want to group the examples, so that you only have to annotate “America” once. If some of your phrases are ambiguous, you could flag them. Alternatively, if you do want to annotate every instance rather than every type, it’ll be efficient to order the queue so that you do all the “America” instances at once. This way you can click through quickly.

Of course, you can still have countries or cities which the model didn’t initially tag as a GPE. But doing this first step of correction will give you a lot of examples quickly, so you can get the initial model trained. Once it’s working, you can use either the ner.teach or ner.make-gold interface to fill in the missing entities.

If we’re only interested in entity types that aren’t in the initial model, there’s not much to gain from resuming training. It’s probably going to hurt more than it helps.

We can still start teaching the model entity types it wasn’t trained with, using the ner.teach interface. But to do that, you need to specify a patterns file. The patterns file will be used to start suggesting some entities of the new type. Once you’ve accepted some of these suggestions, they’re used as examples for the model, so it can start making suggestions itself.
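A patterns file is just JSONL, one pattern per line, mixing token patterns and exact phrases. The COUNTRY label, the example phrases and the file name below are illustrative, not anything fixed by Prodigy:

```python
import json

# Illustrative seed patterns for a hypothetical new COUNTRY label.
# A pattern can be a list of token attributes or a plain phrase string.
patterns = [
    {"label": "COUNTRY", "pattern": [{"lower": "france"}]},
    {"label": "COUNTRY", "pattern": "United Kingdom"},
]

# Write one JSON object per line (JSONL), the format Prodigy expects.
with open("country_patterns.jsonl", "w", encoding="utf8") as f:
    for p in patterns:
        f.write(json.dumps(p) + "\n")
```

You’d then pass the file to the recipe with the `--patterns` argument, something like `prodigy ner.teach my_dataset en_core_web_sm texts.jsonl --label COUNTRY --patterns country_patterns.jsonl`.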
