Trained Model Does Not Generalize

I have built an NER model using Prodigy. I annotated the data based on dictionaries. Now the model can predict the entities that appear in the dictionary, but it fails in two cases:

  1. Suppose the entity type is MOVIE and the model was trained with Titanic as a MOVIE. When I use Titanics / Titanicc instead of Titanic, it cannot identify it as a movie entity.

  2. Suppose Game of Thrones is not in the dictionary but other movies are. When a user mentions Game of Thrones, why doesn't the model understand that it can be a movie?

How can I tackle these two issues?

Hi @ta13,

I'm curious about (1) how many examples you have collected, (2) what kinds of entities you are training for, and (3) what kind of data you are working with (e.g. tweets, Reddit comments, etc.)? Generally, even if an NER model can generalize based on context, it still needs enough representative examples.

Usually, NER models look at the context or semantics of a word to determine its entity type. As a contrived example, when looking for MOVIE, it's possible that the NER model picks up on instances of the word "watching" (we're watching X), "theatre" (we went to the theatre for X), etc. to determine that a particular word / token is a MOVIE. Thus, to properly train an NER model, we need to provide samples that help surface that pattern. It's not simple string matching, but of course it won't hurt to have frequent entities covered explicitly in the data.
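To make that concrete, here's a minimal sketch using spaCy's v3 training API (the sentences, character offsets, and the MOVIE label are made up for illustration) of how annotated examples carry those context cues:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")

# Hypothetical annotated sentences: the same MOVIE label in varied
# contexts, so the model can learn from cues like "watching" and
# "theatre" instead of memorizing the string "Titanic".
annotated = [
    ("We watched Titanic at the theatre last night.", [(11, 18, "MOVIE")]),
    ("Titanic is my favourite film from the 90s.", [(0, 7, "MOVIE")]),
]

for text, ents in annotated:
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, {"entities": ents})
    print([(ent.text, ent.label_) for ent in example.reference.ents])
```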

So to answer your questions:

  1. Perhaps the dataset lacks generalizable instances of "Titanic." You can try data augmentation techniques (e.g. using skweak, nlpaug, etc.) to improve robustness to spelling variation; see the sketch after this list.
  2. Similar to the point above, your dataset should have enough representative examples of what you're looking for. If it has no samples pertaining to "Game of Thrones" or "GOT", then it may not be able to detect them later on. Even if the NER model looks at the context, it pays to have these frequent entities show up in your dataset. :slight_smile:
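For the misspelling issue in (1), here's a rough plain-Python sketch of character-level augmentation (libraries like nlpaug do this more thoroughly; the misspell and augment helpers below are hypothetical names of my own):

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def misspell(word):
    """Return a simple character-level variant of `word`:
    duplicate a character, drop one, or append an 's'."""
    i = rng.randrange(len(word))
    op = rng.choice(["duplicate", "drop", "append"])
    if op == "duplicate":
        return word[:i] + word[i] + word[i:]   # Titanic -> Titaanic
    if op == "drop" and len(word) > 3:
        return word[:i] + word[i + 1:]         # Titanic -> Titnic
    return word + "s"                          # Titanic -> Titanics

def augment(text, start, end, label):
    """Swap the entity span for a misspelled variant and fix offsets."""
    variant = misspell(text[start:end])
    new_text = text[:start] + variant + text[end:]
    return new_text, (start, start + len(variant), label)

print(augment("We watched Titanic at the theatre.", 11, 18, "MOVIE"))
```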

Hi

Thanks for your comment. The problem I am facing is that the model is not learning well. I have 17 entity types in total, and I pre-annotated the data with entity dictionaries, but it's hard for the model to predict non-dictionary entities. To train the model I am using prodigy train with a custom spaCy config, and I evaluate on a 20% split. The accuracy is high, around 99%, maybe because both the train and test data cover the dictionary words, but when I provide a new entity example, it's tough for the model to predict it. What is the best way to achieve this task? I need a suggestion here.

Thanks.

Hi @ta13 ,

If the model isn't generalizing, my guess is that the dataset you're training on is not representative, or that the 17 entity types are ambiguous / difficult to predict (some of them may "semantically overlap", making it hard to differentiate one from another).

If the 99% accuracy is coming from the evaluation (dev) data, then perhaps what you're testing on is not entirely representative. You might want to update your evaluation data to better represent the examples you'll see at runtime.
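One way to check whether that 99% reflects memorization rather than generalization is to build an evaluation split in which the entity strings never appear in training. A rough sketch, assuming your annotations are (text, spans) tuples (adapt to Prodigy's JSONL format as needed):

```python
import random

def split_by_entity_surface(examples, eval_frac=0.2, seed=0):
    """Hold out a fraction of entity surface forms entirely, so the
    dev set measures performance on entity strings unseen in training."""
    surfaces = sorted({text[s:e] for text, ents in examples for s, e, _ in ents})
    rng = random.Random(seed)
    held_out = set(rng.sample(surfaces, int(len(surfaces) * eval_frac)))
    train, dev = [], []
    for text, ents in examples:
        bucket = dev if any(text[s:e] in held_out for s, e, _ in ents) else train
        bucket.append((text, ents))
    return train, dev
```

If accuracy drops sharply on this split, the model is mostly matching dictionary strings rather than learning from context.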

In summary, there are two directions you can take: (1) revisit your training and evaluation data to check that they're truly representative of the data you'll see at runtime, and (2) revisit your 17 labels and the corresponding annotations to check that they're correct and consistent.
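For point (2), a quick diagnostic before re-annotating is to count how often each of the 17 labels actually occurs; labels with very few spans are natural candidates for more annotation, merging, or re-definition. A minimal sketch over the same (text, spans) format as above:

```python
from collections import Counter

def label_report(examples):
    """Print per-label span counts to spot under-represented labels."""
    counts = Counter(label for _, ents in examples for _, _, label in ents)
    for label, n in counts.most_common():
        print(f"{label:<20} {n}")
```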