Add new labels to an existing dataset as new data is received, and retrain the NER model

hi @Vishal112!

That's a bit of a tricky question.

One of the first questions you should ask isn't really an ML question but a business question: how often do you expect new NER labels to appear? And when the label set changes, how will your annotation guidelines be updated so that your annotators have clear definitions of the new labels?

If you have a model in production, I would caution against expecting to add labels and retrain many times on an ad hoc (irregularly occurring) basis (this is fine if you're still in model development, though). The reason is that it can become very difficult to track changes, so you should agree with your model's users (stakeholders) on fixed times to add new labels (e.g., once a month).

Along those same lines, what's important (and sometimes forgotten) is to ensure you have a clear definition of what you're labeling, through explicit annotation guidelines. These can be as simple as definitions of your entities. One great example is from the Guardian, who published an article about how they used Prodigy to train an NER model:

They published their code and their annotation guidelines. As they discussed, it's important to hold routine discussions between your modelers and business stakeholders to keep those guidelines up to date.

The best example of this philosophy is in Matt's 2018 talk, where he discusses successful ways of defining the business problem, the need for clear annotation guidelines, and the role an iterative approach plays in separating successful from unsuccessful ML projects:

Now, I suspect you're more interested in how to actually update your NER labels and retrain the model. There are a lot of past Prodigy Support posts and documentation pages that can help:

  • First, I'd start with our NER workflow. We're hoping to update it very soon, as some of the Prodigy recipe names have changed (e.g., train now instead of batch.train), but the main ideas still hold. An important decision is whether you're updating a pre-trained model (e.g., en_core_web_sm), adding new entity types to it, or starting a new model from scratch. What's important: we recommend that if you're adding three or more new entity types, you're likely better off starting from scratch.

  • If you have some prior knowledge about your new entity (e.g., related terms), you should also consider adding match patterns to help bootstrap your entities. This will make labeling easier, as matched spans come pre-highlighted.

  • If you are starting from scratch and simply want to create a workflow for a few new entities, I like my teammate @ljvmiranda921's advice in this post, which addresses a question very similar to yours:

  • It's also good to be aware that there are ways to create "nested" NER labels:
  • Last, it's important to know that Prodigy wasn't designed to create labels on the fly: the label set is fixed for each annotation session. You can definitely add new labels between annotation sessions, but I want to make sure you understand that Prodigy isn't designed to do this within a session.
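To make the retraining point in the first bullet concrete, here's what annotated examples look like in Prodigy's JSONL task format once a new entity type has been labeled. This is a minimal sketch: the PRODUCT label and the example text are made up, but the shape (one JSON object per line, with character-offset spans) is the format Prodigy datasets use.

```python
import json

# Sketch of an annotated NER task in Prodigy's JSONL format.
# The "PRODUCT" label and the text are hypothetical examples.
examples = [
    {
        "text": "I ordered the Acme Widget yesterday.",
        "spans": [{"start": 14, "end": 25, "label": "PRODUCT"}],
        "answer": "accept",
    }
]

# Serialize one task per line, as in a Prodigy dataset export.
jsonl = "\n".join(json.dumps(eg) for eg in examples)
print(jsonl)
```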
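On the match-patterns point: a patterns file is just one JSON object per line, each with a label and a pattern. Here's a stdlib-only sketch that builds one; the INGREDIENT label and the terms are hypothetical, chosen only to illustrate the token-based and exact-string pattern forms.

```python
import json

# Sketch of match patterns for a hypothetical new INGREDIENT entity.
patterns = [
    # Token-based patterns, matched case-insensitively on the lowercase form:
    {"label": "INGREDIENT", "pattern": [{"lower": "turmeric"}]},
    {"label": "INGREDIENT", "pattern": [{"lower": "soy"}, {"lower": "sauce"}]},
    # An exact-string pattern:
    {"label": "INGREDIENT", "pattern": "za'atar"},
]

# One JSON object per line, as a patterns file expects.
patterns_jsonl = "\n".join(json.dumps(p) for p in patterns)
print(patterns_jsonl)
```

You'd save this output to a file such as patterns.jsonl and pass it with --patterns to a recipe like ner.manual, so matching spans come pre-highlighted during annotation.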
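And on the fixed-label-set point in the last bullet: the label set is something you provide when you start a session (e.g., as a comma-separated --label argument), so adding a label means stopping the server, extending the list, and starting a new session. A trivial sketch of that bookkeeping, with hypothetical label names:

```python
# Labels used in the first annotation session (e.g., --label PERSON,ORG).
session_one = ["PERSON", "ORG"]

# Between sessions, extend the set rather than changing labels mid-session.
session_two = session_one + ["PRODUCT"]

# The comma-separated value you'd pass to --label when restarting Prodigy.
label_arg = ",".join(session_two)
print(label_arg)  # PERSON,ORG,PRODUCT
```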

Be sure to continue searching on Prodigy Support. There may be several other relevant posts; I found these after only a quick search.

I hope this helps! Let us know if you have further questions.