Add new labels to an existing dataset as new data is received, and retrain the NER model

Please guide me regarding this.

hi @Vishal112!

That's a bit of a tricky question.

One of the first questions you should ask isn't really an ML question but a business question: what's the expected frequency/timing for new NER labels? And when the labels change, how are the annotation guidelines updated so that your annotators have clear definitions of the new labels?

If you have a model in production, I would caution against the expectation that you can add labels and retrain many times on an ad hoc (non-regularly occurring) basis (it's fine if you're still in model development, though). The reason is that it can become very difficult to track changes, so you should likely agree with your model users (stakeholders) on fixed times to add new labels (e.g., once a month).

Along those same lines, what's important (and sometimes forgotten) is to ensure you have a clear definition of what you're labeling, through explicit annotation guidelines. This can be as simple as definitions of your entities. One great example is from the Guardian, who published an article about how they used Prodigy for an NER model:

They published their code and their annotation guidelines. As they discussed, it's important to have routine discussions between your modelers and business stakeholders so you can continually update those guidelines.

The best example of this philosophy is Matt's 2018 talk, where he covers successful ways of defining the business problem, the need for clear annotation guidelines, and the role an iterative approach plays in separating successful from unsuccessful ML projects:

Now, I suspect you're more interested in how to actually update your NER labels and retrain the model. There are a lot of past Prodigy Support posts and documentation that can help:

  • First, I'd start with our NER workflow. We're hoping to update it very soon, as some of the Prodigy recipe names have changed (e.g., train now instead of batch.train), but the main ideas still hold. An important decision is whether you're updating a pre-trained model (e.g., en_core_web_sm) with new or existing entity types, or starting a new model. Note that if you're adding 3+ new entity types, we recommend starting from scratch (see the command sketch after this list).

  • If you have some prior knowledge about your new entity types (e.g., related terms), you should also consider adding match patterns to help bootstrap your entities. This makes labeling easier, as spans matching these patterns come pre-highlighted (see the patterns sketch below).

  • If you are starting from scratch and simply want to create a workflow for a few new entities, I like my teammate @ljvmiranda921's advice in this post, which addresses a question very similar to yours:

  • It's also good to be aware that there are ways to create "nested" NER labels:
  • Last, it's important to know that Prodigy wasn't designed to create labels on the fly: the label set is fixed for each annotation session. You can definitely add new labels between annotation sessions, but Prodigy isn't designed to do this within a session.
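To make the first bullet concrete, here's a rough sketch of what the commands could look like, assuming a recent Prodigy version (v1.11+). The dataset names (emails_v1, emails_v2), file paths, and labels are all placeholders, so adjust them to your setup:

```
# Annotate the new data with the full label set (existing + new labels).
prodigy ner.manual emails_v2 blank:en ./new_emails.jsonl --label ACCOUNT,CARD,NEW_LABEL

# Train from scratch on all annotation sets (recommended when adding 3+ new types).
prodigy train ./output_model --ner emails_v1,emails_v2

# Or update a pre-trained pipeline instead of starting blank:
prodigy train ./output_model --ner emails_v1,emails_v2 --base-model en_core_web_sm
```

Because train accepts a comma-separated list of datasets, you can combine your old and new annotations at training time without re-annotating anything.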
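And for the match-patterns bullet, a patterns file is just JSONL with one pattern per line. A minimal sketch (the labels and terms here are made up for illustration):

```
# Write a small patterns file (entries are illustrative only).
cat > patterns.jsonl <<'EOF'
{"label": "CARD", "pattern": [{"lower": "credit"}, {"lower": "card"}]}
{"label": "ACCOUNT", "pattern": [{"lower": "savings"}, {"lower": "account"}]}
EOF

# Pass it to ner.manual so matching spans come pre-highlighted.
prodigy ner.manual emails_v2 blank:en ./new_emails.jsonl \
  --label CARD,ACCOUNT --patterns patterns.jsonl
```

Note that the --label list is fixed for the session you start, which is the point of the last bullet: to add a label, stop the server and restart the recipe with the extended list.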

Be sure to continue searching on Prodigy support. There may be several other posts -- I found these after a quick search.

I hope this helps and let us know if you have further questions!

I have new categories to train and don't want to train the model from scratch. I'm currently testing my model in a local environment, and the data I have already annotated is very large, so I want to skip that step and just add new labels for the new categories I received.

hi @Vishal112!

If you decide to retrain from a previously trained model (i.e., not start with blank:en), be aware that you'll likely run into catastrophic forgetting. There's no easy solution to this other than knowing that if you retrain on an imbalanced set of examples (e.g., only the new NER entities), your model may forget the old entities.
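The usual way to soften this is to keep the old entities represented in training by mixing your earlier annotations with the new ones. A minimal sketch, assuming your datasets are called old_emails and new_emails (placeholder names):

```
# Train on the old and new annotation sets together so the previously
# learned entities stay represented in the training data.
prodigy train ./retrained_model --ner old_emails,new_emails --base-model en_core_web_sm

# Alternatively, merge the datasets first and train on the combined set.
prodigy db-merge old_emails,new_emails all_emails
prodigy train ./retrained_model --ner all_emails --base-model en_core_web_sm
```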

No, this is not my concern, @ines.

Consider my scenario:
I have 2,000 emails with a subject (category/label). I annotated the data manually with the ner.manual recipe, then trained the model, and I am now using it.

Now I have another 1,000 emails with different labels/categories. I don't want to re-annotate the data I've already annotated, so what am I supposed to do?

Please guide me.

One more thing: please address this issue too. Why am I getting all the labels together?
[screenshot of the Prodigy annotation UI]

Hi @Vishal112,

If your new annotation round has different labels, why are you trying to use the model you previously created? Why not create two separate models? Is there overlap in the labels? If so, in what way?

I also suspect there's an issue with your patterns. Can you share what your patterns look like? That span is hitting your match pattern 3 (see the bottom right of the UI), so you likely have an issue with that pattern.

Is your ultimate goal an intent model for chats?

If so, why not use a text classification UI instead of an NER interface for your intent annotations?

Most intent models do have an accompanying NER model, but those NER models provide additional context to help take action on the intent. You seem to be using NER for the intent itself (i.e., whether the email is account-related, credit-card-related, etc.). If intent is your goal, the workflow could look like the sketch below.
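A rough sketch of a text-classification workflow (the dataset, file, and label names are placeholders based on your account/credit-card example):

```
# Annotate whole emails with mutually exclusive intent categories,
# rather than highlighting spans in an NER interface.
prodigy textcat.manual email_intents ./emails.jsonl \
  --label ACCOUNT,CREDIT_CARD,OTHER --exclusive

# Train a text classifier on those annotations.
prodigy train ./intent_model --textcat email_intents
```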

No, the data (email bodies) is the same, but the categories are different.