Training a few new entities: very poor results

Hi All,

I am pretty new to Prodigy, AI and LLMs. Trying again after a few years now.

I am trying to train and build a domain-specific model (recruitment). I have annotated about 20K records and trained on them, but the accuracy is very bad. Any suggestions on what I can do to improve it?

Here is what I have done. I have identified that I need about 13 NER entity types, which are not present in the standard en_core_web_lg model.

Since annotating 13 entities in a single document is a tedious job, I plan to do that separately for each entity.

  1. Converted all the documents to JSONL format - about 25K records in one file
  2. Started with the en_core_web_lg model to annotate:
    python -m prodigy ner.correct ner_person en_core_web_lg data.jsonl --label PERSON --unsegmented
  3. Trained the model:
    prodigy train mymodel --ner ner_person
  4. After the first training run, I annotated with the newly trained model:
    prodigy ner.correct ner_org mymodel/model-best/ data.jsonl --label ORG &
  5. Then trained the model again for ORG, and then annotated again with another entity.

Now, if I use this model (mymodel) in my project, the results are not good.

Am I missing some important step? Or is there something I can do to improve the whole process?

Thanks.

Hi @tushar ,

If I understand correctly, you are re-training the same model every time you add a new NER dataset.
Training entities sequentially like this can be problematic because:

  1. the model may forget previously learned patterns when trained on new entities (aka catastrophic forgetting)
  2. the model might have conflicting annotations across different training rounds (e.g., if "John Smith Technologies" was labeled as PERSON in one dataset and ORG in another)
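
To make the second point concrete, such a conflict in Prodigy's JSONL span format could look like this (hypothetical records, the first coming from the PERSON dataset and the second from the ORG dataset):

    {"text": "John Smith Technologies hired five engineers.", "spans": [{"start": 0, "end": 10, "label": "PERSON"}]}
    {"text": "John Smith Technologies hired five engineers.", "spans": [{"start": 0, "end": 23, "label": "ORG"}]}

When the datasets are merged for training, these overlapping spans have to be reconciled one way or another.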

Also, it's not entirely clear to me what the advantage is of using a pretrained spaCy pipeline if the NER categories you're after are not covered by the pretrained model. If you use this model as the base model in training and there's an overlap between your added annotations and the pretrained model's annotations, that could also lead to catastrophic forgetting.

It's totally fine to collect annotations for different entities separately, but you should train your final model with annotations for all categories present and any potential conflicts resolved (the train and data-to-spacy recipes will take care of merging the datasets and will also resolve potentially overlapping spans by choosing the longest one).
You should also make sure that your final dataset is balanced in that each category is well represented.
It's often useful to print some general stats on how many instances of each label you have.
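
For example, a quick way to get per-label counts from your Prodigy datasets is a small script along these lines (just a sketch; the dataset names here are placeholders, so adjust them to whatever you used):

    from collections import Counter
    from prodigy.components.db import connect

    db = connect()  # connects to the Prodigy database configured for your installation
    counts = Counter()
    for dataset in ["ner_person", "ner_org"]:  # your dataset names
        for eg in db.get_dataset(dataset):
            # only count spans from examples that were accepted during annotation
            if eg.get("answer") == "accept":
                for span in eg.get("spans", []):
                    counts[span["label"]] += 1
    print(counts)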

In summary, the recommended procedure would be:

  1. annotate your 13 NER entities with ner.manual (as I mentioned above, I'm not sure what the advantage of using a pretrained spaCy pipeline is for the ner.correct workflow) - see the example command after this list
  2. train your first model with all the NER datasets
  3. evaluate, for example using our open-source plugin Prodigy Evaluate
  4. run ner.correct with the model trained in step 2 to improve further
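
For step 1, the command could look something like this (the dataset and label names are just examples; a blank English pipeline is enough since it's only used for tokenization):

    python -m prodigy ner.manual ner_course blank:en data.jsonl --label COURSE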

If you already annotated all your categories, just try retraining the model with all the NER datasets present, i.e. starting from step 2 above with your current NER datasets.
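
Assuming your datasets are called ner_person, ner_org and so on (adjust the names and add the rest of your datasets to the comma-separated list), training on all of them at once would look like:

    python -m prodigy train ./mymodel --ner ner_person,ner_org --eval-split 0.2

Here --eval-split holds back 20% of the examples for evaluation if you don't have dedicated evaluation datasets.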


Hi @magdaaniol ,

Thanks a ton for such a detailed answer with explanation. Helped me a lot.

I'll try to answer the questions here and describe the challenges I am still facing.

  1. Using an existing model: out of the 13 entities, two - PERSON and ORG - were already covered. Since they are names, I thought of correcting the existing predictions.
  2. I have trained three entities so far and saved them to the datasets: ner_per, ner_org, ner_course.
    Now the challenge is that if I train a single model using all three datasets, the results are extremely bad.
    However, if I train a separate model on each dataset and load the three models as separate instances in my program, the results are far better.
    Could this be because of the balance of representation of data points in each dataset?

Thanks again for your help.

Hi @tushar!

Okay, yes - that makes perfect sense. You can use the spaCy pretrained NER model to annotate these two labels for you and use the corrected dataset for training your custom model.

As for the poor performance of the combined model vs the individual models:

It could indeed be related to the balance of data points.
If you have significantly different amounts of training data for each entity type (e.g., many person names but few course names), the model might become biased toward the majority class during joint training. In separate models, each entity type gets its full attention without competing with others.

Furthermore, as we already discussed, some tokens might represent different entity types depending on context (e.g., "Stanford" could be both an organization and a course). When training separate models, each one specializes in its specific context patterns. In a combined model, these ambiguous cases might create conflicting training signals.

Finally, a joint model needs to learn a more complex decision boundary to distinguish between all entity types simultaneously. Separate models only need to learn binary decisions (is this entity type X or not?), which is often an easier task.

In any case it sounds like your categories are rather well separated semantically.
One thing I would do is to run some analytics on the annotated datasets to check the distribution of each category.
Second, you could export your data to spaCy's DocBin format using the data-to-spacy command and then run spacy debug data on it to see if there are any structural issues with the dataset (example commands below).
Finally, you would have to do some error analysis of your model predictions to see if there are any patterns there.
As already mentioned, the Prodigy Evaluate plugin has some tools to help with model evaluation.
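
The export and debug steps could look something like this (a sketch using your current dataset names; adjust the output directory and paths as needed):

    python -m prodigy data-to-spacy ./corpus --ner ner_per,ner_org,ner_course --eval-split 0.2
    python -m spacy debug data ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy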

Could you share the number of labels for each entity type in the joint dataset? That would help confirm if data imbalance is the primary issue.