Custom multi-word NER model pipeline

Hello,

I have a use case that I thought I had under control, but I just ran ner.batch-train and did not get the results I was expecting. I'd like to talk through my pipeline and see if anyone has ideas about where it's going wrong.

High-level overview:
I am trying to train a custom NER model to recognize job titles. I will probably expand this later to include another new entity type, but for now I just want to get it working on one. I have a large list of job titles that I want to use as patterns and a large collection of emails that I can treat as training data.

Step-by-step:
First, I am taking my .txt file of job titles and converting it to a .jsonl pattern file with my new entity name. Here is an example:

{"label": "TITLE", "pattern": "Manager, International Tax"}

These range in length from 1 token to 10 tokens (since that is the maximum spaCy will accept when parsing them down the line). This file is provided_titles.jsonl.
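For reference, the conversion itself only takes a few lines of Python. This is a sketch that assumes a hypothetical titles.txt with one title per line, and uses spaCy's tokenizer to enforce the 10-token limit:

import json
import spacy

nlp = spacy.blank("en")  # we only need the tokenizer here, no trained pipes

with open("titles.txt") as infile, open("provided_titles.jsonl", "w") as outfile:
    for line in infile:
        title = line.strip()
        # skip blank lines and titles over the 10-token pattern limit
        if not title or len(nlp(title)) > 10:
            continue
        outfile.write(json.dumps({"label": "TITLE", "pattern": title}) + "\n")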

Now I want to generate some more "TITLE" annotations. I create a new dataset called new_title_terms and use ner.manual to go through my collection of emails (data.txt) and create new annotations:

prodigy ner.manual new_title_terms en_core_web_lg data.txt --label TITLE

I'm now combining these multi-token title patterns with the ones I already have to create a new patterns file. This is possible thanks to @Stephan and his terms.manual-to-patterns code here.

prodigy terms.manual-to-patterns new_title_terms manual_title_patterns.jsonl
cat provided_titles.jsonl manual_title_patterns.jsonl > all_title_patterns.jsonl
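(For anyone following along: the recipe essentially reads the accepted spans back out of the dataset and writes each one as a pattern. A rough sketch of the idea, not @Stephan's exact code:)

import json
from prodigy.components.db import connect

db = connect()

with open("manual_title_patterns.jsonl", "w") as outfile:
    for eg in db.get_dataset("new_title_terms"):
        if eg.get("answer") != "accept":
            continue
        for span in eg.get("spans", []):
            pattern = {"label": span["label"],
                       "pattern": eg["text"][span["start"]:span["end"]]}
            outfile.write(json.dumps(pattern) + "\n")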

I then create a new dataset training_title_terms.

prodigy dataset training_title_terms

This is where I start getting less confident that I'm taking the right approach. My understanding is that I need to reject incorrect examples in ner.teach instead of just accepting the things that are correct - otherwise I'll get good recall but horrible precision since the model will learn to label everything as a TITLE.

Now I run the below code to start augmenting the en_core_web_lg model with examples so it can learn TITLE:

prodigy ner.teach training_title_terms en_core_web_lg data.txt --label TITLE --patterns all_title_patterns.jsonl

I go through and accept/reject about 2,500 examples. The majority of these are rejected, so I do have imbalanced classes here. Now I run ner.batch-train to retrain and export the model.

prodigy ner.batch-train training_title_terms en_core_web_lg --output-model ./model --eval-split .2 --label TITLE

This achieved about 50% accuracy, which I would consider suboptimal. Here are the things that I think might be giving me bad results:

  • Imbalance between "accept" and "reject" examples in ner.teach. Is this relevant here? Since the majority of the spans the model is labeling based on all_title_patterns.jsonl are wrong, I would have to skip most suggestions during teaching to get these balanced out.
  • Using the en_core_web_lg model instead of a new blank model. I'd like to have one model that can identify all the standard spaCy entities plus a couple that I teach it, but maybe this requires separate models instead of a pre-trained one with enhancements.

I'll be very grateful for any suggestions you have on improving this pipeline. This is the result of reading through a lot of Support topics and documentation, but clearly there's still something here I'm not catching. Thanks for your help!

Thanks for the detailed description. It sounds like you’re on the right track, and hopefully we’ll be able to get the model working. On the other hand, it could be that this is simply a problem that the model struggles with. It sounds like the phrases that make up your job title category might be quite diverse, which combined with the class imbalance could make the category difficult to learn.

I would definitely try basing the model off en_vectors_web_lg instead of en_core_web_lg. What you’re defining as an “entity” here is pretty fundamentally different from what the model has been trained to recognise. It’s learned to recognise definite descriptors: names of people, places, events etc. Your phrases are really not names, so you’re starting out with a model that’s really pretty certain these phrases are supposed to be labelled O. It’ll be really hard to get the model to reconcile what it knows about the other entity categories with what you’re trying to teach it about this new phrase type.
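In spaCy v2 terms, starting from en_vectors_web_lg just means taking the vectors package and adding a fresh NER pipe on top, which is roughly what ner.batch-train does for you when you pass it that model. A simplified sketch, not Prodigy's actual internals:

import spacy

nlp = spacy.load("en_vectors_web_lg")  # vocab + word vectors, no trained pipes

# a blank NER component starts with no learned bias towards labelling
# these phrases O, unlike the NER in en_core_web_lg
ner = nlp.create_pipe("ner")
ner.add_label("TITLE")
nlp.add_pipe(ner)

optimizer = nlp.begin_training()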

At runtime you should be able to have your model running before the built-in NER component, and in theory things should work. There were a couple of bugs in the v2.0.x releases that have been fixed in the forthcoming v2.1 release. So, it’s possible you’ll hit a temporary snag – but I think it should work.
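As a sketch of what that composition could look like (./model stands in for whatever ner.batch-train exports, and sharing components between two loaded models can need some care with the vocab):

import spacy

nlp = spacy.load("en_core_web_lg")
custom = spacy.load("./model")  # the model exported by ner.batch-train

# run the custom NER first, so TITLE spans are already set when the
# built-in NER runs; existing doc.ents act as constraints on it, which
# is exactly the area where the v2.0.x bugs lived
nlp.add_pipe(custom.get_pipe("ner"), name="title_ner", before="ner")

doc = nlp("Please forward this to our Senior Manager, International Tax.")
print([(ent.text, ent.label_) for ent in doc.ents])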

Anyway. You can try running ner.batch-train with the en_vectors_web_lg model to see how you go. It might make a big improvement. You can also try creating some fully annotated examples, which will let you use the --no-missing flag in ner.batch-train, which further improves accuracy. I’ll wait to see how you go with the en_vectors_web_lg model before elaborating on that though.
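(To illustrate the format: a fully annotated example is just a task where every entity in the text is covered by a span, e.g. the following line, with made-up text and character offsets:)

{"text": "Jane Doe, Senior Manager, International Tax", "spans": [{"start": 10, "end": 43, "label": "TITLE"}], "answer": "accept"}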

Thanks a lot for your help, @honnibal. I've tried en_vectors_web_lg with ner.batch-train, and unfortunately it didn't make much of a difference in performance (still achieving maximum accuracy of under 60%). I tried this with annotations created by ner.teach using patterns in the format I described above, e.g.:

{"label": "TITLE", "pattern": "Senior Manager, International Tax"}

As well as in a new format that I thought should make it easier for models to learn these new entities:

{"label": "TITLE", "pattern": "Senior Manager"}
{"label": "DEPT", "pattern": "International Tax"}

There didn't seem to be a real difference in ner.batch-train performance between these two annotation styles and my previous approach when running them with en_vectors_web_lg. I have a couple of questions about your suggested approach to make sure I'm understanding your guidance correctly.

I would definitely try basing the model off en_vectors_web_lg instead of en_core_web_lg

Should I take this to mean that you would just use en_vectors_web_lg during the ner.batch-train step, or are there earlier steps where you recommend using it? In the pipeline I'm currently working with, that could mean using it with ner.manual (which I guess really just uses the model as a tokenizer, so the choice probably doesn't matter) or with ner.teach (which will error out since en_vectors_web_lg doesn't have an NER component). I'm assuming en_vectors_web_lg doesn't need to be used in those steps, but now I'm wondering what model I should use.

If the model choice in those steps will not impact the eventual output of ner.batch-train and I can just use en_core_web_lg for them, then I'm not sure where this is going wrong now. I have added more annotations and re-run the pipeline, and it consistently achieves around 70% accuracy if I batch-train the TITLE label alone, and lower if I train just DEPT or both at once. While this is an improvement, it doesn't feel like much of one given the time I have spent generating new patterns with ner.manual and annotations with ner.teach.

One thing I've noticed that I think might be important is that ner.teach is frequently suggesting a phrase with a label and then parts of that same phrase with the same label. For example, it has labeled the span Associate Buyer with TITLE and then shown the same string of text with just the Buyer span labeled TITLE. These both show up in my all_title_patterns.jsonl file since both are valid titles. (This comes up with other phrases as well - Sourcing Manager and Manager, for example.) I have been handling these in ner.teach by accepting when the entire correct phrase is labeled and rejecting when only a part of it is (Associate Buyer is correct if that's the full title; Buyer is incorrect if it's only part of the title and correct if it's the full title). I assume this is the right way to handle this, but I want to verify, since ner.teach doesn't seem to be suggesting multi-word TITLE and DEPT phrases that aren't in the patterns file.
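If I understand correctly, this is just how the pattern matching works: I believe string patterns like these go through spaCy's PhraseMatcher, and overlapping patterns each produce their own match, so both spans get queued as candidates. A small sketch (made-up sentence) that reproduces it:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_lg")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("TITLE", None, nlp("Associate Buyer"), nlp("Buyer"))

doc = nlp("She was just promoted to Associate Buyer.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
# prints both "Associate Buyer" and "Buyer"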

Does this seem like a use case for --no-missing, and does that sound like the right way to handle partially matched multi-word entities in ner.teach?

Thank you.