Hello,
I have a use case that I thought I had under control, but I just used ner.batch-train and did not get the results I was expecting. I'd like to talk through my pipeline and see if anyone has ideas about where it's going wrong.
High-level overview:
I am trying to train a custom NER model to recognize job titles. I will probably expand this later to include another new entity type, but for now I just want to get it working on one. I have a large list of job titles that I want to use as patterns and a large collection of emails that I can treat as training data.
Step-by-step:
First, I am taking my .txt file of job titles and converting it to a .jsonl
pattern file with my new entity name. Here is an example:
{"label": "TITLE", "pattern": "Manager, International Tax"}
These range in length from 1 token to 10 tokens (since that is the max that spaCy will accept when parsing these down the line). This file is provided_titles.jsonl.
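For reference, the conversion was just a short Python script along these lines (titles.txt stands in for my actual input file):

import json

# Turn one job title per line into a Prodigy match pattern.
with open("titles.txt", encoding="utf-8") as f_in, \
        open("provided_titles.jsonl", "w", encoding="utf-8") as f_out:
    for line in f_in:
        title = line.strip()
        if title:  # skip blank lines
            f_out.write(json.dumps({"label": "TITLE", "pattern": title}) + "\n")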
Now I want to generate some more "TITLE" annotations. I create a new dataset called new_title_terms and use ner.manual to go through my collection of emails (data.txt) and create new annotations:
prodigy ner.manual new_title_terms en_core_web_lg data.txt --label TITLE
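In case the data format matters: as far as I can tell, an accepted annotation ends up in the dataset looking roughly like this (invented example; start/end are character offsets, token_start/token_end are token indices):

{"text": "Please welcome our new Tax Manager.", "spans": [{"start": 23, "end": 34, "token_start": 4, "token_end": 5, "label": "TITLE"}], "answer": "accept"}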
I'm now combining these multi-token title patterns with the ones I already have to create a new patterns file. This is possible thanks to @Stephan and his terms.manual-to-patterns code here.
prodigy terms.manual-to-patterns new_title_terms manual_title_patterns.jsonl
cat provided_titles.jsonl manual_title_patterns.jsonl > all_title_patterns.jsonl
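One thing I wasn't sure about: the cat step can introduce exact duplicates, so I also ran a quick dedupe script along these lines (no idea whether it affects training, I just wanted the file clean):

seen = set()
unique = []
with open("all_title_patterns.jsonl", encoding="utf-8") as f:
    for line in f:
        # Keep the first occurrence of each pattern line, drop exact repeats.
        if line.strip() and line not in seen:
            seen.add(line)
            unique.append(line)
with open("all_title_patterns.jsonl", "w", encoding="utf-8") as f:
    f.writelines(unique)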
I then create a new dataset, training_title_terms.
prodigy dataset training_title_terms
This is where I start getting less confident that I'm taking the right approach. My understanding is that I need to reject incorrect examples in ner.teach instead of just accepting the things that are correct; otherwise I'll get good recall but horrible precision, since the model will learn to label everything as a TITLE.
Now I run the command below to start augmenting the en_core_web_lg model with examples so it can learn TITLE:
prodigy ner.teach training_title_terms en_core_web_lg data.txt --label TITLE --patterns all_title_patterns.jsonl
I go through and accept/reject about 2,500 examples. The majority of these are rejected, so I do have imbalanced classes here. Now I run ner.batch-train to retrain and export the model:
prodigy ner.batch-train training_title_terms en_core_web_lg --output-model ./model --eval-split .2 --label TITLE
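To rule out a problem with the export itself, I spot-checked the saved model directly in spaCy, roughly like this (example sentence invented):

import spacy

nlp = spacy.load("./model")  # the directory written by ner.batch-train
doc = nlp("Please welcome Jane Doe, our new Manager, International Tax.")
print([(ent.text, ent.label_) for ent in doc.ents])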
The training run reported about 50% accuracy, which I would consider suboptimal. Here are the things that I think might be giving me bad results:
- Imbalance between "accept" and "reject" examples in ner.teach. Is this relevant here? Since the majority of the spans the model labels based on all_title_patterns.jsonl are wrong, I would have to skip most suggestions during teaching to balance these out.
- Using the en_core_web_lg model instead of a new blank model. I'd like to have one model that can identify all the standard spaCy entities plus a couple that I teach it, but maybe this requires separate models instead of a pre-trained one with enhancements (see the sketch after this list for what I mean).
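To clarify that last point, here's a sketch of what I mean by enhancing the pre-trained model on the spaCy side (just illustrating the idea; I understand Prodigy handles this internally during training):

import spacy

# Load the pre-trained pipeline and register the new entity type
# on its existing NER component, keeping PERSON, ORG, etc.
nlp = spacy.load("en_core_web_lg")
ner = nlp.get_pipe("ner")
ner.add_label("TITLE")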
I'll be very grateful for any suggestions you have on improving this pipeline. This is the result of reading through a lot of Support topics and documentation, but clearly there's still something here I'm not catching. Thanks for your help!