Help with building NER for job descriptions

Hi, I recently bought Prodigy and think the design is very cool. I have since read through the docs and many of the support comments, but I still seem to be hitting walls trying to accomplish the task at hand.
I want to design a keyword/keyphrase NER model that extracts words/phrases that convey skills from job descriptions / resumes of any category and returns the relevant data.
I started by building a word_patterns.jsonl that includes both single words and multi-word phrases like the following (the file contains about 500 lines):
{"label": "SKILL", "pattern": [{"lower": "project"}, {"lower": "planning"}]}
I then ran ner.teach with a SKILL label and the patterns file on single sentences pulled from many resumes/job descriptions (should I use {"text": "sentence from resume"} or {"text": "entire resume/job desc"} as input?). As I was going through the loop, the model continued to miss phrases that I was hoping it would catch. After going through the data, I went through the same data again with ner.make-gold, hoping to correct some of the missed phrase matches. After completing this, I ran ner.teach on the same data again and it continued to miss what was caught in ner.make-gold.
I may be doing this wrong but if there is any help you are able to offer to steer me on the right track I would greatly appreciate it!
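
For reference, a patterns file in the format shown above could be generated from a plain list of phrases with something like the following. This is only a minimal sketch: the function names and the output file name are made up for illustration, and splitting on whitespace is an assumption that will not always match how spaCy tokenizes a phrase (for example "C++" or hyphenated terms).

```python
import json


def phrase_to_pattern(phrase, label="SKILL"):
    """Turn a phrase into one token-based match pattern entry."""
    # Assumption: whitespace splitting stands in for the real tokenizer.
    tokens = [{"lower": word.lower()} for word in phrase.split()]
    return {"label": label, "pattern": tokens}


def write_patterns(phrases, path, label="SKILL"):
    """Write one JSON object per line (JSONL), one line per phrase."""
    with open(path, "w", encoding="utf8") as f:
        for phrase in phrases:
            f.write(json.dumps(phrase_to_pattern(phrase, label)) + "\n")


if __name__ == "__main__":
    # Hypothetical phrases and file name, for illustration only.
    write_patterns(["Project Planning", "Java"], "word_patterns.jsonl")
```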


I think you’re missing a training step here. You need to call ner.batch-train with the dataset you created using ner.make-gold.

Another thing is that you might be starting off with a model that’s got an initialization that’s not helping you. Your definition of “entity” is completely different from what the default NER model thinks of as an “entity”, so you might be fighting the initial model in your training. Try using en_vectors_web_lg as your initial model, so that you can train a model from scratch. If the size of the en_vectors_web_lg makes your experiments unwieldy, you can cut it down to a handier size like this: python -c "import en_vectors_web_lg; nlp = en_vectors_web_lg.load(); nlp.vocab.prune_vectors(20000); nlp.to_disk('./en_vectors_web_md')"

Finally, note that “skill” in resumes is a pretty slippery category. Once you get everything working, you might find you need to revise your category definitions a bit to make sure you’re working with a category you can annotate very consistently. You want something with very bright lines around what is and isn’t the category of interest. There will always be edge cases, but you want the edge cases to be very rare. If there are too many common cases that are borderline, the model will struggle.

Hi Honnibal,
I just want to thank you for your reply, your customer service for Prodigy is top-notch, definitely adds a ton of value to your product!
So just to be sure I am understanding correctly,
1.) Create a new dataset
2.) Start with my jsonl file of ~500 patterns, using en_vectors_web_lg as the base model for ner.make-gold
3.) Create a working model with my updated dataset using ner.batch-train
4.) If I want to improve this, instead of calling en_vectors_web_lg, I use the model I just created with ner.batch-train for the next iteration of sample datasets.
5.) Basically rinse and repeat steps 3 & 4 until the model is working sufficiently
Does that sound correct?

Lastly, rather than throwing everything together under one NER label like “SKILL” (since covering all job categories creates an extremely broad range of variations), would you recommend making multiple NER labels for different categories such as IT, MEDICAL, LAW, etc., or possibly something like SOFTSKILL (“communication”, “collaboration”) and HARDSKILL (“Java”, “Python”, “C++”)?

Thanks again!

Yes, exactly. The main idea here is that you want to get over the cold-start problem (where the model knows nothing), pre-train it so it predicts something and then use the existing model's prediction to collect better annotations to improve it, update the improved model with more examples, and so on.
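
The annotate-then-train loop can be sketched as a small script that only builds the commands for each round, without running them. Everything named here is an assumption for illustration: the dataset name, the source file names and the output paths are hypothetical, and the sketch continues training from the previous round's output, which is one possible reading of step 4 rather than something settled in this thread.

```python
import shlex


def plan_rounds(dataset, first_model, sources, label="SKILL"):
    """Build (annotate, train) command pairs, one pair per annotation round."""
    model = first_model
    plan = []
    for round_no, source in enumerate(sources, start=1):
        output_dir = f"./skill-model-{round_no}"  # hypothetical output path
        # Annotate a new batch with whatever model we currently have.
        annotate = ["prodigy", "ner.make-gold", dataset, model, source,
                    "--label", label]
        # Train on the dataset and save the result for the next round.
        train = ["prodigy", "ner.batch-train", dataset, model,
                 "--output", output_dir, "--label", label]
        plan.append((annotate, train))
        model = output_dir  # the next round starts from the model just trained
    return plan


if __name__ == "__main__":
    rounds = plan_rounds("skills_gold", "en_vectors_web_lg",
                         ["batch1.jsonl", "batch2.jsonl"])
    for annotate, train in rounds:
        print(shlex.join(annotate))
        print(shlex.join(train))
```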

You might have to experiment with a few different approaches to find out what works best. Maybe it makes sense to start off with annotating a few hundred examples by hand to give the model something to learn from, and then move on to binary annotation. Maybe it works best to use match patterns to suggest candidates in context and accept/reject them. Hopefully Prodigy makes it easy to run those experiments and quickly try things out.

I think you might want to generalise even one step further and label things like PROGRAMMING_LANGUAGE or PROGRAM. Predicting those basic "categories of things" based on the local context is often easier than trying to encode too much subtle information at once.

If you've trained a model that's good at predicting things like programming languages, software/tools and other things people might put on their CVs, you can then move on to the next step and decide whether those entities are in fact skills – e.g. by looking at their position in the document.
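
As a rough illustration of the "position in the document" idea, the sketch below keeps only the entities that appear under a skills-like heading. It is a simplified sketch under stated assumptions: the heading names, the rule that a heading is any line ending in a colon, and the (start, end, label) tuple format are all made up for this example and would need adapting to real data.

```python
import re

# Hypothetical heading names; adjust to the headings your documents use.
SKILL_HEADINGS = re.compile(r"^(skills|requirements|qualifications)\b",
                            re.IGNORECASE)


def entities_in_skill_sections(text, entities):
    """Keep (start, end, label) entities that fall under a skills-like heading."""
    sections = []  # (line_start, line_end, is_skill_section)
    offset = 0
    in_skill_section = False
    for line in text.splitlines(keepends=True):
        stripped = line.strip()
        # Assumption: a heading is any line that ends with a colon.
        if stripped.endswith(":"):
            in_skill_section = bool(SKILL_HEADINGS.match(stripped))
        sections.append((offset, offset + len(line), in_skill_section))
        offset += len(line)

    kept = []
    for start, end, label in entities:
        for line_start, line_end, is_skill in sections:
            if line_start <= start < line_end and is_skill:
                kept.append((start, end, label))
                break
    return kept
```

With this rule, a "Python" mention under "Requirements:" would be kept, while the same word in an "About us:" paragraph would be dropped.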

I discussed some ideas for an information extraction project on company reports in this thread - maybe some of these could be relevant to your project as well:

Btw, if you've browsed the forum, you might have seen this already – but if not, I'd definitely recommend checking out @honnibal's talk on solving different NLP problems and designing label schemes. The example around 11:38 is especially relevant:

Thanks for your response. So I built up a label in an NER model that currently works satisfactorily for one category (say, computer science). I haven’t trained it much on other categories, but briefly started using the HARDSKILL label on another category. But I’m concerned that I am going to overload this label with too much information because of all the different fields of work. I need the model to work across any random piece of data and extract the relevant keywords. Would you recommend making new labels such as HARDSKILL_HEALTH or HARDSKILL_EDUCATION and having those compete with each other, even though there would be some overlapping keywords? Or would you recommend taking in a bunch of data across all the categories under the initial HARDSKILL label in the model?

As a note, my method was to use ner.make-gold, then batch-train the NER model, and then repeat make-gold until I had a satisfactory model for the given category.

Do you have an example of a “hard skill” in the health or education domain? I’m having a hard time imagining what those categories include.

If you already feel like the categories are too broad and the boundaries are too fuzzy, it’s possibly an indicator that the model may also struggle to learn the distinction. Generic categories are usually much better than very specific categories that take too many other aspects into account (e.g. domain, skill type etc). If the labels are ambiguous and difficult to annotate, you’re also much more likely to introduce inconsistencies and human error, and produce less useful data that way.

The task might become easier if you can come up with a set of top-level categories that work across domains (similar to programming languages, programs, tools etc.).

You could also try framing the domain (health, education) as a text classification problem and try to predict the topic of the whole job description. Doing this as a per-sentence or per-paragraph classification task might make sense here – many descriptions probably include a lot of sentences that are generic and not domain-specific, but some sections would likely score very high for the domain like EDUCATION or HEALTH. Based on those scores, you could then determine what the document is about and what domain (most of) the specific skills refer to.
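
One simple way to go from per-sentence scores to a document-level domain is to count how many sentences score above a threshold for each label and pick the most frequent one. The sketch below assumes the scores are already available as one dict per sentence; the function name, the 0.8 threshold and the voting rule are illustrative choices, not something prescribed in this thread.

```python
from collections import Counter


def document_domain(sentence_scores, threshold=0.8):
    """Pick the domain with the most high-scoring sentences.

    sentence_scores: list of dicts mapping label -> score, one per sentence.
    Returns None if no sentence scores at or above the threshold.
    """
    votes = Counter()
    for scores in sentence_scores:
        for label, score in scores.items():
            # Generic sentences score low for every domain and are ignored.
            if score >= threshold:
                votes[label] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

Counting only confident sentences keeps the many generic, non-domain-specific sentences from diluting the result; averaging all scores would be an alternative worth comparing.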