Extracting skills from job postings

Hi! I am trying spaCy and Prodigy for the first time, and I would like to build an NER model to extract skills from English text. I have some doubts about the correct workflow I should follow for this task.

These are the steps I've done so far:

  1. Created a new dataset with the command:
    python -m prodigy dataset linkedin_skills_dataset "NER skills dataset"

  2. Started manual NER annotation from a text file (one sentence per line):
    python -m prodigy ner.manual linkedin_skills_dataset en_core_web_sm D:\jobs.txt --label "SKILL"

  3. Exported the annotations:
    python -m prodigy db-out linkedin_skills_dataset /tmp

At this point I notice something strange in the output annotations. For example, for the following sentence:
We are looking for an expert Hadoop developer.
where I've annotated the word Hadoop with the label SKILL, I got the following in the output file:

{"spans":[{"token_end":6,"end":35,"token_start":6,"label":"SKILL","start":29}],"answer":"accept","_view_id":"ner_manual","_input_hash":240160175,"tokens":[{"id":0,"start":0,"end":2,"text":"We"},{"id":1,"start":3,"end":6,"text":"are"},{"id":2,"start":7,"end":14,"text":"looking"},{"id":3,"start":15,"end":18,"text":"for"},{"id":4,"start":19,"end":21,"text":"an"},{"id":5,"start":22,"end":28,"text":"expert"},{"id":6,"start":29,"end":35,"text":"Hadoop"},{"id":7,"start":36,"end":45,"text":"developer"},{"id":8,"start":45,"end":46,"text":"."}],"text":"We are looking for an expert Hadoop developer.","_task_hash":1289245207,"_session_id":"linkedin_skills_dataset-default"}

Is it normal that all the tokens are included in the output? This seems different from the annotated file from this tutorial: Improve a Named Entity Model. In particular, this is the first line:

{"text":"This was taken during the Easter celebrations at Real de Catorce, MX.","spans":[{"start":66,"end":68,"text":"MX","rank":0,"label":"PRODUCT","score":0.9525150387,"source":"core_web_sm","input_hash":19964311,"answer":"reject"}],"meta":{"section":"photography","score":0.9525150387},"_input_hash":19964311,"_task_hash":1331479932,"answer":"reject"}

From what I can see, the tokens are not included, only the spans. Am I missing some parameter in the output command?

Also, I would like to ask whether for this task I should use a blank model, or if it's OK to start with a pre-existing spaCy model like en_core_web_sm. If it's better to start with a blank model, what is the right command to do so?
I've tried omitting the spacy_model parameter in the ner.manual command but got an error:

python -m prodigy ner.manual linkedin_skills_dataset D:\jobs.txt --label "SKILL"

Error: -> OSError: [E050] Can't find model 'D:\jobs.txt'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Thank you for any info!

1 Like

Hi! Your workflow looks good so far :slightly_smiling_face:

The "tokens" in the task data are only used in the ner.manual recipe and interface – you don't actually need them during training (assuming your tokenization rules don't change).

Having tokens available during manual annotation lets you highlight in a more "lazy" way – you can double-click a single word, or highlight half a word and have the selection snap to the token boundaries. You'll also be able to spot tokenization issues that could lead to problems in the model. So the recipe adds the tokens for annotation, and the property then just stays in the data.
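If you're curious, you can check that alignment yourself from the exported file. Here's a quick sketch, assuming db-out wrote the data to a file called linkedin_skills_dataset.jsonl:

    import json

    # load one exported annotation and compare the span offsets to the token offsets
    with open("linkedin_skills_dataset.jsonl", encoding="utf8") as f:
        example = json.loads(f.readline())

    for span in example["spans"]:
        token_texts = [t["text"] for t in example["tokens"][span["token_start"]:span["token_end"] + 1]]
        print(span["label"], example["text"][span["start"]:span["end"]], token_texts)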

Btw, this is also why you saw the error when you left out the model name: The ner.manual recipe needs some model that it only uses for tokenization. The model isn't updated as you annotate – that's something you can do in a later step in ner.batch-train.

If you want to start off with a blank model in ner.batch-train, you can save one out in spaCy by running something like spacy.blank("en").to_disk("/path/to/model") and then use that path as the model argument. Alternatively, the latest version of Prodigy also supports a shortcut for this in ner.batch-train: instead of the model name, just pass in blank:en.
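For example (the path here is just a placeholder):

    import spacy

    # create a blank English pipeline and save it to disk
    nlp = spacy.blank("en")
    nlp.to_disk("/path/to/blank_en_model")

You'd then pass /path/to/blank_en_model (or blank:en on recent versions) as the model argument when you train, e.g. something along the lines of python -m prodigy ner.batch-train linkedin_skills_dataset /path/to/blank_en_model --label SKILL (check ner.batch-train --help for the exact arguments of your version).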

Which approach you choose (existing model or blank) depends on the categories you want to add, and whether you want to keep the existing labels of the pre-trained model or not. Starting from scratch has some advantages, because it's "cleaner" and means you're not "fighting" any side-effects from the existing weights (e.g. if your new labels overlap with existing labels like PRODUCT). So if you can, maybe try it with a blank model first – you can always run experiments with a pre-trained model later.

Also, some related tips, in case you haven't seen them yet. You might find my NER annotation flowchart useful:

We also have a new video series on spaCy that shows an end-to-end workflow for detecting programming languages and technologies (which sounds kinda similar to your task). Here's the first episode:

2 Likes

Hi Ines, thank you very much for the great answer!

I've successfully labelled data and trained a blank model following your guidelines.

Do you think it makes sense to use ner.manual for some examples and then use ner.teach to start from a partially pre-trained model?

I am trying to extract skills from job postings on LinkedIn. Should I keep my examples at sentence level, or is it better to use the whole job posting as an example for the labelling phase?

An example of job text is the following:
<<
DATA ENGINEER
Looking for qualified Data Engineer’s to join an innovative team in Charlotte, NC, right outside of uptown. This engineer will be supporting the company’s rapidly expanding Digital Transformation products. The Select Group is looking for someone who experience working within Big Data and has strong knowledge of Hadoop ecosystem. MUST be able to work on a W2 basis to be considered.

DATA ENGINEER REQUIREMENTS

  • Data engineer utilizing big data (specifically Hadoop)
  • Strong knowledge of Hadoop ecosystem; working with Hive, HBase, PySpark, Spark
  • Ability to build data frameworks, data ingestion
  • Experience writing ETL
  • AWS knowledge, specifically with EMR (cloud native bid data platform)

DATA ENGINEER RESPONSIBILITIES
The Data Engineers will join the company’s Data Engineer Practice to support several products that are in production. Their Big Data environment is mainly within Hadoop and they are in the process of implementing AWS as well. The company’s environment is purely agile and they look for innovators who have a passion for growth and technology.

Thank you for any advice!

1 Like

If you're training an NER model, it'll pay attention to the very local context only. So you might as well label at the sentence level. It means you get to label in smaller chunks, and it also makes it easy to spot problems: if you're struggling to make a labelling decision based on just the sentence, the model will likely not be able to learn it either (because it only looks at the local context).
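If your raw data is one posting per line, a rough pre-processing sketch could look like this (jobs_raw.txt and jobs.txt are just my placeholder file names):

    import spacy

    # use a pre-trained model only for its sentence segmentation
    nlp = spacy.load("en_core_web_sm")

    with open("jobs_raw.txt", encoding="utf8") as f, open("jobs.txt", "w", encoding="utf8") as out:
        for posting in f:
            posting = posting.strip()
            if not posting:
                continue
            for sent in nlp(posting).sents:
                # one sentence per line, ready for ner.manual
                out.write(sent.text.strip() + "\n")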

Another thing you could try is to combine a text classifier with an NER model and start by predicting whether a paragraph or sentence is about skills or not. This lets you isolate the relevant sections first and makes it easier for the NER model, because it won't have to deal with all the other random stuff that's typically found in job postings.
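Just to illustrate the idea: this assumes you've separately trained a text classifier with a SKILLS label (the label name and paths are made up):

    import spacy

    # hypothetical models trained in separate steps, e.g. with
    # textcat.batch-train and ner.batch-train
    textcat_nlp = spacy.load("/path/to/skills_textcat_model")
    ner_nlp = spacy.load("/path/to/skills_ner_model")

    def extract_skills(sentences, threshold=0.5):
        for sent in sentences:
            # only run the entity recognizer on sentences the classifier
            # considers to be about skills
            if textcat_nlp(sent).cats.get("SKILLS", 0.0) >= threshold:
                yield [ent.text for ent in ner_nlp(sent).ents if ent.label_ == "SKILL"]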

Finally, if you haven't tried this already, get a rule-based baseline. Maybe you don't need to train a model at all. See how far you get with a large keyword list, some matcher rules and some clever custom logic (e.g. look for headlines including "skills", "responsibilities" etc. and use that to identify the list of skills, then match on that etc.). No matter what you're doing, this is the baseline you want to beat with your model.
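A minimal sketch of that kind of baseline with spaCy's PhraseMatcher (the keyword list here is obviously just a tiny stand-in for a real one with thousands of entries and spelling variants):

    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.blank("en")
    skills = ["Hadoop", "Hive", "HBase", "PySpark", "Spark", "AWS",
              "PHP", "JavaScript", "jQuery", "Angular", "React"]

    # match case-insensitively on the lowercase form of the tokens
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    matcher.add("SKILL", [nlp.make_doc(term) for term in skills])

    doc = nlp("Strong knowledge of Hadoop ecosystem; working with Hive, HBase, PySpark, Spark")
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)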

3 Likes

Thanks for the great suggestions!
I already have some logic in the data extraction which filters out most of the irrelevant text, but using a classifier before the NER seems like a good idea!
I am not sure how to use the matcher rules though. Are they similar to regex?
Ideally, in the end I would like to have a model which could discover new tech skills and languages. For example, in the following sentence
Minimum of 1 year of experience in newAwesomeTool is required.
I would like the model to correctly label newAwesomeTool as a skill even if it hasn't seen that particular word before.

1 Like

Yes, kind of – but instead of operating on the raw text, they let you define rules on the token level and take linguistic annotations (part-of-speech tags, dependencies etc.) into account. The documentation explains this in more detail:
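A toy example of what such a rule could look like (the pattern itself is just made up for illustration, and whether it matches depends on the tagger's predictions):

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)

    # "experience in/with" followed by one or more proper nouns,
    # e.g. "experience with Hadoop"
    pattern = [
        {"LOWER": "experience"},
        {"LOWER": {"IN": ["in", "with"]}},
        {"POS": "PROPN", "OP": "+"},
    ]
    matcher.add("SKILL_CONTEXT", [pattern])

    doc = nlp("Minimum of 1 year of experience with Hadoop is required.")
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)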

2 Likes

Hi Ines, thank you for the docs, very helpful!
I am not a linguistics or NLP expert, but I feel it's not so easy to find patterns to extract skills from text sentences like the following sample (from LinkedIn jobs):

  • Top-notch programming skills in PHP and JavaScript / jQuery
  • Working experience in web programming for WordPress and Drupal required
  • Ability to quickly learn new concepts and technologies
  • Excellent communication skills.
  • Object-oriented programming and computer science foundations
  • Software security best practices
  • HTML5, CSS, JSON, XML, AJAX, JavaScript and JavaScript frameworks ( JQuery , Angular, React etc.)
  • Current Web UI frameworks such as Bootstrap and Foundation
  • Relational database design and development
  • Agile methodologies and tools
  • Unit testing
  • 4+ years of experience working in Linux/Unix
  • Good understanding & experience with Performance and Performance tuning for complex S/W projects mainly around large scale and low latency.
  • Experience with leading Design & Architecture
  • Hadoop/Java certifications is a plus

Am I wrong to think that an NER model with manual labelling will perform better in this case?

Also, I am a bit confused about the labelling part when using the ner.manual approach. Take for example these two sentences:

  • Top-notch programming skills in PHP and JavaScript / jQuery
  • Ability to quickly learn new concepts and technologies

For the first sentence I am confident to label PHP, JavaScript and jQuery as skills and accept the example, but I am not so sure about the second sentence. I am interested in labelling hard tech skills, and this sentence is about a generic "quick learner" skill that doesn't mention specific technologies, so in this case I probably won't label anything. But should I accept the example with no labels or reject it? I am not sure how the accept/reject actions work when doing manual labelling.
Thank you for any advice!

1 Like

I think the underlying question here is what you define as a skill: "hard tech skills", i.e. technologies, are a pretty good closed category of proper nouns that you can train a named entity recognizer to predict. More abstract phrases like "good team player" are not.

If you're labelling manually, you would not highlight anything here and accept the whole example, so you can update the model with the information that this text contains no entities. This is very important – your model needs to see examples of texts with entities and texts without entities, so it can learn what is and what isn't an entity.
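So for the "Ability to quickly learn new concepts and technologies" example, the stored annotation would end up looking something like this (abbreviated, with the hashes and tokens left out):

{"text": "Ability to quickly learn new concepts and technologies", "spans": [], "answer": "accept"}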

Well, there are only so many programming languages and it's pretty easy to make comprehensive lists of different technologies and their spelling variations. There are also only so many job-relevant technologies. A few thousand keywords and a Matcher will get you pretty far and you can probably build a reasonably decent prototype of this in like one day.

Training a model does make sense if you want to be able to generalise and detect entities your model hasn't seen before that are used in similar contexts. A text classifier can help you find the most relevant parts if you're dealing with a lot of noise. But training a model is also a much more involved process and less transparent. If you structure the problem and label scheme in a suboptimal way, you can easily spend weeks on it and end up with a system that performs worse than a handful of regular expressions. So it's really about the trade-offs.

1 Like

Thank you Ines, makes perfect sense!

I would like to have a model which could generalize about tech skills in the IT field (for now) not limited to programming languages (e.g. also Docker, AWS, Git, etc). Basically I don’t want a huge static list of technologies, but something that can extract new technologies from sentences.

1 Like

Hi Ines, following your explanation, is there any case where I should perform a rejection? From what you've said, I should use only ACCEPT and IGNORE (for noisy examples) in ner.manual. Is this correct?

1 Like

Answers that are "rejected" in the manual mode will be excluded when you train – but you can use it to mark issues with the text, like problematic tokenization that you may want to fix etc. See here for more details:

1 Like

Thanks Ines, I've also watched your reject vs ignore video tutorial.
In my case I have basically 3 possible types of examples:

  1. Sentence with labels -> ACCEPT with labels
  2. Sentence with no labels -> ACCEPT without labels
  3. Noisy and/or confusing text -> IGNORE

So if I got it correctly, noisy or confusing examples should be IGNORED, while examples with processing errors (like wrong tokenization) should be REJECTED. Is that correct?