Extracting useful information from job descriptions


I am trying to extract information from job descriptions, such as company name, location, responsibilities, required skills, etc.

I have gone through this post, where it is suggested to use a text classifier to classify sentences or paragraphs and then run NER on top of that.

I have a few doubts regarding this approach.

  1. In order to classify sentences, should I write a separate script to divide each document into sentences and build a dataset where each sentence is a single input for text classification, then annotate it using Prodigy?
  2. Is it better to use spancat instead of the above approach, annotating each sentence with both spans and entities?

I tried training a spancat model on whole documents, where spans range from 1 to 20 words each. I observed that training takes too long to complete, even on a GPU.

Is there a better way to speed up the training process?

If training takes too long, is there a way to resume it from the last checkpoint in case it is stopped before completion?

hi @Aditya_varma_10!

Thanks for your question and welcome to the Prodigy community :wave:

Typically, yes. The docs cover this:

> If your documents are longer than a few hundred words each, we recommend applying the annotations to smaller sections of the document. Often paragraphs work well. Breaking large documents up into chunks lets the annotator focus on smaller pieces of text at a time, which helps them move through the data more consistently. It also gives you finer-grained labels: you get to see which paragraphs were marked as indicating a label, which makes it much easier to review the decisions later.

It's important to know that manual recipes do not do sentence segmentation by default, while teach and correct recipes do. You can also use split_sentences.
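If you do want to segment documents yourself before annotation, a plain spaCy sentencizer is enough. Here's a minimal sketch (the text and blank English pipeline are just placeholders; any pipeline with a parser or sentencizer works the same way):

```python
import spacy

# Blank pipeline with a rule-based sentencizer component.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("Acme Corp is hiring. The role is based in Berlin.")
sentences = [sent.text for sent in doc.sents]
# Each sentence can then become one text classification example.
print(sentences)
```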

What suggester function are you using? If you're using the default ngrams then yes, this can drastically slow down training, especially on very long text.
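One way to cut down the search space is to constrain the suggester's span sizes in your training config, e.g. with the built-in ngram suggester (the sizes here are illustrative; pick them from your actual span-length stats):

```ini
[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
# Only suggest spans of 1-3 tokens instead of every possible n-gram.
sizes = [1, 2, 3]
```

There's also `spacy.ngram_range_suggester.v1`, which takes `min_size`/`max_size` instead of an explicit list.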

Also this post suggests ways to break down the problem:

Also, try running debug data (e.g., after running data-to-spacy) to get stats about your span lengths:
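For example (the dataset name and paths below are placeholders for your own files):

```shell
# Export your Prodigy annotations to spaCy's binary format
prodigy data-to-spacy ./corpus --spancat your_dataset

# Then inspect span statistics and common data problems
python -m spacy debug data ./corpus/config.cfg \
  --paths.train ./corpus/train.spacy \
  --paths.dev ./corpus/dev.spacy
```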

Sorry, I don't understand this comment. Are you asking whether it is better to use spancat instead of ner? If so, the answer is "it depends".

(Also, that post was originally from 2019 and was before spancat was introduced.)

There's no hard rule but the Prodigy docs give you some idea:

| Named Entity Recognition | Span Categorization |
| --- | --- |
| spans are non-overlapping syntactic units like proper nouns (e.g. persons, organizations, products) | spans are potentially overlapping units like noun phrases or sentence fragments |
| model predicts single token-based tags like B-PERSON with one tag per token | model predicts scores and labels for suggested spans |
| takes advantage of clear token boundaries | less sensitive to exact token boundaries |

A good rule of thumb: if the meaning of your spans would change when you rearrange the words, use ner. Alternatively, if you have enough flexibility that you could rearrange the words within a span and keep generally the same meaning, then spancat may work better.

Also, if you haven't seen, we have a case study project where we compared ner vs. spancat performance.

Yes, see this post:
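As a rough sketch of the idea discussed there (the path is a placeholder): `spacy train` writes `model-best` and `model-last` to its output directory, and you can source a component from the last checkpoint in a new training run rather than starting from scratch:

```ini
[components.spancat]
source = "./output/model-last"
```

This is one common pattern; see the linked post for the details and caveats.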

FYI, for spaCy specific questions (e.g., training, compute, GPU), I recommend searching/using the spaCy GitHub discussions forum.