Extracting useful information from Job description

Aditya_varma_10 · January 20, 2023, 3:23pm

Hey,

I am trying to extract some information from job descriptions such as company name, location, responsibilities, required skills etc.

I have gone through the post

https://support.prodi.gy/t/parsing-identifying-sections-in-job-descriptions/1100

There it is suggested to use a text classifier to classify sentences or paragraphs, and then to run ner on top of that.

I have a few doubts regarding this approach.

To classify sentences, should I write a separate script that splits each document into sentences and builds a dataset in which each sentence is a single input to annotate for text classification in Prodigy?
Is it better to use spancat instead of the above approach, annotating sentences as both spans and entities?

I tried training a spancat model on whole documents, with spans ranging from 1 to 20 words each. Training takes too long to complete, even on a GPU.

Is there a better way to speed up the training process?

If training takes too long, is there a way to resume it from the last checkpoint in case it is stopped before completion?

ryanwesslen · January 24, 2023, 6:52pm

hi @Aditya_varma_10!

Thanks for your question and welcome to the Prodigy community!

Typically, yes. The docs cover this:

If your documents are longer than a few hundred words each, we recommend applying the annotations to smaller sections of the document. Often paragraphs work well. Breaking large documents up into chunks lets the annotator focus on smaller pieces of text at a time, which helps them move through the data more consistently. It also gives you finer-grained labels: you get to see which paragraphs were marked as indicating a label, which makes it much easier to review the decisions later.

It's important to know that manual recipes do not do sentence segmentation by default, while teach and correct recipes do. You can also use split_sentences.
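If you would rather pre-split the documents yourself before loading them into Prodigy, a minimal sketch could look like the one below. It assumes paragraphs are separated by blank lines and writes one task per paragraph in the JSONL shape Prodigy reads (a "text" key plus an optional "meta" dict). The function names, the `min_chars` threshold, and the sample job description are made up for illustration.

```python
import json
import re


def split_into_paragraphs(doc_id, text, min_chars=20):
    """Yield one annotation task per blank-line-separated paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text)]
    for index, paragraph in enumerate(paragraphs):
        # Skip empty chunks and very short fragments such as stray headings.
        if len(paragraph) < min_chars:
            continue
        # "meta" keeps a pointer back to the source document and position.
        yield {"text": paragraph, "meta": {"doc_id": doc_id, "paragraph": index}}


def to_jsonl(tasks):
    """Serialize tasks as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(task) for task in tasks)


# Hypothetical job description, only here to make the sketch runnable.
job_description = (
    "Acme Corp is hiring a data engineer in Berlin.\n\n"
    "Responsibilities: build and maintain data pipelines.\n\n"
    "Skills\n\n"
    "Required skills: Python, SQL and experience with Airflow."
)

tasks = list(split_into_paragraphs("job-001", job_description))
print(to_jsonl(tasks))
```

Keeping the document id and paragraph index in "meta" makes it possible to stitch the paragraph-level labels back onto the full job description later.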

What suggester function are you using? If you're using the default ngram suggester, then yes, this can drastically slow down training, especially on very long texts.
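To get a feel for why this matters, here is a rough back-of-the-envelope sketch. It assumes an n-gram suggester proposes every contiguous run of tokens for each configured size, so the number of candidates the model has to score grows with both document length and the number of sizes. The helper name is hypothetical and this only counts candidates; it does not call spaCy.

```python
def count_ngram_candidates(n_tokens, sizes):
    """Count contiguous token n-grams of the given sizes in one document."""
    # A document with n_tokens tokens contains n_tokens - size + 1 n-grams
    # of a given size, or none if the document is shorter than that size.
    return sum(max(0, n_tokens - size + 1) for size in sizes)


for n_tokens in (100, 1000):
    short_spans = count_ngram_candidates(n_tokens, range(1, 4))   # sizes 1-3
    long_spans = count_ngram_candidates(n_tokens, range(1, 21))   # sizes 1-20
    print(f"{n_tokens} tokens: {short_spans} candidates (sizes 1-3), "
          f"{long_spans} candidates (sizes 1-20)")
```

Under that assumption, a 1,000-token document yields 19,810 candidate spans for sizes 1 to 20, compared with 2,997 for sizes 1 to 3. Shorter texts and a tighter range of sizes both shrink the work per document.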

Also this post suggests ways to break down the problem:

Also, try running spaCy's debug data command (e.g., after running data-to-spacy) to get stats about your span lengths.
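If you want a quick look at span lengths straight from the annotations, a small sketch like the following may help. It assumes each example carries a "spans" list whose entries have "token_start" and "token_end" indices, with "token_end" inclusive, as in Prodigy's span annotations. The sample examples and labels are invented for illustration.

```python
from collections import Counter


def span_length_counts(examples):
    """Count annotated spans by their length in tokens."""
    counts = Counter()
    for example in examples:
        # Examples without annotations may have no "spans" key at all.
        for span in example.get("spans", []):
            # token_end is treated as inclusive, hence the + 1.
            counts[span["token_end"] - span["token_start"] + 1] += 1
    return counts


# Hypothetical annotated examples, only here to make the sketch runnable.
examples = [
    {
        "text": "Acme Corp is hiring in Berlin",
        "spans": [
            {"token_start": 0, "token_end": 1, "label": "COMPANY"},
            {"token_start": 5, "token_end": 5, "label": "LOCATION"},
        ],
    },
    {
        "text": "Build and maintain data pipelines",
        "spans": [{"token_start": 0, "token_end": 4, "label": "RESPONSIBILITY"}],
    },
    {"text": "Apply now"},
]

counts = span_length_counts(examples)
for length in sorted(counts):
    print(f"{length} token(s): {counts[length]} span(s)")
```

The resulting distribution gives a rough idea of which n-gram sizes the suggester actually needs to cover, and whether a handful of very long spans is driving up the range.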

Sorry, I don't understand this comment. Are you asking whether it is better to use spancat instead of ner? If so, the answer is "it depends".

(Also, that post was originally from 2019 and was before spancat was introduced.)

There's no hard rule but the Prodigy docs give you some idea:

| Named Entity Recognition | Span Categorization |
| --- | --- |
| spans are non-overlapping syntactic units like proper nouns (e.g. persons, organizations, products) | spans are potentially overlapping units like noun phrases or sentence fragments |
| model predicts single token-based tags like `B-PERSON` with one tag per token | model predicts scores and labels for suggested spans |
| takes advantage of clear token boundaries | less sensitive to exact token boundaries |

A good rule of thumb: if the meaning of a span would change when you rearrange its words, use ner. If you have more flexibility, so that the words within a span can be rearranged and still mean roughly the same thing, then spancat may work better.

Also, if you haven't seen it, we have a case study project where we compared ner vs. spancat performance.

Yes, see this post:

FYI, for spaCy specific questions (e.g., training, compute, GPU), I recommend searching/using the spaCy GitHub discussions forum.
