Hi @Aditya_varma_10!
Thanks for your question and welcome to the Prodigy community!
Typically, yes. The docs cover this:
If your documents are longer than a few hundred words each, we recommend applying the annotations to smaller sections of the document. Often paragraphs work well. Breaking large documents up into chunks lets the annotator focus on smaller pieces of text at a time, which helps them move through the data more consistently. It also gives you finer-grained labels: you get to see which paragraphs were marked as indicating a label, which makes it much easier to review the decisions later.
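If it helps, here's a minimal sketch of that chunking step, assuming Prodigy-style task dicts with a `text` key and paragraph boundaries on blank lines (the function name and `meta` keys are just illustrative, not a Prodigy API):

```python
import re

def split_into_paragraphs(stream):
    """Split each incoming task into one task per paragraph.

    Assumes Prodigy-style task dicts with a "text" key; paragraphs
    are separated by blank lines. The source meta is copied over so
    each chunk can be traced back to its original document.
    """
    for task in stream:
        for i, para in enumerate(re.split(r"\n\s*\n", task["text"])):
            para = para.strip()
            if para:
                yield {**task, "text": para,
                       "meta": {**task.get("meta", {}), "para": i}}

docs = [{"text": "First paragraph.\n\nSecond paragraph.", "meta": {"doc": 1}}]
chunks = list(split_into_paragraphs(docs))
# → two tasks, one per paragraph, each carrying {"doc": 1, "para": i}
```

You can feed a generator like this into a custom recipe's stream before the tasks reach the annotator.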
It's important to know that `manual` recipes do not do sentence segmentation by default, while `teach` and `correct` recipes do segment sentences by default. You can also use `split_sentences` yourself.
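Prodigy's `split_sentences` preprocessor (from `prodigy.components.preprocess`) uses a spaCy pipeline to segment the stream. As a rough stdlib-only illustration of the effect (a real recipe should use the actual preprocessor; regex splitting like this is naive):

```python
import re

def naive_split_sentences(stream):
    # Rough stand-in for prodigy.components.preprocess.split_sentences:
    # splits on sentence-final punctuation followed by whitespace.
    # Real segmentation should come from a spaCy pipeline instead.
    for task in stream:
        for sent in re.split(r"(?<=[.!?])\s+", task["text"]):
            if sent:
                yield {**task, "text": sent}

tasks = [{"text": "Sentence one. Sentence two! Sentence three?"}]
sents = [t["text"] for t in naive_split_sentences(tasks)]
# → ["Sentence one.", "Sentence two!", "Sentence three?"]
```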
What suggester function are you using? If you're using the default `ngrams` suggester, then yes, this can drastically slow down training, especially on very long texts.
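To see why long texts hurt, note that an n-gram suggester proposes candidate spans roughly linearly in document length *per n-gram size*, so every long document multiplies the work the span scorer has to do. A back-of-the-envelope sketch (assuming sizes 1–3, which is a common default):

```python
def ngram_candidates(num_tokens, sizes=(1, 2, 3)):
    # Number of candidate spans an n-gram suggester proposes for a
    # document of num_tokens tokens, for the given n-gram sizes.
    return sum(max(num_tokens - n + 1, 0) for n in sizes)

# A 50-token paragraph vs. a 2000-token document:
short_doc = ngram_candidates(50)    # 50 + 49 + 48 = 147
long_doc = ngram_candidates(2000)   # 2000 + 1999 + 1998 = 5997
```

Chunking into paragraphs keeps the candidate count per task small, which is another reason the advice above helps.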
Also, this post suggests ways to break down the problem:
Also, try running `spacy debug data` (e.g., after running `data-to-spacy`) to get stats about your span lengths:
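`debug data` will report these stats for you; if you want a quick look straight from your exported annotations, a stdlib-only sketch (assuming Prodigy-style tasks with a `spans` list of `start`/`end` character offsets) might be:

```python
from statistics import mean, median

def span_length_stats(tasks):
    # Collect character lengths of all annotated spans across tasks.
    lengths = [
        span["end"] - span["start"]
        for task in tasks
        for span in task.get("spans", [])
    ]
    return {"n": len(lengths), "mean": mean(lengths),
            "median": median(lengths), "max": max(lengths)}

tasks = [
    {"text": "...", "spans": [{"start": 0, "end": 5}, {"start": 10, "end": 30}]},
    {"text": "...", "spans": [{"start": 2, "end": 9}]},
]
stats = span_length_stats(tasks)
# span lengths are 5, 20, 7 → median 7, max 20
```

A long tail of very long spans is often a sign the annotation scheme (or the suggester sizes) needs rethinking.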
Sorry, I don't understand this comment. Are you asking whether it is better to use `spancat` instead of `ner`? If so, the answer is "it depends".
(Also, that post was originally from 2019, before `spancat` was introduced.)
There's no hard rule but the Prodigy docs give you some idea:
| Named Entity Recognition | Span Categorization |
| --- | --- |
| spans are non-overlapping syntactic units like proper nouns (e.g. persons, organizations, products) | spans are potentially overlapping units like noun phrases or sentence fragments |
| model predicts single token-based tags like `B-PERSON` with one tag per token | model predicts scores and labels for suggested spans |
| takes advantage of clear token boundaries | less sensitive to exact token boundaries |
A good rule of thumb: if the meaning of your spans changes when you rearrange the words, use `ner`. Alternatively, if you have more flexibility, i.e. the words within a span can be rearranged while keeping generally the same meaning, then `spancat` may work better.
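The per-token difference in the table above is easy to see in code: BIO-style `ner` tags allow only one label per token, so overlapping spans can't be represented, while `spancat` simply scores each candidate span independently. A hypothetical sketch using token indices:

```python
def spans_to_bio(num_tokens, spans):
    # spans: list of (start_token, end_token_exclusive, label) tuples.
    # Returns one BIO tag per token, or None if spans overlap --
    # the case ner cannot represent but spancat handles naturally.
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if any(tags[i] != "O" for i in range(start, end)):
            return None  # overlap: not expressible as one tag per token
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

tokens = ["Acme", "Corp", "engineers"]
print(spans_to_bio(len(tokens), [(0, 2, "ORG")]))               # ['B-ORG', 'I-ORG', 'O']
print(spans_to_bio(len(tokens), [(0, 2, "ORG"), (1, 3, "ROLE")]))  # None: overlap
```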
Also, if you haven't seen it, we have a case study project where we compared `ner` vs. `spancat` performance.
Yes, see this post:
FYI, for spaCy-specific questions (e.g., training, compute, GPU), I recommend searching/using the spaCy GitHub discussions forum.