Extracting useful information from Job description

hi @Aditya_varma_10!

Thanks for your question and welcome to the Prodigy community :wave:

Typically, yes. The docs cover this:

If your documents are longer than a few hundred words each, we recommend applying the annotations to smaller sections of the document. Often paragraphs work well. Breaking large documents up into chunks lets the annotator focus on smaller pieces of text at a time, which helps them move through the data more consistently. It also gives you finer-grained labels: you get to see which paragraphs were marked as indicating a label, which makes it much easier to review the decisions later.
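For example, here's a minimal pre-processing sketch, assuming your source data is a JSONL file with a `"text"` field per job description and that paragraphs are separated by blank lines (the file names are placeholders):

```python
import srsly

def split_into_paragraphs(path):
    # One Prodigy task per paragraph instead of per full job description
    for eg in srsly.read_jsonl(path):
        for para in eg["text"].split("\n\n"):
            para = para.strip()
            if para:
                yield {"text": para, "meta": eg.get("meta", {})}

# Placeholder file names - adjust to your own data
srsly.write_jsonl("job_descriptions_paragraphs.jsonl",
                  split_into_paragraphs("job_descriptions.jsonl"))
```

You can then point ner.manual / spans.manual at the paragraph-level file as usual.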

It's important to know that manual recipes don't do sentence segmentation by default, while teach and correct recipes do. You can also segment the stream yourself with split_sentences (see the sketch below).
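Here's a rough sketch of using `split_sentences`, assuming a pipeline that sets sentence boundaries and a JSONL file with a `"text"` field (exact import paths may vary across Prodigy versions):

```python
import spacy
import srsly
from prodigy.components.preprocess import split_sentences

nlp = spacy.load("en_core_web_sm")  # any pipeline that sets sentence boundaries
stream = srsly.read_jsonl("job_descriptions.jsonl")  # placeholder file name

# Yields one task per sentence instead of one task per document
for task in split_sentences(nlp, stream):
    print(task["text"])
```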

What suggester function are you using? If you're using the default n-gram suggester, then yes, this can drastically slow down training, especially on very long texts.
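If the suggester turns out to be the bottleneck, one lever is the n-gram sizes. Here's a minimal sketch of setting them when adding a spancat component (the same setting lives under `[components.spancat.suggester]` in a training config; the sizes below are placeholders, so pick a range that covers the span lengths you actually annotate):

```python
import spacy

nlp = spacy.blank("en")
# Cap candidate spans at 1-3 token n-grams so the suggester doesn't
# enumerate huge numbers of candidates on long job descriptions
spancat = nlp.add_pipe(
    "spancat",
    config={"suggester": {"@misc": "spacy.ngram_suggester.v1", "sizes": [1, 2, 3]}},
)
```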

Also, this post suggests ways to break down the problem:

Also, try running spaCy's `debug data` command (e.g., on the output of `data-to-spacy`) to get stats about your span lengths:

Sorry, I don't understand this comment. Are you asking whether it's better to use spancat instead of ner? If so, the answer is "it depends".

(Also, that post was originally from 2019 and was before spancat was introduced.)

There's no hard rule, but the Prodigy docs give you some idea:

| Named Entity Recognition | Span Categorization |
| --- | --- |
| spans are non-overlapping syntactic units like proper nouns (e.g. persons, organizations, products) | spans are potentially overlapping units like noun phrases or sentence fragments |
| model predicts single token-based tags like B-PERSON with one tag per token | model predicts scores and labels for suggested spans |
| takes advantage of clear token boundaries | less sensitive to exact token boundaries |

A good rule of thumb: if the meaning of a span changes when you rearrange its words, use ner. If you can rearrange the words within a span and it keeps roughly the same meaning, spancat may work better.
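To make the structural difference concrete, here's a small illustration with hypothetical text and labels: ner writes to `doc.ents`, which must be non-overlapping, while spancat works with span groups in `doc.spans`, which may overlap:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("senior machine learning engineer")

# ner: doc.ents must be non-overlapping, token-aligned spans
doc.ents = [Span(doc, 0, 4, label="JOB_TITLE")]

# spancat: span groups may overlap and nest
doc.spans["sc"] = [
    Span(doc, 0, 4, label="JOB_TITLE"),  # "senior machine learning engineer"
    Span(doc, 1, 3, label="SKILL"),      # "machine learning"
]
```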

Also, if you haven't seen it, we have a case study project where we compared ner vs. spancat performance.

Yes, see this post:

FYI, for spaCy specific questions (e.g., training, compute, GPU), I recommend searching/using the spaCy GitHub discussions forum.