Hi @Aditya_varma_10!
Thanks for your question and welcome to the Prodigy community!
Typically, yes. The docs cover this:
If your documents are longer than a few hundred words each, we recommend applying the annotations to smaller sections of the document. Often paragraphs work well. Breaking large documents up into chunks lets the annotator focus on smaller pieces of text at a time, which helps them move through the data more consistently. It also gives you finer-grained labels: you get to see which paragraphs were marked as indicating a label, which makes it much easier to review the decisions later.
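If it helps, here's a minimal sketch of that chunking step, assuming Prodigy-style task dicts with a `text` key and paragraph boundaries on blank lines (the function name and `meta` keys are just illustrative, not a Prodigy API):

```python
import re

def split_into_paragraphs(stream):
    """Split each incoming task into one task per paragraph.

    Assumes Prodigy-style task dicts with a "text" key; paragraphs
    are separated by blank lines. The source meta is copied over so
    each chunk can be traced back to its original document.
    """
    for task in stream:
        for i, para in enumerate(re.split(r"\n\s*\n", task["text"])):
            para = para.strip()
            if para:
                yield {**task, "text": para,
                       "meta": {**task.get("meta", {}), "para": i}}

docs = [{"text": "First paragraph.\n\nSecond paragraph.", "meta": {"doc": 1}}]
chunks = list(split_into_paragraphs(docs))
# → two tasks, one per paragraph, each carrying {"doc": 1, "para": i}
```

You can feed a generator like this into a custom recipe's stream before the tasks reach the annotator.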
It's important to know that `manual` recipes do not do sentence segmentation by default, while `teach` and `correct` recipes do segment sentences by default. You can also use `split_sentences` yourself.
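Prodigy's `split_sentences` preprocessor (from `prodigy.components.preprocess`) uses a spaCy pipeline to segment the stream. As a rough stdlib-only illustration of the effect (a real recipe should use the actual preprocessor; regex splitting like this is naive):

```python
import re

def naive_split_sentences(stream):
    # Rough stand-in for prodigy.components.preprocess.split_sentences:
    # splits on sentence-final punctuation followed by whitespace.
    # Real segmentation should come from a spaCy pipeline instead.
    for task in stream:
        for sent in re.split(r"(?<=[.!?])\s+", task["text"]):
            if sent:
                yield {**task, "text": sent}

tasks = [{"text": "Sentence one. Sentence two! Sentence three?"}]
sents = [t["text"] for t in naive_split_sentences(tasks)]
# → ["Sentence one.", "Sentence two!", "Sentence three?"]
```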
What suggester function are you using? If you're using the default `ngrams` suggester, then yes, this can drastically slow down training, especially on very long texts.
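To see why long texts hurt, note that an n-gram suggester proposes candidate spans roughly linearly in document length *per n-gram size*, so every long document multiplies the work the span scorer has to do. A back-of-the-envelope sketch (assuming sizes 1–3, which is a common default):

```python
def ngram_candidates(num_tokens, sizes=(1, 2, 3)):
    # Number of candidate spans an n-gram suggester proposes for a
    # document of num_tokens tokens, for the given n-gram sizes.
    return sum(max(num_tokens - n + 1, 0) for n in sizes)

# A 50-token paragraph vs. a 2000-token document:
short_doc = ngram_candidates(50)    # 50 + 49 + 48 = 147
long_doc = ngram_candidates(2000)   # 2000 + 1999 + 1998 = 5997
```

Chunking into paragraphs keeps the candidate count per task small, which is another reason the advice above helps.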
Also, this post suggests ways to break down the problem:
Also, try running `spacy debug data` (e.g., after running `data-to-spacy`) to get stats about your span lengths:
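`debug data` will report these stats for you; if you want a quick look straight from your exported annotations, a stdlib-only sketch (assuming Prodigy-style tasks with a `spans` list of `start`/`end` character offsets) might be:

```python
from statistics import mean, median

def span_length_stats(tasks):
    # Collect character lengths of all annotated spans across tasks.
    lengths = [
        span["end"] - span["start"]
        for task in tasks
        for span in task.get("spans", [])
    ]
    return {"n": len(lengths), "mean": mean(lengths),
            "median": median(lengths), "max": max(lengths)}

tasks = [
    {"text": "...", "spans": [{"start": 0, "end": 5}, {"start": 10, "end": 30}]},
    {"text": "...", "spans": [{"start": 2, "end": 9}]},
]
stats = span_length_stats(tasks)
# span lengths are 5, 20, 7 → median 7, max 20
```

A long tail of very long spans is often a sign the annotation scheme (or the suggester sizes) needs rethinking.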
Sorry, I don't understand this comment. Are you asking whether it is better to use `spancat` instead of `ner`? If so, the answer is "it depends".
(Also, that post was originally from 2019, before `spancat` was introduced.)
There's no hard rule but the Prodigy docs give you some idea:
| Named Entity Recognition | Span Categorization |
| --- | --- |
| spans are non-overlapping syntactic units like proper nouns (e.g. persons, organizations, products) | spans are potentially overlapping units like noun phrases or sentence fragments |
| model predicts single token-based tags like `B-PERSON` with one tag per token | model predicts scores and labels for suggested spans |
| takes advantage of clear token boundaries | less sensitive to exact token boundaries |
A good rule of thumb: if the meaning of your spans changes when you rearrange the words, use `ner`. Alternatively, if you have more flexibility, i.e. the words within a span can be rearranged while keeping generally the same meaning, then `spancat` may work better.
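The per-token difference in the table above is easy to see in code: BIO-style `ner` tags allow only one label per token, so overlapping spans can't be represented, while `spancat` simply scores each candidate span independently. A hypothetical sketch using token indices:

```python
def spans_to_bio(num_tokens, spans):
    # spans: list of (start_token, end_token_exclusive, label) tuples.
    # Returns one BIO tag per token, or None if spans overlap --
    # the case ner cannot represent but spancat handles naturally.
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if any(tags[i] != "O" for i in range(start, end)):
            return None  # overlap: not expressible as one tag per token
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

tokens = ["Acme", "Corp", "engineers"]
print(spans_to_bio(len(tokens), [(0, 2, "ORG")]))               # ['B-ORG', 'I-ORG', 'O']
print(spans_to_bio(len(tokens), [(0, 2, "ORG"), (1, 3, "ROLE")]))  # None: overlap
```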
Also, if you haven't seen it, we have a case study project where we compared `ner` vs. `spancat` performance.
Yes, see this post:
FYI, for spaCy-specific questions (e.g., training, compute, GPU), I recommend searching/using the spaCy GitHub discussions forum.