The effect of segmentation on NER training

wpm · March 20, 2018, 4:28pm

Right now I’m gathering annotations for my models using prodigy ner.teach --unsegmented. I use the unsegmented switch because my documents are relatively short (a couple of paragraphs) and it helps to see the entire context when annotating.

When I train a spaCy model on these annotations, what segmentation does it use? Does it learn to predict the annotated spans within the context of the entire document, or does it break the documents down into smaller segments such as sentences? Should I segment the annotated documents myself (maybe only train on sentences containing annotated spans) or will spaCy do the right thing?

honnibal · March 21, 2018, 12:17pm

spaCy’s NER runs over the whole document. It’s aware of the is_sent_start attribute, and won’t predict an entity that crosses a sentence boundary.

I would say it won’t matter much. I tried to make the NER less sensitive to these things (which is why I used a CNN instead of a BiLSTM). It mostly just looks at a small window around the entity.
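
Since the model won’t predict an entity that crosses a sentence boundary, it can be worth checking whether any annotated spans do. The following is only a rough sketch in plain Python, not spaCy’s or Prodigy’s own logic. It assumes examples are dicts with a "text" string and a "spans" list holding character "start" and "end" offsets. The helper names sentence_starts and spans_crossing_boundaries are hypothetical, and the regex splitter is a naive stand-in for whatever sentence segmentation your pipeline actually uses.

```python
import re


def sentence_starts(text):
    """Return character offsets where sentences start.

    Assumption: a naive rule where a sentence ends at '.', '!' or '?'
    followed by whitespace. A real pipeline would use its own segmenter.
    """
    starts = [0]
    for match in re.finditer(r"[.!?]\s+", text):
        if match.end() < len(text):
            starts.append(match.end())
    return starts


def spans_crossing_boundaries(example):
    """Return the spans that have a sentence start strictly inside them."""
    starts = sentence_starts(example["text"])
    crossing = []
    for span in example["spans"]:
        if any(span["start"] < s < span["end"] for s in starts):
            crossing.append(span)
    return crossing


# Hypothetical example: the second span runs over the sentence break.
example = {
    "text": "I work at Acme Corp. Berlin is where I live.",
    "spans": [
        {"start": 10, "end": 19, "label": "ORG"},
        {"start": 10, "end": 27, "label": "ORG"},
    ],
}

for span in spans_crossing_boundaries(example):
    print(example["text"][span["start"]:span["end"]], span["label"])
```

Spans flagged this way are candidates for review, since an entity that straddles a boundary is one the model could not reproduce under the behaviour described above.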
