The effect of segmentation on NER training

Right now I’m gathering annotations for my models using prodigy ner.teach --unsegmented. I use the --unsegmented flag because my documents are relatively short (a couple of paragraphs) and it helps to see the entire context while annotating.

When I train a spaCy model on these annotations, what segmentation does it use? Does it learn to predict the annotated spans within the context of the entire document, or does it break the documents down into smaller segments such as sentences? Should I segment the annotated documents myself (maybe only train on sentences containing annotated spans), or will spaCy do the right thing?


spaCy’s NER runs over the whole document. It’s aware of the is_sent_start attribute, and won’t predict an entity that crosses a sentence boundary.
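To see the boundaries the NER respects, you can inspect is_sent_start directly. A minimal sketch, assuming spaCy v3 (the blank pipeline and the rule-based sentencizer avoid downloading a pretrained model; the example sentence is made up):

```python
import spacy

# Blank English pipeline with the rule-based sentencizer, which sets
# is_sent_start -- the same attribute the NER consults, so it won't
# predict a span that straddles one of these boundaries.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("Apple is looking at U.K. startups. It may buy one soon.")

# Tokens that open a sentence
starts = [t.text for t in doc if t.is_sent_start]
print(starts)  # ['Apple', 'It']
```

Note that "U.K." stays inside the first sentence: the sentencizer only splits on standalone punctuation tokens, not on periods glued to a token by a tokenizer exception.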

I would say it won’t matter much. I tried to make the NER less sensitive to these things (which is why I used a CNN instead of a BiLSTM). It mostly just looks at a small window around the entity.
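If you do want to try sentence-level training anyway, the split the question suggests is easy to do yourself. A hypothetical sketch in plain Python (split_example and the (start, end) sentence offsets are illustrative, not a Prodigy or spaCy API; the dict layout mirrors Prodigy's {"text": ..., "spans": [...]} format):

```python
def split_example(example, sent_bounds):
    """Split one document-level annotation into per-sentence examples.

    sent_bounds: list of (start, end) character offsets, one per sentence.
    Only sentences containing at least one annotated span are kept.
    """
    out = []
    for start, end in sent_bounds:
        spans = [
            {"start": s["start"] - start, "end": s["end"] - start, "label": s["label"]}
            for s in example["spans"]
            if s["start"] >= start and s["end"] <= end
        ]
        if spans:  # drop sentences with no annotated spans
            out.append({"text": example["text"][start:end], "spans": spans})
    return out

example = {
    "text": "Apple hired John. The weather was nice.",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 12, "end": 16, "label": "PERSON"},
    ],
}
# Two sentences: characters [0, 17) and [18, 39)
print(split_example(example, [(0, 17), (18, 39)]))
```

This drops the second sentence entirely because it has no spans, which matches the "only train on sentences containing annotated spans" idea, though per the answer above it's unlikely to change the results much.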