Right now I’m gathering annotations for my models using prodigy ner.teach --unsegmented. I use the --unsegmented switch because my documents are relatively short (a couple of paragraphs) and it helps to see the entire context when annotating.
When I train a spaCy model on these annotations, what segmentation does it use? Does it learn to predict the annotated spans within the context of the entire document, or does it break the documents down into smaller segments such as sentences? Should I segment the annotated documents myself (maybe only train on sentences containing annotated spans) or will spaCy do the right thing?
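To make the "segment it myself" option concrete, here is a rough sketch of what I have in mind. It assumes each annotated example is a dict with "text" and "spans" (character offsets plus a label), and it uses a naive regex sentence splitter as a stand-in for a real one. The function name and the splitter are just placeholders for illustration, not anything Prodigy or spaCy provides.

```python
import re


def sentences_with_spans(example):
    """Split one annotated example into sentences, keeping only those with spans."""
    text = example["text"]
    spans = example.get("spans", [])
    segments = []
    # Naive splitter (an assumption): a sentence is a run of text ending in . ! or ?
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = match.group()
        # Trim surrounding whitespace while keeping offsets tied to the original text.
        seg_start = match.start() + (len(raw) - len(raw.lstrip()))
        seg_end = match.start() + len(raw.rstrip())
        if seg_end <= seg_start:
            continue
        # Keep spans that fall fully inside this sentence, re-based to its start.
        inside = [
            {
                "start": span["start"] - seg_start,
                "end": span["end"] - seg_start,
                "label": span["label"],
            }
            for span in spans
            if span["start"] >= seg_start and span["end"] <= seg_end
        ]
        if inside:
            segments.append({"text": text[seg_start:seg_end], "spans": inside})
    return segments


doc = {
    "text": "Apple hired Tim Cook. The weather was nice. He lives in Palo Alto.",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 12, "end": 20, "label": "PERSON"},
        {"start": 56, "end": 65, "label": "GPE"},
    ],
}

segments = sentences_with_spans(doc)
for segment in segments:
    print(segment["text"], segment["spans"])
```

With this approach the middle sentence, which has no annotated spans, would be dropped from the training data, and that is exactly the part I'm unsure about.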