The --unsegmented flag only means that Prodigy won't apply the sentence segmenter to split your texts into sentences. If your examples are already pre-segmented, this is fine – but if your data contains lots of really long texts, you probably want to split them, because otherwise training may be slow and the long texts may throw off the model. So it should be fine in your case.
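
If you do end up wanting to pre-segment your data yourself, here's a rough sketch of how you could do it before importing – the file names are placeholders, and it assumes spaCy v2.x (matching the v2.1 discussion below), where the sentencizer is created via nlp.create_pipe:

```python
# Sketch: split long texts into one example per sentence before importing.
# Assumes a JSONL file of {"text": ...} records and spaCy v2.x.
import json
import spacy

nlp = spacy.blank("en")
sentencizer = nlp.create_pipe("sentencizer")  # rule-based sentence boundaries
nlp.add_pipe(sentencizer)

with open("raw_texts.jsonl", encoding="utf8") as f_in, \
     open("segmented_texts.jsonl", "w", encoding="utf8") as f_out:
    for line in f_in:
        eg = json.loads(line)
        doc = nlp(eg["text"])
        for sent in doc.sents:
            f_out.write(json.dumps({"text": sent.text}) + "\n")
```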
Ahh, I meant examples that consist of only one token. So basically, where "text" has only one word. Do you find any of those as well?
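
A quick way to check is to loop over your data – for instance, something like this sketch, which assumes you've exported the dataset to a JSONL file first (e.g. with prodigy db-out your_dataset > annotations.jsonl, where the dataset name and file name are just placeholders):

```python
# Sketch: flag examples whose "text" is only one word (or empty).
import json

with open("annotations.jsonl", encoding="utf8") as f:
    for i, line in enumerate(f):
        eg = json.loads(line)
        # crude whitespace tokenization is enough to spot one-word texts
        if len(eg["text"].split()) <= 1:
            print(i, repr(eg["text"]))
```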
Do any of the entity spans you've annotated start or end on whitespace characters? In spaCy v2.1, it's now "illegal" for the named entity recognizer to predict entities that start or end with whitespace, or consist of only whitespace – for example, "\n", but also "hello\n". This should be a really helpful change, because those entities are pretty much always wrong, and making them "illegal" limits the options and moves the entity recognizer towards correct predictions. But it also means that if your data contains training examples like this, you probably want to remove or fix them.
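
Here's a sketch of how you could find those examples in the same JSONL export – it assumes the "spans" entries use the usual "start" and "end" character offsets into "text":

```python
# Sketch: find examples whose annotated spans start/end on whitespace,
# or consist of only whitespace.
import json

def has_whitespace_entity(eg):
    for span in eg.get("spans", []):
        ent_text = eg["text"][span["start"]:span["end"]]
        # whitespace-only span, or leading/trailing whitespace in the span
        if not ent_text.strip() or ent_text != ent_text.strip():
            return True
    return False

with open("annotations.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]

bad = [eg for eg in examples if has_whitespace_entity(eg)]
print(len(bad), "examples with whitespace at entity boundaries")
```

Once you've found them, you can either drop those examples or trim the whitespace from the offending spans and adjust the offsets accordingly.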