Spancat data "Boundary tokens are not distinct from the rest of the corpus"

Good day,

I tried to use "spacy debug data" on our spancat dataset, but I got this warning:
"Boundary tokens are not distinct from the rest of the corpus". What does this mean?


hi @joebuckle!

Thanks for your question. I'm not intimately familiar with everything about spaCy but there's some documentation from the spaCy website:

If your pipeline contains a spancat component, then this command (debug data) will also report span characteristics such as the average span length and the span (or span boundary) distinctiveness. The distinctiveness measure shows how different the tokens are with respect to the rest of the corpus using the KL-divergence of the token distributions. To learn more, you can check out Papay et al.’s work on Dissecting Span Identification Tasks with Performance Prediction (EMNLP 2020).

In the code, this comes up if your annotated spans have an average boundary distinctiveness below a set threshold.

Per the paper referenced:

boundary distinctiveness is a measure of how distinctive the starts and ends of spans are. A high boundary distinctiveness indicates that the start and end points of spans are easy to spot, while low distinctiveness indicates smooth transitions.
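
To make that a bit more concrete, here's a rough sketch of the idea in plain Python. This is not spaCy's actual implementation: the function names, the one-token window around each span, and the toy data are all my own assumptions for illustration. It compares the distribution of tokens sitting just outside the span boundaries against the distribution of tokens in the whole corpus using KL-divergence.

```python
import math
from collections import Counter


def token_distribution(tokens):
    """Relative frequency of each token in a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def kl_divergence(p, q):
    """D_KL(p || q). Assumes every token in p also occurs in q."""
    return sum(px * math.log(px / q[tok]) for tok, px in p.items())


def boundary_distinctiveness(docs, spans, window=1):
    """Hypothetical helper, not part of spaCy.

    docs:   list of documents, each a list of token strings
    spans:  list of (doc_index, start, end) with end exclusive
    window: how many tokens before/after each span count as "boundary"
    """
    corpus_tokens = [tok for doc in docs for tok in doc]
    boundary_tokens = []
    for doc_index, start, end in spans:
        doc = docs[doc_index]
        boundary_tokens.extend(doc[max(0, start - window):start])
        boundary_tokens.extend(doc[end:end + window])
    return kl_divergence(
        token_distribution(boundary_tokens),
        token_distribution(corpus_tokens),
    )


# Toy data: the span "x" is wrapped in brackets that appear nowhere else...
distinct_docs = [["a", "a", "a", "a", "[", "x", "]", "a", "a", "a"]]
# ...versus the same span surrounded by the most common token in the corpus.
smooth_docs = [["a", "a", "a", "a", "a", "x", "a", "a", "a", "a"]]
spans = [(0, 5, 6)]

distinct_score = boundary_distinctiveness(distinct_docs, spans)
smooth_score = boundary_distinctiveness(smooth_docs, spans)
print(f"distinct boundaries: {distinct_score:.3f}")
print(f"smooth boundaries:   {smooth_score:.3f}")
```

In the first toy corpus the tokens around the span are rare elsewhere, so the score is high; in the second they look like any other token, so the score is close to zero. That second case is what the warning is pointing at.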

Essentially, it seems that your spans have low boundary distinctiveness. That suggests a model may struggle to learn where your spans start and end, so they may not train as well as spans with more distinctive boundaries. This isn't a reason to throw away your annotations; it's just a warning sign that you may run into issues in training (said differently, you may not see much improvement from adding more annotations, because the boundaries are hard for the model to predict).

One thing you could do is use spans.correct to see if you can improve the quality of your annotations. I haven't tried this, but I would assume that if there is some inconsistency in the boundaries that you can correct, the boundary distinctiveness would go up (and model performance could improve).

There's a sample project that shows an example of this. It computes the span characteristics and trains on the data with both ner and spancat:

If you have further questions, I would recommend posting on the spaCy discussions forum, where the spaCy core team can explain in more detail.