NER for tagging start and end of span vs. spancat

Thanks for the background!

In general, we tend to recommend shorter examples, but I can understand if that's a bit tricky here.

This is somewhat similar -- here's a quick way to split by a token and then create a new file to load as your source file (replace \xa0 with \n).

However, it doesn't do this under the constraint of keeping the newlines at the end of the doc. I'm wondering if there's some logic you could add as an if statement that would skip splitting when the newline is at the end.
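As a rough illustration of that idea, here's a minimal sketch in plain Python. The function name `split_keep_trailing` and the choice to drop empty chunks are my own assumptions, not something from the linked example: it splits on the separator but leaves any run of separators at the very end of the doc attached to the last chunk.

```python
def split_keep_trailing(text, sep="\n"):
    """Split `text` on `sep`, keeping trailing separators on the last chunk.

    Hypothetical helper for illustration only; adapt it to your own data.
    """
    # Peel off any run of separators at the very end of the doc.
    body = text
    while sep and body.endswith(sep):
        body = body[: -len(sep)]
    tail = text[len(body):]

    # Split the rest, dropping chunks that are empty or whitespace-only
    # (an assumption: blank chunks are rarely useful as examples).
    chunks = [chunk for chunk in body.split(sep) if chunk.strip()]

    # Nothing but separators (or an empty string): don't split at all.
    if not chunks:
        return [text] if text else []

    # Re-attach the trailing separators to the final chunk.
    chunks[-1] += tail
    return chunks


if __name__ == "__main__":
    doc = "First paragraph.\nSecond paragraph.\n\n"
    for chunk in split_keep_trailing(doc):
        print(repr(chunk))
```

You could then write each chunk out as one record in the new source file. If your separator is `\xa0` rather than `\n`, pass it as `sep`.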

Back to your original question: I was able to talk with a member of the spaCy dev team, who suggested that spancat would likely be a better fit. ner doesn't predict entities across sentence boundaries, and you have more than 100 spans where that's the case (and they aren't easy to drop).

Hope this helps and let us know if you have further questions!