SPAN or NER for topic identification over large sentences

Hi there, sorry if this has already been discussed. I'm working on a new project: my model should be able to identify different parts within texts of project descriptions, such as the needs of beneficiaries, aims, actions, and expected results. These kinds of information can sometimes be identified within 2-3 words, but sometimes span 2-3 long sentences, and overlapping is not expected. I tried both NER and spans, but the results are quite similar when I annotate with `.manual`. I see in the documentation that spans should be better for long texts, but spancat is also more time- and resource-consuming to train. Is there any reason why I should prefer spans for long texts if there is no overlapping?

The spancat algorithm and the NER algorithm in spaCy work differently under the hood.

I'm glossing over a lot of details, but very roughly: spancat tries to generate potential candidates for a span, which are then judged by a classifier. NER goes about it differently: it tries to predict where an entity starts, and then it keeps going until the entity is predicted to stop.
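To make the first stage concrete, here's a minimal sketch of an n-gram suggester, the kind of candidate generator spancat can use. This is my own simplified illustration, not spaCy's actual implementation; the token list is made up.

```python
# Simplified sketch of spancat's two-stage idea: first enumerate
# candidate spans, then let a classifier score each one. (Illustration
# only -- not spaCy's real suggester code.)

def ngram_suggester(tokens, sizes=(1, 2, 3)):
    """Enumerate every contiguous token span of the given sizes,
    returned as (start, end) index pairs."""
    candidates = []
    for n in sizes:
        for start in range(len(tokens) - n + 1):
            candidates.append((start, start + n))
    return candidates

tokens = ["improve", "water", "access", "for", "farmers"]
spans = ngram_suggester(tokens)
# 5 unigrams + 4 bigrams + 3 trigrams = 12 candidates
print(len(spans))  # -> 12
```

Each candidate would then be scored by a classifier independently, whereas NER decodes one left-to-right sequence of non-overlapping entities.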

From a gut-feeling perspective, this might help explain why the NER model will work better if there is a very clear boundary for an entity. Things like dates would work very well, but for fuzzier phrases the boundary is harder to pin down. The spancat blogpost shares some examples of this, along with more details.

What you describe sounds like a candidate for spancat, but you may also choose to handle some entities with NER and others with spancat. Another reason why some folks appreciate spancat: it lets you work with a confidence score. NER does not provide one, but spancat does!
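As a sketch of why that score is useful: you can threshold predictions to trade recall for precision. The span texts, labels, and scores below are made up for illustration (in spaCy, I believe the scores live in `doc.spans[key].attrs["scores"]` after running a trained spancat pipeline).

```python
# Hypothetical spancat-style output: (span text, label, confidence).
# The numbers are invented for this example.
scored_spans = [
    ("access to clean water", "NEED", 0.92),
    ("build three wells", "ACTION", 0.81),
    ("local farmers", "BENEFICIARY", 0.55),
    ("expected results", "RESULT", 0.31),
]

def filter_by_confidence(spans, threshold):
    """Keep only spans the classifier is sufficiently sure about."""
    return [(text, label) for text, label, score in spans if score >= threshold]

# Raising the threshold keeps fewer, more certain spans -- a knob that
# plain NER output does not expose.
print(filter_by_confidence(scored_spans, 0.5))
```

With a threshold of 0.5 the low-confidence "expected results" span is dropped; at 0.9 only the most certain span survives.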

I hope this answer was somewhat useful, but if you need more information, I might suggest asking the same question on our spaCy discussion board. This forum is watched by the Prodigy team, while the spaCy team might have other insights and advice.

Hi Koaning, thanks so much for the explanation, and especially for the spancat blogpost. I think I'll go with spancat, especially because of the confidence score.
