SPAN or NER for topic identification over large sentences

a-meneghini · November 10, 2022, 6:28am

Hi there, sorry if this has already been discussed. I'm working on a new project where my model should be able to identify different parts within the text of project descriptions, such as the needs of beneficiaries, aims, actions, and expected results. This kind of information can sometimes be identified within 2-3 words, but sometimes it spans 2-3 long sentences, and overlapping is not expected. I tried both NER and SPAN, but the results are quite similar when I proceed with .manual. I see in the documentation that span should be better for long texts, but it's also more time- and resource-consuming to train. Is there any reason why I should prefer span for long texts if there is no overlapping?
Thanks!

koaning · November 11, 2022, 2:04pm

The spancat algorithm and the NER algorithm in spaCy work differently under the hood.

I'm glossing over a lot of details, but very roughly: spancat generates potential candidates for a span, which are then judged by a classifier. NER goes about it differently; it tries to predict when an entity starts, and then it keeps going until the entity is predicted to stop.
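To make the contrast a bit more concrete, here is a rough pure-Python sketch of the two ideas. It is not spaCy's implementation: `ngram_candidates` and `decode_bio` are hypothetical helper names, the n-gram sizes are arbitrary, and the BIO tag sequence is only a simplified stand-in for "start, keep going, stop" (spaCy's NER is transition-based and more involved than this).

```python
from typing import List, Tuple


def ngram_candidates(tokens: List[str], sizes: List[int]) -> List[Tuple[int, int]]:
    """Spancat-style idea: enumerate every (start, end) token window.

    Each window is only a *candidate*; a separate classifier would then
    decide which label, if any, applies to it.
    """
    candidates = []
    for size in sizes:
        for start in range(len(tokens) - size + 1):
            candidates.append((start, start + size))
    return candidates


def decode_bio(tags: List[str]) -> List[Tuple[int, int, str]]:
    """NER-style idea: walk left to right, open a span on B-, close it on O."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        is_begin = tag.startswith("B-")
        is_mismatched_inside = tag.startswith("I-") and tag[2:] != label
        if is_begin or tag == "O" or is_mismatched_inside:
            # Whatever span was open ends right before this token.
            if start is not None:
                spans.append((start, i, label))
            start, label = None, None
        if is_begin:
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


tokens = "improve access to clean water".split()
candidates = ngram_candidates(tokens, sizes=[1, 2])
entities = decode_bio(["O", "B-AIM", "I-AIM", "O", "B-NEED"])
```

Note how the candidate list grows with every extra n-gram size, which is one intuition for why spancat can be heavier to train when the spans you care about are several sentences long.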

From a gut feeling perspective, this might help explain why the NER model will work better if there is a very clear boundary for an entity. Things like dates would work very well. But for things like below, the boundary is perhaps more fuzzy.

This example was taken from the spancat blogpost, which also shares some more details.

What you describe sounds like a candidate for spancat, but you may also choose to have some entities done by NER and some other ones by spancat. Another reason why some folks appreciate spancat: it allows you to play around with a confidence score. NER does not provide one, but spancat does!
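As a rough illustration of what "playing around with a confidence score" can look like, here is a minimal sketch that keeps only spans above a threshold. The tuple layout, the example scores, and the helper name `filter_by_confidence` are all made up for illustration; in spaCy itself, as far as I know, the spancat scores are stored alongside the predicted spans in `doc.spans[key].attrs["scores"]`, so you would read them from there rather than from plain tuples.

```python
from typing import List, Tuple

# (start_token, end_token, label, score) -- a hypothetical layout for this sketch
ScoredSpan = Tuple[int, int, str, float]


def filter_by_confidence(spans: List[ScoredSpan], threshold: float) -> List[ScoredSpan]:
    """Keep only the spans whose score is at least `threshold`."""
    return [span for span in spans if span[3] >= threshold]


# Invented predictions, purely for illustration.
predictions = [
    (0, 12, "NEEDS", 0.91),
    (12, 30, "ACTIONS", 0.55),
    (30, 34, "RESULTS", 0.72),
]

strict = filter_by_confidence(predictions, threshold=0.7)
lenient = filter_by_confidence(predictions, threshold=0.5)
```

A higher threshold trades recall for precision, so it is worth trying a few values against a held-out set of annotated descriptions.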

I hope this answer was somewhat useful, but if you need more information, I'd suggest asking the same question on our spaCy discussion board. This forum is watched by the Prodigy team, while the spaCy team might have other insights and advice.

a-meneghini · November 13, 2022, 10:34am

Hi Koaning, thanks so much for the explanation - and especially the spancat blogpost. I think I'll go on with spancat, especially because of the confidence score.
Thanks!
