I've had success training a span categorizer on data annotated in Prodigy. Nested spans were one reason to use it, and getting output with 'fatty liver' as a problem and 'liver' as a body location has been great.
However, there are occasions where e.g. 'pre-diabetes' will return 'pre-diabetes' as a problem as well as 'diabetes' as a problem. No data was annotated this way, and in this and similar cases I would prefer the output to be just the longer span.
Is there a way to enable this behavior? I know the spancat component has a max_positive parameter, but I don't want to restrict the former case, just the latter.
Hi! I think in that case, the easiest solution would be to add a custom rule-based component that checks for overlapping spans in the doc.spans that have the same label and removes the shorter spans if necessary. This is a lot more straightforward, maintainable and reliable than trying to mess with the model weights and potentially introducing other unintended side-effects.
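To make the rule concrete, here is a minimal sketch in plain Python, with (start, end, label) tuples standing in for spaCy Span objects. The helper name `drop_nested_same_label`, the labels and the token offsets are all made up for illustration, and the sketch only handles the nested case from the question (a span fully inside a longer span with the same label), not partial overlaps.

```python
from typing import List, Tuple

# (start_token, end_token, label): a stand-in for spaCy Span objects
SpanTuple = Tuple[int, int, str]


def drop_nested_same_label(spans: List[SpanTuple]) -> List[SpanTuple]:
    """Keep a span unless another span with the same label fully contains it."""
    kept = []
    for start, end, label in spans:
        contained = any(
            o_label == label
            and o_start <= start
            and end <= o_end
            and (o_start, o_end) != (start, end)  # ignore the span itself
            for o_start, o_end, o_label in spans
        )
        if not contained:
            kept.append((start, end, label))
    return kept


# Illustrative offsets only; real offsets depend on the tokenizer.
spans = [
    (0, 3, "PROBLEM"),        # "pre-diabetes"
    (2, 3, "PROBLEM"),        # "diabetes": nested, same label, so dropped
    (5, 7, "PROBLEM"),        # "fatty liver"
    (6, 7, "BODY_LOCATION"),  # "liver": nested but different label, so kept
]
filtered = drop_nested_same_label(spans)
```

Because the check compares labels, the 'fatty liver' / 'liver' case is left alone while the 'pre-diabetes' / 'diabetes' case collapses to the longer span.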
Thanks, Ines. Something like the code below is a pretty simple workaround. A couple questions:
Is it possible to drop a span? I can capture the overlapping span and start value to remove it from a table where I unpack the spans, but to make the logic work as a component I think I'd need to modify the doc?
For annotation purposes, with the approach you suggested I'm assuming I'd need to get predictions on my text sample and then use the predictions in spans.manual; this approach would be incompatible with spans.correct?
# iterate through the spans
for i in doc.spans['sc']:
    # identify any nested spans of same label type
    if len([x for x in doc.spans['sc'] if (i.label_==x.label_) and (i.start>=x.start) and (i.end<=x.end)])>1:
        # prune span
        pass
Yes, you can write to the doc.spans["sc"] and replace it with a list of filtered Span objects.
The spans.correct recipe will show you whatever the pipeline produces in the doc.spans – so if you first run your trained spancat component, followed by your rule-based component, you will only see the filtered spans.
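As a rough illustration of writing to doc.spans, the sketch below builds a doc with a blank English pipeline, sets three spans by hand (there is no trained spancat here, and the labels and the `is_nested` helper are assumptions for the example), then replaces the group with a filtered list.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, so the spans below are set by hand.
nlp = spacy.blank("en")
doc = nlp("patient has fatty liver")
doc.spans["sc"] = [
    Span(doc, 2, 4, label="PROBLEM"),        # "fatty liver"
    Span(doc, 3, 4, label="PROBLEM"),        # "liver" (pretend false positive)
    Span(doc, 3, 4, label="BODY_LOCATION"),  # "liver"
]


def is_nested(span, group):
    """True if another span in the group has the same label and contains this one."""
    return any(
        other.label_ == span.label_
        and other.start <= span.start
        and span.end <= other.end
        and (other.start, other.end) != (span.start, span.end)
        for other in group
    )


# Assigning a plain list of Span objects replaces the existing span group.
doc.spans["sc"] = [s for s in doc.spans["sc"] if not is_nested(s, doc.spans["sc"])]
remaining = [(s.text, s.label_) for s in doc.spans["sc"]]
```

Note that assigning a list creates a fresh span group, so anything stored in the old group's attrs (such as the spancat scores) has to be filtered and written back separately.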
In case this comes up for someone else: working from Ines's guidance, I think the fix is to create a component that resolves nested spans of the same label type:
from spacy.language import Language


@Language.component("overlapping_span_filter")
def overlapping_span_filter(doc):
    """
    Rule-based component that drops spans nested inside another span
    with the same label, keeping only the longer span.
    """
    spans = doc.spans['sc']
    # lists for the spans we keep and their corresponding scores
    not_overlapping_spans = []
    scores_for_nos = []
    # iterate through the spans and scores
    for span, span_score in zip(spans, spans.attrs["scores"]):
        # same-label spans that contain this span (always includes the span itself)
        containing = [x for x in spans if (span.label_ == x.label_) and (span.start >= x.start) and (span.end <= x.end)]
        # keep the span only if nothing else contains it
        if len(containing) == 1:
            not_overlapping_spans.append(span)
            scores_for_nos.append(span_score)
    # write the filtered spans and scores back to the doc
    doc.spans['sc'] = not_overlapping_spans
    doc.spans['sc'].attrs["scores"] = scores_for_nos
    return doc
Then modify the spans.correct recipe so that the component is loaded after the spancat component in the spaCy pipeline.
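For the ordering itself, something along these lines should work. It is only a sketch: it uses a blank pipeline with an untrained spancat and a placeholder component named `demo_span_filter` (a made-up name) so that it runs without a trained model. In practice you would load your own trained pipeline and add the real filter component.

```python
import spacy
from spacy.language import Language


@Language.component("demo_span_filter")  # hypothetical name for this example
def demo_span_filter(doc):
    # Placeholder body: the real component would filter doc.spans["sc"] here.
    return doc


# Stand-in for a trained pipeline; with a real model you would use
# spacy.load(...) on your own pipeline instead.
nlp = spacy.blank("en")
nlp.add_pipe("spancat")  # untrained, only here to demonstrate ordering
nlp.add_pipe("demo_span_filter", after="spancat")
print(nlp.pipe_names)
```

The `after="spancat"` argument is what guarantees the filter sees the spans the span categorizer has already written.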