How to Restrict Nested Spans of Same Label Type in Spancat

Hi,

I've had success training a span categorizer using data annotated in Prodigy. Nested spans were one reason to use it, and getting output with 'fatty liver' as a problem and 'liver' as a body location has been great.

However, there are occasions where e.g. 'pre-diabetes' will return 'pre-diabetes' as a problem as well as 'diabetes' as a problem. No data was annotated this way, and I would prefer the output in this and other cases was just the longer span.

Is there a way to enable this behavior? I know the spancat component has a max_positive parameter, but I don't want to restrict the former case, just the latter.

Thank you!

Hi! I think in that case, the easiest solution would be to add a custom rule-based component that checks for overlapping spans in the doc.spans that have the same label and removes the shorter spans if necessary. This is a lot more straightforward, maintainable and reliable than trying to mess with the model weights and potentially introducing other unintended side-effects.
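
As a rough illustration of that rule, here is a minimal sketch that works on plain (start, end, label) tuples rather than real spaCy Span objects. The function name drop_nested_same_label and the PROBLEM and BODY_LOC labels are assumptions made up for the example, not part of spaCy or Prodigy.

```python
from typing import List, Tuple

# (start_token, end_token, label): a hypothetical stand-in for spaCy Span objects
SpanTuple = Tuple[int, int, str]


def drop_nested_same_label(spans: List[SpanTuple]) -> List[SpanTuple]:
    """Keep a span unless another span with the same label fully contains it."""
    kept = []
    for i, (start, end, label) in enumerate(spans):
        nested = False
        for j, (other_start, other_end, other_label) in enumerate(spans):
            if j == i or other_label != label:
                continue
            contains = other_start <= start and end <= other_end
            identical = (other_start, other_end) == (start, end)
            # For exact duplicates, only the later copy counts as nested,
            # so one copy always survives.
            if contains and (not identical or j < i):
                nested = True
                break
        if not nested:
            kept.append((start, end, label))
    return kept


# Tokens: 0 "pre", 1 "-", 2 "diabetes", 3 "and", 4 "fatty", 5 "liver"
spans = [
    (0, 3, "PROBLEM"),   # pre-diabetes
    (2, 3, "PROBLEM"),   # diabetes: nested, same label, so it is dropped
    (4, 6, "PROBLEM"),   # fatty liver
    (5, 6, "BODY_LOC"),  # liver: nested, but a different label, so it is kept
]
print(drop_nested_same_label(spans))
```

Because the comparison only looks at spans with the same label, the 'fatty liver' / 'liver' case is left alone while the 'pre-diabetes' / 'diabetes' case is reduced to the longer span.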

Thanks, Ines. Something like the code below is a pretty simple workaround. A couple questions:

  • Is it possible to drop a span? I can capture the overlapping span and start value to remove it from a table where I unpack the spans, but to make the logic work as a component I think I'd need to modify the doc?

  • For annotation purposes, with the approach you suggested I'm assuming I'd need to get predictions on my text sample and then use the predictions in spans.manual; this approach would be incompatible with spans.correct?

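# iterate through the spans
for span in doc.spans['sc']:
    # collect same-label spans that contain this span (the span itself always matches)
    containing = [x for x in doc.spans['sc']
                  if span.label_ == x.label_ and span.start >= x.start and span.end <= x.end]
    if len(containing) > 1:
        # nested inside a longer span of the same label
        pass  # prune span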

Yes, you can write to the doc.spans["sc"] and replace it with a list of filtered Span objects.

The spans.correct recipe will show you whatever the pipeline produces in the doc.spans – so if you first run your trained spancat component, followed by your rule-based component, you will only see the filtered spans.

In case this comes up for someone else: working from Ines's guidance, I think the fix is to create a component that resolves nested spans of the same label type, keeping only the longest one:

from spacy.language import Language

@Language.component("overlapping_span_filter")
def overlapping_span_filter(doc):
    """
    Rule-based component that drops spans nested inside a longer span
    with the same label.
    """
    spans = list(doc.spans['sc'])
    scores = doc.spans['sc'].attrs["scores"]
    # lists for the spans we keep and their corresponding scores
    not_overlapping_spans = []
    scores_for_nos = []
    for span, span_score in zip(spans, scores):
        # same-label spans that contain this span; the span always contains
        # itself, so exactly one match means it is not nested in another span
        containing = [x for x in spans
                      if span.label_ == x.label_
                      and span.start >= x.start
                      and span.end <= x.end]
        if len(containing) == 1:
            not_overlapping_spans.append(span)
            scores_for_nos.append(span_score)
    # overwrite the span group, then re-attach the filtered scores
    doc.spans['sc'] = not_overlapping_spans
    doc.spans['sc'].attrs["scores"] = scores_for_nos
    return doc

Then modify the spans.correct recipe so that the component is added after the spancat component when the spaCy model is loaded:

# overlapping_span_filter must already be registered (defined or imported) at this point
nlp = spacy.load(spacy_model)
nlp.add_pipe("overlapping_span_filter", after="spancat")
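
To try the filtering logic without a trained model, one option is to build a Doc by hand and attach the spans and scores manually. This is only a sketch: filter_nested is a hypothetical plain-function version of the component above (so nothing needs to be registered), and the labels and scores are made up for the example.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab


def filter_nested(doc, key="sc"):
    """Hypothetical plain-function version of the filter, for quick checks only."""
    spans = list(doc.spans[key])
    scores = list(doc.spans[key].attrs["scores"])
    kept = []
    for span, score in zip(spans, scores):
        # Same-label spans containing this one; the span always matches itself.
        containing = [
            other for other in spans
            if other.label_ == span.label_
            and other.start <= span.start
            and span.end <= other.end
        ]
        if len(containing) == 1:
            kept.append((span, score))
    # Replace the span group, then re-attach the scores that survived.
    doc.spans[key] = [span for span, _ in kept]
    doc.spans[key].attrs["scores"] = [score for _, score in kept]
    return doc


# Hand-built Doc so the example does not depend on a tokenizer or model.
words = ["pre", "-", "diabetes", "and", "fatty", "liver"]
spaces = [False, False, True, True, True, False]
doc = Doc(Vocab(), words=words, spaces=spaces)

doc.spans["sc"] = [
    Span(doc, 0, 3, label="PROBLEM"),   # pre-diabetes
    Span(doc, 2, 3, label="PROBLEM"),   # diabetes (nested, same label)
    Span(doc, 4, 6, label="PROBLEM"),   # fatty liver
    Span(doc, 5, 6, label="BODY_LOC"),  # liver (nested, different label)
]
doc.spans["sc"].attrs["scores"] = [0.9, 0.6, 0.8, 0.7]  # made-up scores

doc = filter_nested(doc)
for span, score in zip(doc.spans["sc"], doc.spans["sc"].attrs["scores"]):
    print(span.text, span.label_, score)
```

The nested 'diabetes' span should be gone after filtering, while 'liver' stays because its label differs from that of 'fatty liver'.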