Expanding NER to include neighbouring tokens

Imagine a sentence like

Earnings after tax (EAT) increased by 34%, amounting to SEK 113 million ( 85 )

The current model might identify 113 million as a MONEY entity, but I’d like it to identify SEK 113 million ( 85 ) as the MONEY entity instead. What is the best approach to let the model capture this? I suppose some cases could be caught with a PhraseMatcher, but what would the approach be if I want the model to learn this instead? As it stands, ner.teach never suggests the whole phrase as a MONEY entity.

I mean, in theory, you could manually label some data that annotates the spans like this, update the model with it and hope that it adjusts to your new concept of MONEY. The question is whether it’s worth it. If the pre-trained model you’re using was trained with an annotation scheme that did not consider the currency part of the entity, all its current weights are based on that policy. It might take a lot of work and data to teach it a very different definition of the entity type MONEY.

Edit: Just checked and it seems like the annotation scheme does include the currency by default. It just doesn’t seem to be recognised correctly in this case. I can double-check to see how MONEY was annotated in the corpus we’re using for English.

The alternative would be writing a rule-based component, yes. Here’s an example of something you could do in spaCy by iterating over the entities, looking at the previous token and expanding the span if it has a certain text value:

from spacy.tokens import Span

new_ents = []  # Collect the updated entities here

for ent in doc.ents:
    # Only look at money entities that have a preceding token
    if ent.label_ == "MONEY" and ent.start > 0:
        prev_token = doc[ent.start - 1]
        if prev_token.text in ("SEK", "USD", "EUR"):  # etc.
            # Create a new Span reflecting the expanded entity
            new_ents.append(Span(doc, ent.start - 1, ent.end, label=ent.label))
            continue
    # Keep every other entity (including non-MONEY ones) unchanged
    new_ents.append(ent)

doc.ents = new_ents
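To try the rule without loading a trained model, here is a minimal sketch that wraps the same logic in a function and runs it on a hand-built Doc. The function name `expand_money_entities`, the `CURRENCIES` tuple and the manually assigned MONEY span are illustrative assumptions, and passing a string label to `Span` assumes spaCy v2.1 or later.

```python
import spacy
from spacy.tokens import Doc, Span

CURRENCIES = ("SEK", "USD", "EUR")  # assumed list, extend as needed


def expand_money_entities(doc):
    """Prepend a directly preceding currency code to each MONEY entity."""
    new_ents = []
    for ent in doc.ents:
        has_prev = ent.start > 0
        if ent.label_ == "MONEY" and has_prev and doc[ent.start - 1].text in CURRENCIES:
            # Expanded span: one token to the left, same end, same label
            new_ents.append(Span(doc, ent.start - 1, ent.end, label=ent.label))
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc


# Build a Doc by hand and mark "113 million" as MONEY,
# standing in for what a statistical model might predict.
nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["amounting", "to", "SEK", "113", "million"])
doc.ents = [Span(doc, 3, 5, label="MONEY")]

doc = expand_money_entities(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
```

Note that this sketch does not check whether the preceding token already belongs to another entity; in that case assigning `doc.ents` would fail because entity spans may not overlap.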

Another thing to think about: What’s your end goal once you have the money entities? Will you be converting them to some type of structured format like {'amount': 113000000, 'currency': 'SEK'}? If so, it might make more sense to leave the entities the way they are and add custom attributes like ._.currency to them.

from spacy.tokens import Span

def get_currency(span):
    # Take a span (e.g. an entity span) and, if it's MONEY, try to resolve its currency
    if span.label_ == "MONEY" and span.start > 0:
        prev_token = span.doc[span.start - 1]
        if prev_token.text in ("SEK", "USD", "EUR"):
            return prev_token.text
    # and so on...
    return None

Span.set_extension("currency", getter=get_currency)
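For the structured format mentioned above, the amount has to be resolved as well. The following is a rough, pure-Python sketch of one way to do that from the entity text. The helper `parse_money` and the `MULTIPLIERS` and `SYMBOLS` tables are hypothetical names, not part of spaCy, and the sketch deliberately ignores thousands separators and other number formats.

```python
import re
from decimal import Decimal

# Hypothetical lookup tables; extend them for your own data.
MULTIPLIERS = {"thousand": 10**3, "million": 10**6, "billion": 10**9}
SYMBOLS = {"$": "USD", "€": "EUR"}

NUMBER_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(thousand|million|billion)?", re.IGNORECASE)


def parse_money(text, currency=None):
    """Turn text like '113 million' or '$37.5 million' into a dict."""
    match = NUMBER_RE.search(text)
    if match is None:
        return None
    number, scale = match.groups()
    # Decimal avoids float rounding when scaling values like 9.7
    amount = Decimal(number) * MULTIPLIERS.get((scale or "").lower(), 1)
    # Fall back to a leading currency symbol if no currency was passed in
    if currency is None:
        currency = SYMBOLS.get(text.strip()[:1])
    # Return an int for whole amounts, a float otherwise
    value = int(amount) if amount == amount.to_integral_value() else float(amount)
    return {"amount": value, "currency": currency}


print(parse_money("113 million", currency="SEK"))
print(parse_money("$37.5 million"))
```

In a pipeline, the `currency` argument could come from the `._.currency` attribute and the text from `ent.text`, so the entity boundaries themselves would not need to change.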

Btw, I’m currently writing the spaCy docs for v2.1 and want to include a section on combining statistical models with rules. Would you mind if I used your example or something very similar for this?

Also, totally forgot we had a spaCy code example for entity relation extraction for MONEY entities: see here – looks pretty relevant to you?

Hi @ines

Thanks a lot for the helpful answers you always seem to deliver!

Alright. Good to know. I also read this post where you suggest training a new NER model from scratch. I might take that approach as well - at least test and compare. Is there a way to omit some pretrained labels but keep others?

That is exactly what I need to do and I think I will go for the attributes approach indeed.

Sure. Let me know if I can be of any help. I have lots of data. My data consists of A4-page-like HTML reports, which I have chopped up by parsing the HTML and then adding the parsed content to spaCy. Is it possible to give the whole thing to spaCy and then do some preprocessing to keep track of the origin of the tokens, or should I add that as a combination of custom logic and spaCy?

It doesn't get more relevant than that!

For a sentence like

Sales decreased by 2 percent and totalled EUR 816 (836) million.

I basically want to produce a headline like Sales EUR 816 million. And similarly for

Net income was $9.7 million higher than last year, coming in at $37.5 million.

I want to produce a headline like Net income USD 37.5 million. I imagine the hard part is really parsing the dependency tree in a smart way. The current tree looks like this

Do you have any suggestions on where to focus? I remember coming across that you can create your own type of dependencies as well, but I can’t seem to find it now. Maybe that is a better approach?