I mean, in theory, you could manually label some data that annotates the spans like this, update the model with it and hope that it adjusts to your new concept of MONEY. The question is whether it’s worth it. If the pre-trained model you’re using was trained with an annotation scheme that didn’t consider the currency part of an entity, all its current weights are based on that policy. It might take a lot of work and data to teach it a very different definition of the entity type MONEY.
Edit: Just checked and it seems like the annotation scheme does include the currency by default. It just doesn’t seem to be recognised correctly in this case. I can double-check to see how MONEY was annotated in the corpus we’re using for English.
The alternative would be writing a rule-based component, yes. Here’s an example of something you could do in spaCy by iterating over the entities, looking at the previous token and expanding the span if it has a certain text value:
from spacy.tokens import Span

new_ents = []  # Collect the updated entities here
for ent in doc.ents:
    # Only look at money entities that actually have a previous token
    if ent.label_ == "MONEY" and ent.start > 0:
        prev_token = doc[ent.start - 1]
        if prev_token.text in ('SEK', 'USD', 'EUR'):  # etc.
            # Create a new Span reflecting the expanded entity
            new_entity = Span(doc, ent.start - 1, ent.end, label=ent.label)
            new_ents.append(new_entity)
            continue
    # Keep every other entity unchanged
    new_ents.append(ent)
doc.ents = new_ents
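If you want to try the expansion logic without loading a trained model, here's a minimal sketch. It assumes a blank English pipeline (tokenizer only) and sets the MONEY entity by hand to stand in for what the model predicted. The function name expand_money_entities, the CURRENCIES tuple and the example sentence are all made up for illustration.

```python
import spacy
from spacy.tokens import Span

# Assumption: illustrative list of currency codes, extend as needed
CURRENCIES = ("SEK", "USD", "EUR")

def expand_money_entities(doc):
    """Expand MONEY entities to include a directly preceding currency code."""
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "MONEY" and ent.start > 0:
            prev_token = doc[ent.start - 1]
            if prev_token.text in CURRENCIES:
                new_ents.append(Span(doc, ent.start - 1, ent.end, label=ent.label))
                continue
        new_ents.append(ent)
    doc.ents = new_ents
    return doc

nlp = spacy.blank("en")  # tokenizer only, no statistical model
doc = nlp("The deal was worth SEK 113 million in total")
# Stand-in for a model prediction that missed the currency: "113 million"
doc.ents = [Span(doc, 5, 7, label="MONEY")]

doc = expand_money_entities(doc)
expanded = [(ent.text, ent.label_) for ent in doc.ents]
print(expanded)
```

Since the function takes a doc and returns it, it should also work as a custom pipeline component placed after the entity recognizer. How you register it depends on your spaCy version.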
Another thing to think about: What’s your end goal once you have the money entities? Will you be converting them to some type of structured format like {'amount': 113000000, 'currency': 'SEK'}? If so, it might make more sense to leave the entities the way they are and add custom attributes like ._.currency to them.
from spacy.tokens import Span

def get_currency(span):
    # Take a span (e.g. an entity span) and, if it's MONEY, try to resolve the currency
    if span.label_ == "MONEY" and span.start > 0:
        # Look up the previous token via the span's parent doc
        prev_token = span.doc[span.start - 1]
        if prev_token.text in ('SEK', 'USD', 'EUR'):
            return prev_token.text
    # and so on...

Span.set_extension('currency', getter=get_currency)
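To show how the attribute could feed into a structured format, here's a rough sketch under a few assumptions: a blank English pipeline, a hand-set MONEY entity standing in for the model's prediction, and a made-up "amount_text" key that just keeps the raw entity text. Turning "113 million" into 113000000 would need its own number parsing, which I've left out here. force=True is only there so the snippet can be re-run in the same session.

```python
import spacy
from spacy.tokens import Span

# Assumption: illustrative list of currency codes, extend as needed
CURRENCIES = ("SEK", "USD", "EUR")

def get_currency(span):
    """Return the currency code right before a MONEY span, or None."""
    if span.label_ == "MONEY" and span.start > 0:
        prev_token = span.doc[span.start - 1]
        if prev_token.text in CURRENCIES:
            return prev_token.text
    return None

# force=True overwrites an existing registration instead of raising an error
Span.set_extension("currency", getter=get_currency, force=True)

nlp = spacy.blank("en")  # tokenizer only, no statistical model
doc = nlp("The deal was worth SEK 113 million in total")
# Stand-in for a model prediction without the currency: "113 million"
doc.ents = [Span(doc, 5, 7, label="MONEY")]

# The entity boundaries stay untouched, the currency lives on the attribute
records = [
    {"amount_text": ent.text, "currency": ent._.currency}
    for ent in doc.ents
    if ent.label_ == "MONEY"
]
print(records)
```

The nice thing about this version is that the entity spans stay exactly as the model predicted them, so nothing downstream that relies on doc.ents changes.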
Btw, I’m currently writing the spaCy docs for v2.1 and want to include a section on combining statistical models with rules. Would you mind if I used your example or something very similar for this?
Also, totally forgot we had a spaCy code example for entity relation extraction for MONEY entities: see here – looks pretty relevant to you?