Composite entity/phrase chunks - best practices?


I am working on building a model that detects the age of people. This could take many different forms such as:

people who are 35
people who are between 30 and 40
people under 45
people in their thirties
people in their 40s

I know I can get the numbers easily, but what would be the best practice for making the output consistent? I'm thinking of using an EntityRuler/PhraseMatcher with custom attributes, but I'm curious what the consensus is here.

Ideally, I would do some post-processing that converts the value to a tuple like (30, 39) for "30s", etc., but I'm guessing the model should tell me something like: 30 is the entity, with is_decade or is_bottom_bound as the additional info.
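To make that concrete, here is a rough sketch of the post-processing step I have in mind. The kind labels (`exact`, `decade`, `upper_bound`, `lower_bound`) and the `to_age_range` helper are names made up for illustration, and I'm assuming whole-number ages, inclusive bounds, and `None` for an open end.

```python
from typing import Optional, Tuple

# (low, high), both inclusive; None means the range is open on that side.
AgeRange = Tuple[Optional[int], Optional[int]]


def to_age_range(value: int, kind: str) -> AgeRange:
    """Turn one extracted number plus a hypothetical 'kind' flag into a range."""
    if kind == "exact":        # "people who are 35"
        return (value, value)
    if kind == "decade":       # "people in their 30s"
        return (value, value + 9)
    if kind == "upper_bound":  # "people under 45" (45 itself excluded)
        return (None, value - 1)
    if kind == "lower_bound":  # "people over 45" (45 itself excluded)
        return (value + 1, None)
    raise ValueError(f"unknown kind: {kind!r}")
```

A phrase like "between 30 and 40" carries two numbers, so it would need its own case rather than a single value plus a flag.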

Thanks so much!

This is a good example of a problem that feels like it should be easier than it is. Another thing you could consider using is the dependency matcher, which is described in spaCy's rule-based matching documentation.

Overall I think you probably want rules for this, since hopefully there's only a limited number of constructions that express what you're interested in. So I think you should build yourself a set of test cases you trust to express the phenomenon, along with confusion cases. You might want to depart from the usual ML-based view here and treat these as a test suite you expect to score 100% on, which also lets you see the specific cases that fail.
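As a rough illustration of what such rules and a test suite could look like, here is a sketch that uses plain regular expressions rather than spaCy's matchers, just to stay self-contained. The patterns, the `extract_age_range` and `run_suite` names, the confusion cases, and the choice to read "under 45" as "up to 44" are all my own assumptions, not a recommended rule set.

```python
import re
from typing import Callable, List, Optional, Tuple

# (low, high), both inclusive; None means the range is open on that side.
AgeRange = Tuple[Optional[int], Optional[int]]

WORD_DECADES = {
    "twenties": 20, "thirties": 30, "forties": 40, "fifties": 50,
    "sixties": 60, "seventies": 70, "eighties": 80, "nineties": 90,
}

# Each rule pairs a pattern with a function that builds the range from a match.
RULES: List[Tuple["re.Pattern[str]", Callable[["re.Match[str]"], AgeRange]]] = [
    (re.compile(r"\bbetween (\d{1,3}) and (\d{1,3})\b"),
     lambda m: (int(m[1]), int(m[2]))),
    (re.compile(r"\bunder (\d{1,3})\b"),
     lambda m: (None, int(m[1]) - 1)),
    (re.compile(r"\bin (?:their|his|her) (\d)0s\b"),
     lambda m: (int(m[1]) * 10, int(m[1]) * 10 + 9)),
    (re.compile(r"\bin (?:their|his|her) (" + "|".join(WORD_DECADES) + r")\b"),
     lambda m: (WORD_DECADES[m[1]], WORD_DECADES[m[1]] + 9)),
    (re.compile(r"\b(?:who are|aged) (\d{1,3})\b"),
     lambda m: (int(m[1]), int(m[1]))),
]


def extract_age_range(text: str) -> Optional[AgeRange]:
    """Return the range from the first rule that matches, or None."""
    lowered = text.lower()
    for pattern, build in RULES:
        match = pattern.search(lowered)
        if match:
            return build(match)
    return None


# Trusted positive cases plus confusion cases that must yield None.
CASES: List[Tuple[str, Optional[AgeRange]]] = [
    ("people who are 35", (35, 35)),
    ("people who are between 30 and 40", (30, 40)),
    ("people under 45", (None, 44)),
    ("people in their thirties", (30, 39)),
    ("people in their 40s", (40, 49)),
    ("people who own 35 cars", None),
    ("temperatures in the 40s", None),
    ("room 45 is under renovation", None),
]


def run_suite() -> List[Tuple[str, Optional[AgeRange], Optional[AgeRange]]]:
    """Return (text, expected, actual) for every failing case."""
    failures = []
    for text, expected in CASES:
        actual = extract_age_range(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


if __name__ == "__main__":
    for text, expected, actual in run_suite():
        print(f"FAIL: {text!r}: expected {expected}, got {actual}")
```

The point of `run_suite` returning the failing triples, rather than an accuracy number, is that any result other than an empty list tells you exactly which construction a rule change broke.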

Something that might help, depending on the domain, is a text-classification preprocessing step at the sentence level. This could let you discard a lot of irrelevant sentences that mention numbers. It would be the way to go if you find yourself writing rules like "If the sentence mentions these words somewhere, exclude it".

Thanks so much for the reply! Will look into all these suggestions.