Composite entity/phrase chunks - best practices?


I am working on building a model that detects the age of people. This could take many different forms such as:

people who are 35
people who are between 30 and 40
people under 45
people in their thirties
people in their 40s

I know I can get the numbers easily, but what would be the best practice for making the output consistent? I'm thinking of using an EntityRuler/PhraseMatcher with custom attributes, but I'm curious what the consensus is here.

Ideally, I would do some post-processing that converts the value to a tuple like (30, 39) for "30s", etc., but I'm guessing the model should tell me something like: 30 is the entity, with is_decade or is_bottom_bound as the additional info.

Thanks so much!

This is a good example of a problem that feels like it should be easier than it is. Another thing you could consider using is the dependency matcher, which you can read more about here:

Overall I think you probably want rules for this, since I'd hope there's only a limited number of constructions that express what you're interested in. So I think you should build yourself a set of test cases you trust to cover the phenomenon, along with confusion cases. You might want to depart from the normal ML-based view of this and treat these as a test suite you expect to get 100% on, so you can also see the specific cases that fail.
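To make the rules-plus-test-suite idea concrete, here is a minimal sketch. It uses plain regular expressions as a stand-in for matcher patterns, and the function names (parse_age_range, run_suite), the (low, high) inclusive tuple convention, and the choice of 0 as the lower bound for "under N" are all assumptions for illustration, not anything prescribed by a library.

```python
import re

# Hypothetical mapping from spelled-out decades to their lower bound.
DECADE_WORDS = {
    "twenties": 20, "thirties": 30, "forties": 40, "fifties": 50,
    "sixties": 60, "seventies": 70, "eighties": 80, "nineties": 90,
}

def parse_age_range(text):
    """Return inclusive (low, high) age bounds, or None if no rule matches."""
    t = text.lower()
    # "between 30 and 40"
    m = re.search(r"\bbetween (\d{1,3}) and (\d{1,3})\b", t)
    if m:
        return (int(m.group(1)), int(m.group(2)))
    # "under 45" (assumed to mean 0..44)
    m = re.search(r"\bunder (\d{1,3})\b", t)
    if m:
        return (0, int(m.group(1)) - 1)
    # "in their 40s"
    m = re.search(r"\bin (?:their|his|her) (\d)0s\b", t)
    if m:
        low = int(m.group(1)) * 10
        return (low, low + 9)
    # "in their thirties"
    m = re.search(r"\bin (?:their|his|her) ([a-z]+)\b", t)
    if m and m.group(1) in DECADE_WORDS:
        low = DECADE_WORDS[m.group(1)]
        return (low, low + 9)
    # "who are 35"
    m = re.search(r"\bwho are (\d{1,3})\b", t)
    if m:
        age = int(m.group(1))
        return (age, age)
    return None

# Trusted positive cases plus confusion cases that must NOT match.
CASES = [
    ("people who are 35", (35, 35)),
    ("people who are between 30 and 40", (30, 40)),
    ("people under 45", (0, 44)),
    ("people in their thirties", (30, 39)),
    ("people in their 40s", (40, 49)),
    ("people who bought 35 items", None),
    ("the store opened in the 40s", None),
]

def run_suite(cases=CASES):
    """Return the list of (text, expected, actual) for every failing case."""
    failures = []
    for text, expected in cases:
        actual = parse_age_range(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures

failures = run_suite()
```

An empty failures list means every case passes. These toy rules would still misfire on something like "under 45 minutes", which is exactly the kind of confusion case worth adding to the suite as you find it.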

Something that might help, depending on the domain, is to add a text-classification preprocessing step at the sentence level. This could let you discard a lot of irrelevant sentences that mention numbers. It would be the way to go if you find yourself writing rules like "If the sentence mentions these words somewhere, exclude it".
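As a rough sketch of how that preprocessing step could be wired in: the scoring function is passed in as a callable, so a trained sentence classifier can be swapped in later. Everything here is hypothetical scaffolding (the names split_sentences, relevant_sentences, and toy_score, the 0.5 threshold, and the naive sentence splitter); toy_score in particular is only a placeholder for a real model's probability.

```python
import re
from typing import Callable, List

def split_sentences(text: str) -> List[str]:
    """Naive splitter for illustration; a real pipeline would use a proper one."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def relevant_sentences(text: str,
                       score: Callable[[str], float],
                       threshold: float = 0.5) -> List[str]:
    """Keep only sentences whose relevance score meets the threshold."""
    return [s for s in split_sentences(text) if score(s) >= threshold]

def toy_score(sentence: str) -> float:
    """Placeholder for a trained classifier: 1.0 if a person-like word appears."""
    people_words = {"people", "patients", "adults", "aged", "age"}
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return 1.0 if tokens & people_words else 0.0

text = (
    "We enrolled people who are between 30 and 40. "
    "The trial ran for 45 days. "
    "Shipping takes under 45 minutes."
)
kept = relevant_sentences(text, toy_score)
```

Only the sentences in kept would then be passed on to the age rules, so number mentions like "45 days" never reach them.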

Thanks so much for the reply! I'll look into all of these suggestions.