Composite entity/phrase chunks - best practices?

meatball_nlp · July 1, 2020, 5:50pm

Hello,

I am building a model that detects mentions of people's ages. These can take many different forms, such as:

people who are 35
people who are between 30 and 40
people under 45
people in their thirties
people in their 40s

I know I can get the numbers easily, but what would be the best practice for making the output consistent? I'm thinking of using an EntityRuler/PhraseMatcher with custom attributes, but I'm curious what the consensus is here.

Ideally, I would do some post-processing that converts the value to a tuple like (30, 39) for "30s", etc., but I'm guessing the model should tell me something like: 30 is the entity, with is_decade or is_bottom_bound as the additional info.

Thanks so much!

honnibal · July 6, 2020, 10:27am

This is a good example of a problem that feels like it should be easier than it is. Another thing you could consider using is the dependency matcher, which you can read more about here: http://markneumann.xyz/blog/dependency_matcher/

Overall I think you probably want to have rules for this, as I hope there's only a limited number of constructions that express what you're interested in. So I think you should build yourself a set of test cases you trust to express the phenomenon, along with confusion cases. You might want to depart from the normal ML-based view of this and treat these as a test suite you expect to get 100% on (so you can also see the specific cases that fail).
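As a rough illustration of the rules-plus-test-suite idea, here is a minimal sketch using plain Python regular expressions rather than spaCy's EntityRuler or dependency matcher, so it runs without a model. Everything in it is an assumption made for illustration: the names RULES, DECADES, age_range and failing_cases are hypothetical, "under 45" is read as an upper bound of 44, and the open-ended bounds of 0 and 120 are arbitrary. The positive cases are the phrasings from the original question; the confusion cases are made up.

```python
import re

# Assumed bounds for open-ended ranges; not from the thread.
MIN_AGE, MAX_AGE = 0, 120

# Spelled-out decades mapped to their lower bound.
DECADES = {
    "twenties": 20, "thirties": 30, "forties": 40, "fifties": 50,
    "sixties": 60, "seventies": 70, "eighties": 80, "nineties": 90,
}

# Each rule pairs a pattern with a function turning the match into (low, high).
RULES = [
    (re.compile(r"\bbetween (\d{1,3}) and (\d{1,3})\b"),
     lambda m: (int(m.group(1)), int(m.group(2)))),
    (re.compile(r"\bunder (\d{1,3})\b"),
     lambda m: (MIN_AGE, int(m.group(1)) - 1)),
    (re.compile(r"\bin their (\d0)s\b"),
     lambda m: (int(m.group(1)), int(m.group(1)) + 9)),
    (re.compile(r"\bin their (" + "|".join(DECADES) + r")\b"),
     lambda m: (DECADES[m.group(1)], DECADES[m.group(1)] + 9)),
    (re.compile(r"\bwho are (\d{1,3})\b"),
     lambda m: (int(m.group(1)), int(m.group(1)))),
]


def age_range(text):
    """Return (low, high) from the first rule that matches, or None."""
    lowered = text.lower()
    for pattern, to_range in RULES:
        match = pattern.search(lowered)
        if match:
            return to_range(match)
    return None


# Cases that must match, plus confusion cases that must NOT match (None).
TEST_CASES = [
    ("people who are 35", (35, 35)),
    ("people who are between 30 and 40", (30, 40)),
    ("people under 45", (0, 44)),
    ("people in their thirties", (30, 39)),
    ("people in their 40s", (40, 49)),
    ("we surveyed 35 people", None),
    ("the meeting lasted 45 minutes", None),
]


def failing_cases():
    """List (text, expected, actual) for every case the rules get wrong."""
    return [(text, expected, age_range(text))
            for text, expected in TEST_CASES
            if age_range(text) != expected]


if __name__ == "__main__":
    for failure in failing_cases():
        print("FAILED:", failure)
```

The point of the suite is that it should stay at zero failures as rules are added. A phrase like "under 45 minutes" would currently be misread as an age, which is the kind of confusion case worth adding to the list before writing the rule that excludes it.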

Something that might help, depending on the domain, is to use a text-classification preprocessing step at the sentence level. This might help you discard a lot of irrelevant sentences that mention numbers. This would be the way to go if you find yourself writing rules like "If the sentence mentions these words somewhere, exclude it".

meatball_nlp · July 6, 2020, 1:17pm

Thanks so much for the reply! Will look into all these suggestions.
