Interested in subset of default ORG/PRODUCT/LOCATION entities

I'm facing the following problem, which I'm currently trying to solve with Prodigy/spaCy.
I'm interested in extracting entities that the default spaCy models already provide (e.g. ORG and LOCATION) from news articles. This is quite a common scenario, with the slight caveat that I only want to extract the main entities the article is talking about (the entities that are mainly affected, etc.).
I wouldn't want to treat them as new entities and start with an empty model, as the pretrained transformer is already quite good at recognizing those classes. I tried that a bit, but the results don't look promising. Could you suggest a direction for this kind of problem?

Hi! Not starting from scratch is definitely a good instinct. If you're only interested in a subset of mentions, the model might also struggle to generalise if you just labelled the spans of interest, especially if nothing inherent in the context distinguishes an ORG you're interested in from an ORG you're not interested in.

I think it comes down to defining how the fact that an entity is the "main entity that the article is talking about" is represented in the text. Is frequency a good indicator? You could also combine this with entity normalization (e.g. using a rule-based approach) or entity linking (training an entity linking model) if you deal with a lot of different entities and spellings that refer to the same concept (e.g. different variations of a company name). If your articles deal with different topics, another approach could be to train a text classifier, start by grouping the texts by topic, and then use that information to determine which entities are likely the most relevant.
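
To make the frequency idea a bit more concrete, here's a rough sketch that counts normalized mentions per article and keeps the most frequent ones. It's only an illustration under a few assumptions: the `ALIASES` table, the function names, and the `top_n` / `min_count` thresholds are all made up for this example, and the label set assumes the English pretrained pipelines, where locations show up as `GPE` and `LOC`. Whether frequency is a good enough signal is something you'd have to check on your own data.

```python
from collections import Counter

# Hypothetical alias table for rule-based normalization.
# In practice you'd build this from your own data.
ALIASES = {
    "acme corp.": "acme",
    "acme corporation": "acme",
}


def normalize(name):
    """Lowercase, collapse whitespace and map known aliases to one key."""
    key = " ".join(name.lower().split())
    return ALIASES.get(key, key)


def main_entities(mentions, labels=("ORG", "GPE", "LOC", "PRODUCT"),
                  top_n=3, min_count=2):
    """Rank (text, label) mentions by how often they occur in one article.

    Returns up to top_n (name, label, count) tuples, most frequent first,
    dropping anything mentioned fewer than min_count times.
    """
    counts = Counter(
        (normalize(text), label)
        for text, label in mentions
        if label in labels
    )
    return [
        (name, label, n)
        for (name, label), n in counts.most_common(top_n)
        if n >= min_count
    ]


def spacy_mentions(text, model="en_core_web_sm"):
    """Collect (text, label) pairs from a pretrained spaCy pipeline.

    Assumes spaCy and the named model package are installed.
    """
    import spacy  # imported lazily so the ranking works without spaCy

    nlp = spacy.load(model)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]
```

You'd then call something like `main_entities(spacy_mentions(article_text))` for each article. Keeping the ranking separate from the pipeline means you can later swap the frequency count for a different signal (e.g. position in the article or the topic from a text classifier) without touching the pretrained NER model.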
