Although I have a background in computer science, I am rather new to machine learning, so forgive me if my question sounds a bit silly.
In NER, how important is the actual word (and the characters it consists of) that has been labeled?
Does most of the magic in NER depend on the words surrounding the labeled word?
I am asking this since I would like to find an easier way to synthesize annotated training text.
(Maybe just use the word "PERSON" whenever there should be a person instead of making up names)
Hi! This is a fair question. If you're doing NER, you're typically making predictions about a token given the token's features and the surrounding tokens on either side. Different model implementations can use different context windows (how many tokens on either side to consider) and different strategies for building the token representations.
For instance, spaCy's default NER implementation will take the token's norm (normalised text), prefix and suffix (n characters at the start and end of the token text), the token's shape (abstract representation of character features, like Xxxx) and, if available, the token's word vector. The features of the tokens that are part of an entity can hold very important clues – for instance, in English, capitalisation is a strong indicator for many entity types. Or a token with the norm amazon is somewhat likely to be an entity, and even more likely if it's surrounded by certain other tokens.
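To make those features concrete, here's a minimal sketch inspecting them with spaCy. It uses a blank English pipeline (no trained model needed), since the norm, prefix, suffix, and shape are lexical attributes available straight from the tokenizer; the example sentence is made up.

```python
import spacy

# Blank pipeline: just the tokenizer, no trained components required
nlp = spacy.blank("en")
doc = nlp("Amazon hired Jane Doe.")

for token in doc:
    # norm_   : normalised (lowercased) text
    # prefix_ : leading character(s) of the token text
    # suffix_ : trailing characters of the token text
    # shape_  : abstract character features, e.g. "Xxxx" for a capitalised word
    print(token.text, token.norm_, token.prefix_, token.suffix_, token.shape_)
```

Note how "Amazon" and "Jane" share the capitalised shape pattern even though the surface strings differ; that's exactly the kind of clue the model picks up on.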
tl;dr: Yes, the tokens inside an entity and their features matter just as much as the tokens surrounding the entity. So you should use actual names in your synthetic data and not just a generic string like "PERSON" (which would give you amazing results, but a model that's likely pretty useless in practice).
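If it helps, here's one hedged sketch of what "use actual names" could look like when synthesizing data: fill a template with varied realistic names and record the character offsets in spaCy's training-data format. The name list and template below are made up for illustration.

```python
import random

# Hypothetical pools of realistic names and sentence templates
NAMES = ["Jane Doe", "Ravi Patel", "Maria García", "Chen Wei"]
TEMPLATE = "{name} joined the company in 2019."

def make_example(rng):
    """Build one (text, annotations) pair with character offsets."""
    name = rng.choice(NAMES)
    text = TEMPLATE.format(name=name)
    start = text.index(name)
    # spaCy-style annotation: entities as (start_char, end_char, label)
    return text, {"entities": [(start, start + len(name), "PERSON")]}

rng = random.Random(0)
print(make_example(rng))
```

The key point is the variety: the model sees many different name shapes and spellings in the same slot, instead of one constant placeholder string it could trivially memorise.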
Btw, if you're interested in more details on spaCy's NER model implementations and the ideas behind them, check out this video:
Thank you very much for the quick and helpful reply!
This makes a lot of sense.
I will probably need to spend more time on creating synthetic data than I thought.
Thank you also for pointing me to the video - I will definitely have a look at it.