How important is the actual labeled word in NER?

Hi there,

Although I have a background in computer science, I am rather new to machine learning, so forgive me if my question sounds a bit silly.
In NER, how important is the actual word (and the characters it consists of) that has been labeled?
Does most of the magic in NER depend on the words surrounding the labeled word?
I am asking because I would like to find an easier way to synthesize annotated training text.
(Maybe just use the word "PERSON" whenever there should be a person instead of making up names)

Thanks in advance

Hi! This is a fair question :slightly_smiling_face: If you're doing NER, you're typically making predictions about a token given the token's features and the surrounding tokens on either side. Different model implementations can use different context windows (tokens on either side to consider) and different strategies for building the token representations.
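
To make the "context window" idea concrete, here's a toy sketch (not spaCy's actual implementation): the model's view of a token is the token itself plus a fixed number of neighbours on each side, padded at the boundaries.

```python
def context_window(tokens, i, size=2):
    # The model's "view" of token i: the token itself plus up to
    # `size` neighbours on each side, padded at sentence boundaries.
    pad = ["<PAD>"]
    left = tokens[max(0, i - size):i]
    right = tokens[i + 1:i + 1 + size]
    return pad * (size - len(left)) + left + [tokens[i]] + right + pad * (size - len(right))

print(context_window(["My", "name", "is", "Ines", "."], 3))
# ['name', 'is', 'Ines', '.', '<PAD>']
```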

For instance, spaCy's default NER implementation will take the token's norm (normalised text), prefix and suffix (n characters at the start and end of the token text), the token's shape (abstract representation of character features, like Xxxx) and, if available, the token's word vector. The features of the tokens that are part of an entity can hold very important clues – for instance, in English, capitalisation is a strong indicator for many entity types. Or a token with the norm amazon is somewhat likely to be an entity, and even more likely if it's surrounded by certain other tokens.
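
You can inspect these features directly on any `Token`. A minimal sketch, assuming the small English pipeline `en_core_web_sm` is installed:

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Amazon announced a new office in Berlin.")

for token in doc:
    # norm_: normalised text, prefix_/suffix_: leading/trailing characters,
    # shape_: abstract character features like "Xxxxx"
    print(token.text, token.norm_, token.prefix_, token.suffix_, token.shape_)
```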

tl;dr: Yes, the tokens inside an entity and their features matter just as much as the tokens surrounding it. So you should use actual names in your synthetic data and not just a generic string like "PERSON" (which would give you amazing-looking results on that data, but a model that's likely pretty useless in practice).
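
For the synthetic data itself, one common pattern is to fill templates with realistic names rather than a placeholder. A rough sketch (the template and name list here are made up for illustration), producing annotations with character offsets in spaCy's `(text, {"entities": [...]})` training format:

```python
import random

# Illustrative only: template and names are hypothetical examples.
NAMES = ["Alice Smith", "Rahul Gupta", "María García"]
TEMPLATE = "Yesterday {name} signed the contract."

def make_example():
    name = random.choice(NAMES)
    text = TEMPLATE.format(name=name)
    start = text.index(name)
    # spaCy-style annotation: (text, {"entities": [(start, end, label)]})
    return text, {"entities": [(start, start + len(name), "PERSON")]}

print(make_example())
```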

Btw, if you're interested in more details on spaCy's NER model implementation and the ideas behind it, check out this video:


Hi Ines,

Thank you very much for the quick and helpful reply!
This makes a lot of sense.
I will probably need to spend more time on creating synthetic data than I thought :slightly_smiling_face:
Thanks also for the pointer to the video - I will definitely have a look at it.
