Taking metadata into consideration when training/extracting named entities?

Is it possible to take metadata into consideration when training, and later when extracting, named entities from text?

For example, if I want to train a model to extract named entities from YouTube video titles, the channel each title belongs to may play an important role in how the title text was constructed (i.e. the channel owner has certain, sometimes subtle, styles and patterns they use for their titles), and therefore affects how the named entities in the title should be understood.

Is there any way to incorporate this meta information (i.e. the channel name or ID) into the training process, and later into actual usage for entity extraction? Or should I artificially inject this information into the title text (with a defined pattern, so I can reconstruct the original text after NER)? What's the best way to use this metadata to improve NER results?

Thanks!

Thanks for your question! This thread on training the entity recognizer with custom attributes discusses solutions for a similar problem (although slightly more complex). If you haven't seen it already, you might also find this video useful, which discusses spaCy's NER model architecture in detail.

By default, the entity recognizer uses the lexical attributes NORM, PREFIX, SUFFIX and SHAPE as features. So actually incorporating custom attributes will take some work, because you'll have to use your own fork of the model. A simpler solution could be to try "hijacking" one of those features, e.g. the SHAPE, and add your meta to it:

from spacy.attrs import SHAPE

# add_meta is a function you define yourself that returns your meta marker
get_shape = nlp.vocab.lex_attr_getters[SHAPE]
nlp.vocab.lex_attr_getters[SHAPE] = lambda string: get_shape(string) + add_meta(string)
for lex in nlp.vocab:
    # Update any cached values
    lex.shape_ = nlp.vocab.lex_attr_getters[SHAPE](lex.orth_)
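To be clear, add_meta isn't part of spaCy's API – it's a helper you'd write yourself. Here's one minimal sketch, assuming you keep a lookup table mapping token strings to channel-specific markers (all names and markers here are made up for illustration):

```python
# Hypothetical add_meta helper: map token strings to channel markers.
# Tokens not in the table contribute no extra marker to the shape.
CHANNEL_MARKERS = {
    "unboxing": "+CH1",  # e.g. a token typical of channel 1's titles
    "vlog": "+CH2",
}

def add_meta(string):
    return CHANNEL_MARKERS.get(string.lower(), "")
```

In practice you'd probably key this on something more robust than individual token strings, but the important part is that the function is deterministic, so training and runtime see the same features.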

The downside of this approach is that the other components of the pre-trained models won't work out-of-the-box anymore, since the parser, tagger and entity recognizer all use the same features. (For example, if the shape is all different and uses your meta data, the tagger and parser will likely predict nonsense.) But if this is not important for your use case, you can definitely give this a try.

Yes, that's also an option! The only thing that's important is that your training and runtime inputs match – but it sounds like your application gives you a lot of flexibility here, because you can control the format of the video titles you read in. So you'd only have to make sure that your application agrees on a consistent scheme of, say, [CHANNEL] – [USER] – [TITLE] and always converts the data you'll use the model on at runtime to this format. If you need more context, you could even try incorporating the video description – assuming it's possible to scrape that from YouTube and the data is not too noisy and actually relevant. (If the descriptions mostly consist of links and a bunch of SUBSCRIBE PLZ!!1!, it might be counterproductive – but you never know.)
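As a sketch of what such a scheme could look like (the function names and field sources are hypothetical – you'd plug in whatever your data pipeline provides):

```python
# Apply the SAME formatting to training data and runtime input,
# following the [CHANNEL] – [USER] – [TITLE] scheme from above.
def format_title(channel, user, title):
    return f"[{channel}] – [{user}] – [{title}]"

def extract_title(formatted):
    # Recover the original title after NER, relying on the fixed pattern:
    # split off everything before the last "] – [" and drop the final "]".
    return formatted.rsplit("] – [", 1)[-1][:-1]
```

Because the pattern is fixed, you can always map entity offsets back to the original title if you need to.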

Another thing that sometimes helps is to chain models together – for example, start off by training a text classifier to pre-select the examples. Even fairly subjective distinctions like "stuff I want" vs. "stuff I don't want" can often work pretty well. Ultimately, it all comes down to experimenting with what works best on your specific data – we hope Prodigy makes it easy to try out different approaches quickly. Your use case sounds very cool, so definitely keep us updated on the progress! :+1:
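The chaining idea can be as simple as a filter in front of the entity extraction step. Here's a rough sketch with stand-ins for the trained components (wanted and extract_entities are placeholders for a real text classifier and NER model, not actual APIs):

```python
# Chain a pre-filter classifier in front of NER: only titles the
# classifier accepts are passed on to entity extraction.
def wanted(text):
    # Stand-in for a trained text classifier ("stuff I want" vs. not).
    return "unboxing" in text.lower()

def extract_entities(text):
    # Stand-in for the trained NER model; returns a list of entities.
    return []

def process(titles):
    return [(t, extract_entities(t)) for t in titles if wanted(t)]
```

Even a crude pre-filter like this can cut down the noise the NER model has to deal with, and you can swap in the real models later without changing the pipeline's shape.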

This is causing a RecursionError: maximum recursion depth exceeded for me when I pass some text to nlp, i.e. doc = nlp(u'give me an awesome pattern').

Sorry, I just copied over @honnibal's code from the other thread and didn't test it again. Maybe the function isn't actually bound to get_shape here, so the lexical attribute getter ends up being set to a function that calls itself, and so on.

Instead, you should be able to just import the function that creates the shape directly:

from spacy.attrs import SHAPE
from spacy.lang.lex_attrs import word_shape

nlp.vocab.lex_attr_getters[SHAPE] = lambda string: word_shape(string) + add_meta(string)

Or you could leave out the original shape completely and just replace it with your meta value (since you're modifying it anyway).
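If it helps to see the idea end to end without touching spaCy internals, here's a self-contained sketch. Note that simple_shape only loosely mimics spaCy's actual word_shape function (which, among other things, truncates long character runs), and the meta markers are invented:

```python
# A simplified word shape: uppercase -> X, lowercase -> x, digit -> d,
# everything else kept as-is. NOT identical to spaCy's word_shape.
def simple_shape(string):
    out = []
    for char in string:
        if char.isdigit():
            out.append("d")
        elif char.isupper():
            out.append("X")
        elif char.isalpha():
            out.append("x")
        else:
            out.append(char)
    return "".join(out)

def shape_with_meta(string, meta=""):
    # Append the meta marker to the shape – or return meta alone if you
    # decide to replace the shape entirely.
    return simple_shape(string) + meta
```

This makes it easy to experiment with "shape plus meta" vs. "meta only" before committing to patching the vocab.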

Thanks for the detailed response, that was really helpful.

Yes, since I can format the text in both directions when training and later using the model, this indeed looks like the better option. Been experimenting with this a little and it shows promise.


There are added complications if add_meta(string) uses a Matcher. For my use case, I was using a pattern file to detect spans that match the pattern and replace their corresponding shape with a special tag. Unfortunately, to match the pattern you often need to look at the shape – leading to another case of infinite recursion!
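One general way out of that loop is to capture the original shape function before patching, and have anything inside add_meta (Matcher-based or otherwise) use that saved reference rather than the patched getter. The names below are illustrative, not spaCy's API – this just shows the pattern:

```python
# Keep a reference to the ORIGINAL shape function before patching, and
# pass it explicitly into add_meta, so any shape lookups add_meta
# performs cannot re-enter the patched getter.
def make_patched_shape(original_shape, add_meta):
    def patched(string):
        return original_shape(string) + add_meta(string, original_shape)
    return patched
```

You'd then install the result of make_patched_shape as the getter; since add_meta only ever sees original_shape, the recursion is broken by construction.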