Taking meta data into consideration when training/extracting Named Entities?

Thanks for your question! This thread on training the entity recognizer with custom attributes discusses solutions for a similar problem (although slightly more complex). If you haven't seen it already, you might also find this video useful, which discusses spaCy's NER model architecture in detail.

By default, the entity recognizer uses the lexical attributes NORM, PREFIX, SUFFIX and SHAPE as features. So actually incorporating custom attributes will take some work, because you'll have to use your own fork of the model. A simpler solution could be to try "hijacking" one of those features, e.g. the SHAPE, and add your meta to it:

import spacy
from spacy.attrs import SHAPE

nlp = spacy.load("en_core_web_sm")

def add_meta(string):
    # Your own logic here: return a string encoding the meta data
    return ""

get_shape = nlp.vocab.lex_attr_getters[SHAPE]
nlp.vocab.lex_attr_getters[SHAPE] = lambda string: get_shape(string) + add_meta(string)
for lex in nlp.vocab:
    # Update any cached values so they reflect the new SHAPE getter
    lex.shape_ = nlp.vocab.lex_attr_getters[SHAPE](lex.orth_)

The downside of this approach is that the other components of the pre-trained models won't work out-of-the-box anymore, since the parser, tagger and entity recognizer all use the same features. (For example, if the shape values are all different because they now include your meta data, the tagger and parser will likely predict nonsense.) But if this is not important for your use case, you can definitely give this a try.

Yes, that's also an option! The only thing that's important is that your training and runtime inputs match – but it sounds like your application gives you a lot of flexibility here, because you can control the format of the video titles you read in. So you'd only have to make sure that your application agrees on a consistent scheme of, say, [CHANNEL] – [USER] – [TITLE], and always converts the data you'll use the model on at runtime to that format. If you need more context, you could even try incorporating the video description – assuming it's possible to scrape that from YouTube and the data is not too noisy and actually relevant. (If the descriptions mostly consist of links and a bunch of SUBSCRIBE PLZ!!1!, it might be counterproductive – but you never know.)
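To illustrate the idea, a minimal normalization helper could look like this – the field names and the [CHANNEL] – [USER] – [TITLE] layout are just illustrative assumptions, the important part is that the exact same function produces both your training examples and your runtime inputs:

```python
# Sketch: normalize raw video meta data into one consistent string scheme.
# The separator and bracket layout here are assumptions – pick whatever
# scheme you like, as long as training and runtime use the same one.

def format_video(channel: str, user: str, title: str) -> str:
    return f"[{channel}] – [{user}] – [{title}]"

# Annotate strings produced by format_video at training time, and run
# nlp(format_video(...)) on incoming data at runtime.
```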

Another thing that sometimes helps is to chain models together – for example, start off by training a text classifier to pre-select the examples. Even fairly subjective distinctions like "stuff I want" vs. "stuff I don't want" can often work pretty well. Ultimately, it all comes down to experimenting with what works best on your specific data – we hope Prodigy makes it easy to try out different approaches quickly. Your use case sounds very cool, so definitely keep us updated on the progress! :+1:
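As a rough sketch of the chaining idea (the "RELEVANT" label and the scoring function are made-up placeholders – with a real pipeline, the score would come from a trained text classifier via `doc.cats`):

```python
# Sketch: pre-select examples with a classifier before running the
# entity recognizer on them. The label name and threshold are assumptions.

def preselect(texts, score_fn, threshold=0.5):
    """Keep only the texts the classifier scores as relevant enough."""
    return [text for text in texts if score_fn(text) >= threshold]

# With a trained spaCy textcat model, score_fn could be something like:
#   lambda text: nlp(text).cats.get("RELEVANT", 0.0)
```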