Multi-word entity seeding, entity context

Thanks a lot @honnibal - I modified that to not worry about lex.is_alpha or lex.is_lower, and terms.teach is now doing a great job suggesting multi-word tokens in my custom model's vocabulary that are made of merged entities (band and musician names, in my case). I went through and quickly generated a few hundred annotations that I exported as a patterns file. However, when I then tried to use the patterns file with ner.teach, it seemed to be suggesting everything except what I wanted for my label BAND. Here is the format of entries in my patterns file:

{"label":"BAND","pattern":[{"lower":"LCD Soundsystem"}]}
{"label":"BAND","pattern":[{"lower":"Pulp"}]}
{"label":"BAND","pattern":[{"lower":"Gary Numan"}]}

As I go through my source file, ner.teach is suggesting plenty of multi-word spans - they seem to be spans that are not merged multi-word tokens but rather spans of multiple tokens. The model I am using with ner.teach uses the same pipeline that preprocessed all my text before I created my custom vectors, so LCD Soundsystem should be treated as one token when it appears (along with all other band and musician names in my vocabulary). I went through over 500 annotations with ner.teach using this patterns file and rejected every single one - it appears to be offering me every token (and many multi-token spans) in my source text except for the multi-word tokens representing band names. I can't imagine that's the expected behavior. I tried this solution offered here, but since it didn't make a difference I'm guessing this has been addressed already (I'm using Prodigy v1.8.4).

Here are my hypotheses for what might be going wrong - please let me know if one of these sounds right to you or if you have other ideas:

  • There is some issue with me adding my custom vectors to en_core_web_sm and I should just add them to a blank model (seems unlikely since there are no existing vectors that they might be clashing with).
  • There is an issue with me trying to represent musician names (which could also be labeled PERSON) as well as band names (which are mostly being picked up by the NER part of my pipeline but mislabeled since BAND doesn't exist yet) with the same NER label BAND. I don't think this would result in both of those BAND "types" being ignored during ner.teach though, given that there are hundreds of examples of both in my patterns file.
  • Maybe there is something wrong with the format of my patterns file and I need to modify ner.to-patterns. The above example looks good to me though - my source file is a corpus of music journalism, so bands are consistently punctuated and capitalized. LCD Soundsystem should always appear in exactly this form in my corpus, and the pipeline should preprocess it into one token.
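To help rule the third hypothesis in or out, here is a minimal sketch of a sanity check that could be run over the patterns file. It rests on two assumptions about how token patterns are matched: that the "lower" attribute is compared against the lowercased token text (so a value containing capitals could never match), and that each dict inside "pattern" describes exactly one token. The names check_patterns and expected_label are made up for illustration and are not part of Prodigy or spaCy.

```python
import json


def check_patterns(lines, expected_label="BAND"):
    """Return (line_number, message) pairs for entries that look suspicious."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        label = entry.get("label")
        if label != expected_label:
            problems.append((number, f"label {label!r} is not {expected_label!r}"))
        pattern = entry.get("pattern")
        if not isinstance(pattern, list):
            continue  # plain-string patterns are not inspected here
        for token in pattern:
            for key, value in token.items():
                # Only string values under a "lower"/"LOWER" key are checked.
                if key.lower() != "lower" or not isinstance(value, str):
                    continue
                if value != value.lower():
                    problems.append((number, f"'lower' value {value!r} contains capitals"))
                if " " in value:
                    problems.append((number, f"'lower' value {value!r} spans whitespace in one token dict"))
    return problems


# First entry is copied from my patterns file; the second is a hypothetical
# lowercased variant for comparison.
sample = [
    '{"label":"BAND","pattern":[{"lower":"LCD Soundsystem"}]}',
    '{"label":"BAND","pattern":[{"lower":"pulp"}]}',
]
for number, message in check_patterns(sample):
    print(number, message)
```

If entries like the first one get flagged, the obvious next experiment would be to lowercase the values in the patterns file and see whether ner.teach starts surfacing band names.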

It seems like ner.teach should have at least accidentally suggested a band name at this point, and the fact that it appears to be skipping over the actual band names is really confusing. I'm at a loss as to what's happening here, so I'd appreciate your help.