Thanks a lot @honnibal - I modified that to not worry about lex.is_alpha or lex.is_lower, and terms.teach is now doing a great job suggesting multi-word tokens in my custom model's vocabulary that are made of merged entities (band and musician names, in my case). I went through and quickly generated a few hundred annotations that I exported as a patterns file. However, when I then tried to use the patterns file with ner.teach, it seemed to be suggesting everything except what I wanted for my label BAND. Here is the format of entries in my patterns file:
{"label":"BAND","pattern":[{"lower":"LCD Soundsystem"}]}
{"label":"band","pattern":[{"lower":"Pulp"}]}
{"label":"band","pattern":[{"lower":"Gary Numan"}]}
As I go through my source file, ner.teach is suggesting plenty of multi-word spans - they seem to be spans that are not merged multi-word tokens but rather spans of multiple tokens. The model I am using with ner.teach uses the same pipeline that preprocessed all my text before I created my custom vectors, so LCD Soundsystem should be treated as one token when it appears (along with all other band and musician names in my vocabulary). I went through over 500 annotations with ner.teach using this patterns file and rejected every single one - it appears to be offering me every token (and many multi-token spans) in my source text except for the multi-word tokens representing band names. I can't imagine that's the expected behavior. I tried this solution offered here, but since it didn't make a difference I'm guessing this has been addressed already (I'm using Prodigy v1.8.4).
Here are my hypotheses for what might be going wrong - please let me know if one of these sounds right to you or if you have other ideas:
- There is some issue with me adding my custom vectors to en_core_web_sm and I should just add them to a blank model (seems unlikely since there are no existing vectors that they might be clashing with).
- There is an issue with me trying to represent musician names (which could also be labeled PERSON) as well as band names (which are mostly being picked up by the NER part of my pipeline but mislabeled since BAND doesn't exist yet) with the same NER label BAND. I don't think this would result in both of those BAND "types" being ignored during ner.teach though, given that there are hundreds of examples of both in my patterns file.
- Maybe there is something wrong with the format of my patterns file and I need to modify ner.to-patterns. The above example looks good to me though - my source file is a corpus of music journalism, so bands are consistently punctuated and capitalized. The example of LCD Soundsystem should always be the way this appears in my corpus and the pipeline should preprocess it to be one token.
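
One way the third hypothesis could be tested is with a small script along these lines. It is only a sketch and rests on two assumptions that nothing in this thread confirms: that the lower token attribute is compared against the lowercased token text (so a value containing capital letters could never match), and that labels are case-sensitive. The function name check_patterns is made up for illustration and is not part of Prodigy or spaCy.

```python
import json

# Sample entries copied from the patterns file shown earlier in this post.
SAMPLE = '''{"label":"BAND","pattern":[{"lower":"LCD Soundsystem"}]}
{"label":"band","pattern":[{"lower":"Pulp"}]}
{"label":"band","pattern":[{"lower":"Gary Numan"}]}'''


def check_patterns(lines):
    """Return (line_number, message) pairs for entries that look suspicious.

    Hypothetical helper; line number 0 is used for file-level findings.
    """
    problems = []
    labels = set()
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        labels.add(entry["label"])
        pattern = entry["pattern"]
        if not isinstance(pattern, list):
            # Only token-based patterns (lists of dicts) are inspected here.
            continue
        for token in pattern:
            value = token.get("lower")
            # Assumption: "lower" is matched against lowercased token text.
            if isinstance(value, str) and value != value.lower():
                problems.append((number, f"'lower' value has capitals: {value!r}"))
    # Assumption: labels are case-sensitive, so BAND and band would differ.
    if len(labels) != len({label.lower() for label in labels}):
        problems.append((0, f"labels differ only by case: {sorted(labels)}"))
    return problems


for number, message in check_patterns(SAMPLE.splitlines()):
    print(number, message)
```

If those assumptions hold, any entry this reports would be worth regenerating with a lowercased value and a consistently cased label before running ner.teach again. It does not check tokenization, so whether LCD Soundsystem really arrives as a single token would still need to be verified separately.
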
It seems like ner.teach should have at least accidentally suggested a band name at this point, and the fact that it appears to be skipping over the actual band names is really confusing. I'm at a loss as to what's happening here, so I'd appreciate your help.