I am trying to add an entity type, EY (“ethnicity”), to an existing NER model using the ner.teach + patterns.jsonl approach. Since our domain is the life sciences, the purpose is to distinguish NORP from true ethnicity mentions that might be biologically or culturally relevant (e.g., “The French [NORP] Academy of Blah, Blah, Blah” vs. “French [EY] patients with Disease X”, respectively). So I created a seed set of patterns of the following form:
{"label":"EY","pattern":[{"lower":"african"},{"lower":"-"},{"lower":"american"}]}
{"label":"EY","pattern":[{"lower":"african"},{"lower":"-"},{"lower":"americans"}]}
{"label":"EY","pattern":[{"lower":"african"},{"lower":"american"}]}
{"label":"EY","pattern":[{"lower":"african"},{"lower":"americans"}]}
{"label":"EY","pattern":[{"lower":"african"},{"lower":"british"}]}
{"label":"EY","pattern":[{"lower":"african"}]}
{"label":"EY","pattern":[{"lower":"akan"}]}
{"label":"EY","pattern":[{"lower":"alangan"}]}
{"label":"EY","pattern":[{"lower":"alaskan"}]}
{"label":"EY","pattern":[{"lower":"albanian"},{"lower":"american"}]}
{"label":"EY","pattern":[{"lower":"albanian"},{"lower":"british"}]}
...[and many, many more]
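To sanity-check a pattern file like this independently of Prodigy, a short stdlib-only Python sketch along the following lines might help. It is only an approximation: the regex `tokenize` function is a crude stand-in for spaCy's tokenizer (an assumption, not what ner.teach actually runs), it only understands the "lower" attribute used above, and the sample sentence is made up for illustration.

```python
import json
import re


def load_patterns(lines):
    """Parse JSONL pattern lines, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def tokenize(text):
    # Assumption: a crude stand-in for spaCy's tokenizer that splits
    # text into word tokens and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def find_matches(patterns, text):
    """Return (label, matched_text) for every token-level pattern hit."""
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]
    hits = []
    for entry in patterns:
        # Only the "lower" attribute is supported in this sketch.
        wanted = [tok["lower"] for tok in entry["pattern"]]
        n = len(wanted)
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == wanted:
                hits.append((entry["label"], " ".join(tokens[i:i + n])))
    return hits


# Three of the patterns from the file above, inlined for the example.
sample_patterns = [
    '{"label":"EY","pattern":[{"lower":"african"},{"lower":"-"},{"lower":"american"}]}',
    '{"label":"EY","pattern":[{"lower":"african"},{"lower":"american"}]}',
    '{"label":"EY","pattern":[{"lower":"african"}]}',
]
patterns = load_patterns(sample_patterns)
hits = find_matches(patterns, "African-American patients with Disease X")
print(hits)
```

If the patterns hit here but ner.teach still shows unrelated examples, that would at least suggest the pattern contents themselves are not the problem.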
When I run ner.teach using this pattern file with the flag --label EY, I get basically random examples, meaning that it isn’t using the pattern matches. The initial model does not have this label (nor even NORP, for that matter), so random suggestions from the model itself would make sense, except that I thought this recipe was supposed to find pattern matches (exactly for this case of adding a new entity type!). Anyhow, on a hunch I changed the label to lower-case like so:
{"label":"ey","pattern":[{"lower":"african"},{"lower":"-"},{"lower":"american"}]}
{"label":"ey","pattern":[{"lower":"african"},{"lower":"-"},{"lower":"americans"}]}
{"label":"ey","pattern":[{"lower":"african"},{"lower":"american"}]}
{"label":"ey","pattern":[{"lower":"african"},{"lower":"americans"}]}
{"label":"ey","pattern":[{"lower":"african"},{"lower":"british"}]}
{"label":"ey","pattern":[{"lower":"african"}]}
{"label":"ey","pattern":[{"lower":"akan"}]}
{"label":"ey","pattern":[{"lower":"alabamian"}]}
{"label":"ey","pattern":[{"lower":"alangan"}]}
{"label":"ey","pattern":[{"lower":"alaskan"}]}
{"label":"ey","pattern":[{"lower":"albanian"},{"lower":"american"}]}
{"label":"ey","pattern":[{"lower":"albanian"},{"lower":"british"}]}
...[and many, many more]
and suddenly the examples I was presented with for accept/reject/skip decisions all seemed reasonable. The problem is that I don’t want the label to be lower-case. Also, the accepted examples all carry the label “ey” while the rejected ones carry “EY” (I forgot to change the flag to --label ey and left it as --label EY). Or maybe it was the other way around; I can’t remember.
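To settle which label ended up with which answer, the annotations could be exported as JSONL and tallied. Below is a minimal sketch, assuming each exported task has an "answer" field and a "spans" list whose items carry a "label"; the three example records are hypothetical, not my real data.

```python
import json
from collections import Counter


def tally_labels(jsonl_lines):
    """Count (span label, answer) pairs across exported annotation tasks."""
    counts = Counter()
    for line in jsonl_lines:
        if not line.strip():
            continue
        task = json.loads(line)
        answer = task.get("answer", "unknown")
        for span in task.get("spans", []):
            counts[(span.get("label"), answer)] += 1
    return counts


# Hypothetical records in the assumed export format.
example = [
    '{"text":"French patients","spans":[{"start":0,"end":6,"label":"ey"}],"answer":"accept"}',
    '{"text":"The French Academy","spans":[{"start":4,"end":10,"label":"EY"}],"answer":"reject"}',
    '{"text":"Akan speakers","spans":[{"start":0,"end":4,"label":"ey"}],"answer":"accept"}',
]
counts = tally_labels(example)
for (label, answer), n in sorted(counts.items()):
    print(label, answer, n)
```

Run against the real export, a table like this would show whether the “ey” spans came from the patterns and the “EY” spans from the model, or the reverse.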
So my questions are: (1) Why is this happening? (2) Should this be happening? (3) How do I make it work with upper-case labels? (It isn’t the number of characters in the label, either, because I tried “ETHNICITY” and the ner.teach session also presented random predictions.)