I need to introduce a new entity, STATE, in my model, because I have to differentiate between NATION (executive branch: USA, Washington, White House) and STATE (governmental organizations and functions: State Department, Secretary of Defense, etc.). These are not standard entities, but it's very easy to generate a list of suitable candidate words that are basically all gold entities. The workflow:
First, a few manual annotations:
ner.manual dbase-state blank:en data.txt --label STATE --patterns patterns-state.jsonl
Unfortunately, the "--patterns" option in ner.manual does not work. The pattern format is correct (it works for entities such as PERSON).
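For reference, a minimal sketch of how a patterns file like patterns-state.jsonl can be generated from a candidate word list. It assumes the usual JSONL layout with a "label" and a token-based "pattern" per line; the function names and the example terms are hypothetical, and the whitespace split only approximates the real tokenizer.

```python
import json

def make_patterns(phrases, label):
    """Turn plain phrases into token-based match patterns (one dict per token)."""
    patterns = []
    for phrase in phrases:
        # Naive whitespace split: the real tokenizer may split differently
        # (e.g. around punctuation), so multi-token phrases need a manual check.
        tokens = [{"lower": word.lower()} for word in phrase.split()]
        if tokens:
            patterns.append({"label": label, "pattern": tokens})
    return patterns

def to_jsonl(patterns):
    """Serialize the patterns as one JSON object per line."""
    return "\n".join(json.dumps(p) for p in patterns)

# Hypothetical candidate list, for illustration only
state_terms = ["State Department", "Secretary of Defense"]
state_patterns = make_patterns(state_terms, "STATE")
print(to_jsonl(state_patterns))
```

The printed lines can be written to a .jsonl file and passed to --patterns.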
Then a quick training run:
train ner dbase-state en_core_web_sm --output model-state-01
The F-score is obviously hopeless, but as I understand it, you just have to generate some initial examples.
Now the model-in-the-loop annotation:
ner.teach dbase-state model-state-01 data.txt --label STATE --patterns patterns-state.jsonl
ner.teach gives the following error message:
ValueError: [E152] The attribute AGENCY is not supported for token patterns. Please use the option validate=True with Matcher, PhraseMatcher, or EntityRuler for more details.
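The message suggests that some token dict in a pattern uses AGENCY as its key. A rough sketch for locating such entries in a patterns file follows. It uses only the standard library; the SUPPORTED set is my assumption of (a subset of) the attributes the Matcher accepts and may need adjusting for the installed spaCy version, and the sample lines are hypothetical, not taken from my actual file.

```python
import json

# Assumption: a partial list of token attributes accepted in Matcher patterns.
SUPPORTED = {
    "ORTH", "TEXT", "LOWER", "LENGTH", "IS_ALPHA", "IS_ASCII", "IS_DIGIT",
    "IS_LOWER", "IS_UPPER", "IS_TITLE", "IS_PUNCT", "IS_SPACE", "IS_STOP",
    "LIKE_NUM", "LIKE_URL", "LIKE_EMAIL", "POS", "TAG", "DEP", "LEMMA",
    "SHAPE", "ENT_TYPE", "OP", "_",
}

def find_unsupported_keys(jsonl_lines):
    """Return (line_number, key) pairs for token keys not in SUPPORTED."""
    problems = []
    for line_no, line in enumerate(jsonl_lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        pattern = entry.get("pattern")
        if isinstance(pattern, str):
            continue  # exact-string phrase pattern, no token attributes
        for token in pattern or []:
            for key in token:
                if key.upper() not in SUPPORTED:
                    problems.append((line_no, key))
    return problems

# Hypothetical sample: the second line has the label where an attribute belongs.
sample = [
    '{"label": "STATE", "pattern": [{"lower": "state"}, {"lower": "department"}]}',
    '{"label": "STATE", "pattern": [{"AGENCY": "pentagon"}]}',
]
problems = find_unsupported_keys(sample)
print(problems)  # [(2, 'AGENCY')]
```

For a real file, the lines can be read with open("patterns-state.jsonl") and passed to the same function.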
Any suggestions for a simple workaround? The easiest is obviously to reuse entities that are not otherwise used (WORK_OF_ART for STATE, GPE for NATION), although that is likely to mess up the word vectors completely.