new entity

Hi all,

I need to introduce a new entity in my model in which I have to differentiate between NATION (executive branch: USA, Washington, White House) and STATE (governmental organizations + functions: state department, secretary of defense etc.). These are not standard entities, but it's very easy to generate a list of suitable candidate words that are basically all gold entities. The flow:

  • first a few manual annotations:
    ner.manual dbase-state blank:en data.txt --label STATE --patterns patterns-state.jsonl
    Unfortunately the option "--patterns" in ner.manual does not work. The pattern format is correct (it works for the entities PERSON etc.)

  • a fast training:
    train ner dbase-state en_core_web_sm --output model-state-01
    the F-score is obviously hopeless, but as I understand it you just have to generate some initial examples.

  • now model-in-the-loop annotation:
    ner.teach dbase-state model-state-01 data.txt --label STATE --patterns patterns-state.jsonl

ner.teach gives the following error message:
ValueError: [E152] The attribute AGENCY is not supported for token patterns. Please use the option validate=True with Matcher, PhraseMatcher, or EntityRuler for more details.

Any suggestions for a simple workaround? The easiest is obviously re-using entities that are not used (WORK_OF_ART for STATE, GPE for NATION; although that is likely to mess up the word vectors completely)

thanks, Andreas

The most important question you should ask here is whether a single entity string will always be a NATION or always a STATE, or whether it is ambiguous. If you can, doing the task at the "type" level rather than the "token" level will be very advantageous. The second question you'll need to look at is what to do about metonymy. In news, but also in other genres of text, the names of places are very often used to refer to the bodies that meet there. Washington is used to refer to the US federal government, sometimes Foggy Bottom will be used to refer to the State Department, etc. In some genres of news these references get quite baroque in introductory paragraphs to avoid repetitive second mentions.
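To make the "type" level idea concrete, here is a minimal sketch, assuming every surface string maps to exactly one label. The GAZETTEER contents and the tag_types helper are hypothetical names made up for illustration, not part of any library, and the sketch only works if the strings really are unambiguous.

```python
import re

# Hypothetical gazetteer: each lowercased surface string has exactly one label.
GAZETTEER = {
    "white house": "NATION",
    "washington": "NATION",
    "usa": "NATION",
    "state department": "STATE",
    "secretary of defense": "STATE",
}


def tag_types(text, gazetteer=GAZETTEER):
    """Return (start, end, label) character spans for gazetteer strings in text."""
    # Try longer strings first so a multi-word term wins over a shorter overlap.
    terms = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), gazetteer[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    example = "The State Department briefed the White House."
    for start, end, label in tag_types(example):
        print(example[start:end], label)
```

If a lookup like this already gets most mentions right, the statistical model only has to handle the strings that turn out to be ambiguous.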

The problem with metonymy is that the reference is not at all precise. If you have a sentence like "Washington needs to act decisively in this crisis", does the speaker mean the U.S. president, the congress, the executive branch...? The only answer you can make to this question is generally "yes" -- to all of them, none of them, some mix of them. There is no resolution to the question of which entity to link that mention to, because the writer was simply gesturing towards the general concept-space of U.S. government stuff. This is actually one of the reasons metonymy is used as a linguistic resource: it allows exactly this sort of imprecision.

I know this doesn't answer the specific questions you have about the software, but I think if you back up and re-examine how you're modelling the problem, you'll likely be able to combine very simple approaches to get good results. But if your model of the problem doesn't match the reality of how people are using these terms, good results will be impossible, because you'll be grading the model on questions to which there is no answer.

Anyway, in terms of the specifics of the software, I think you probably want the ner.manual or ner.correct recipes. The ner.teach binary recipe is still available for backwards compatibility, but in most situations the ner.correct recipe will work better, especially when teaching the model a new category.
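On the E152 error: the message suggests that a token dictionary in the patterns file uses AGENCY as a key. That is a guess, since the file itself isn't shown. In a token pattern the keys have to be token attributes such as "lower" or "text", and the entity label goes in the separate "label" field. Below is a rough sketch of generating such a file from a term list. The make_patterns and to_jsonl helpers and the example terms are hypothetical, and the whitespace split assumes the tokenizer splits these terms the same way.

```python
import json


def make_patterns(terms, label):
    """Build one token pattern per term, matching each word case-insensitively."""
    patterns = []
    for term in terms:
        # Naive whitespace split: one {"lower": ...} dict per word in the term.
        tokens = [{"lower": word.lower()} for word in term.split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


def to_jsonl(patterns):
    """Serialize the patterns as JSONL, one JSON object per line."""
    return "".join(json.dumps(entry) + "\n" for entry in patterns)


if __name__ == "__main__":
    # Illustrative terms only; replace them with your own candidate list.
    state_terms = ["State Department", "Secretary of Defense"]
    print(to_jsonl(make_patterns(state_terms, "STATE")), end="")
```

Saving that output as patterns-state.jsonl gives a file in which every token key is a supported attribute.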