NER performance dropping when adding new entities

I have been having trouble combining my entities into one model that performs better than each one individually. For example, when training on STATE alone I can achieve 95% accuracy, and for CITY 90%, but when I try to combine them performance drops to 85%. You would think these entities would help each other, i.e. STATE usually comes after CITY…

My process looks something like the following:

  1. Create pattern files: CITY_PATTERN and STATE_PATTERN.
  2. Use ner.teach with STATE_PATTERN to create annotations.
  3. Train a new model called STATE_MODEL from en_core_web_lg.
  4. Use ner.teach with CITY_PATTERN and base model STATE_MODEL.
  5. Train a new model called CITY_STATE_MODEL using both sets of annotations and base model en_core_web_lg.
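For what it's worth, the pattern files in step 1 can be generated from seed term lists. Here's a minimal sketch using Prodigy's JSONL match-pattern format; the seed terms and file names are just placeholders for your own:

```python
import json

# Hypothetical seed terms -- substitute your real CITY/STATE lists.
state_terms = ["california", "texas", "oregon"]
city_terms = ["portland", "austin", "sacramento"]

def make_patterns(label, terms):
    """Build one match pattern per term: a single lowercase-token match."""
    return [{"label": label, "pattern": [{"lower": t}]} for t in terms]

def write_jsonl(path, records):
    """Write records as JSONL, one JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

write_jsonl("state_patterns.jsonl", make_patterns("STATE", state_terms))
write_jsonl("city_patterns.jsonl", make_patterns("CITY", city_terms))
```

Multi-word names (e.g. "New York") would need multi-token patterns, one token dict per word.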

Any suggestions for changing my process or ideas why performance may degrade?

thanks!

Hmm. Some questions:

  1. How many annotations do you have for STATE and CITY?
  2. Do you have dedicated evaluation data, or are you using the cross-fold validation from the two sets?

In the normal en_core_web_lg model, both STATE and CITY are labelled GPE. So it could be that distinguishing between the two is a harder policy to learn than labelling one and ignoring the other. Or maybe there is a problem here — I'm not sure.

Incidentally, have you tried something like the following as an alternate approach? If you’re subtyping types the model already recognises, it might perform better:


from spacy.tokens import Span

def retype_entity(ent, label):
    # Helper: rebuild the span with a new label and swap it into
    # doc.ents in place of the old entity.
    doc = ent.doc
    new_ent = Span(doc, ent.start, ent.end, label=label)
    doc.ents = [new_ent if e.start == ent.start and e.end == ent.end else e
                for e in doc.ents]

def subtype_gpe(nlp, texts, state_words, city_words):
    # Reference docs for similarity; make_doc skips the pipeline,
    # but the word vectors are still available.
    state_doc = nlp.make_doc(' '.join(state_words))
    city_doc = nlp.make_doc(' '.join(city_words))
    for doc in nlp.pipe(texts):
        for gpe in [ent for ent in doc.ents if ent.label_ == 'GPE']:
            # Check the final token, so multi-word names like
            # "North Dakota" match on "Dakota".
            if gpe[-1].text in state_words:
                retype_entity(gpe, 'STATE')
            elif gpe[-1].text in city_words:
                retype_entity(gpe, 'CITY')
            else:
                # Fall back to vector similarity against the word
                # lists; the 0.8 cutoff is a guess to tune.
                state_sim = gpe.similarity(state_doc)
                city_sim = gpe.similarity(city_doc)
                if state_sim >= 0.8 and state_sim >= city_sim:
                    retype_entity(gpe, 'STATE')
                elif city_sim >= 0.8:
                    retype_entity(gpe, 'CITY')
        yield doc

Especially if you’re thinking mostly about the U.S., the list of states is very small. You can also get a good list of cities out of the word vectors, or from Wikipedia or Freebase or something. You would then use the existing GPE definition to handle the contextual ambiguities, and then just subtype the GPEs based on your word list. You can also use the word vectors if you worry your word list will be incomplete, or if you want something a bit more quick-and-dirty.
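As a sketch of the word-vector idea: seed a small list, then pull in the words whose vectors sit near the seed centroid. The function names and the 0.7 threshold are my assumptions; with spaCy you'd build `vectors` and `vocab` from the model's vector table rather than the toy arrays shown here:

```python
import numpy as np

def expand_word_list(seed_words, vectors, vocab, threshold=0.7):
    """Expand a seed list with nearest neighbours by cosine similarity.

    vectors: (n_words, dim) array of word vectors.
    vocab:   list of n_words strings, aligned with the rows of `vectors`.
    threshold: cosine-similarity cutoff (an assumption -- tune on your data).
    """
    # Normalise rows so a dot product is a cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, 1e-8)
    word2row = {w: i for i, w in enumerate(vocab)}
    seed_rows = [word2row[w] for w in seed_words if w in word2row]
    if not seed_rows:
        return set(seed_words)
    # Average the seed vectors and compare every word against the centroid.
    centroid = unit[seed_rows].mean(axis=0)
    centroid /= max(np.linalg.norm(centroid), 1e-8)
    sims = unit @ centroid
    expanded = {vocab[i] for i in np.nonzero(sims >= threshold)[0]}
    return expanded | set(seed_words)
```

This is the quick-and-dirty version: it over-generates (near neighbours of city names include other place names), so you'd want to eyeball the output before using it as a pattern source.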

Working with the existing GPE type is nice because you don’t have to teach the model a new entity definition. You can improve the label with ner.teach, and training should be easier.