improve custom NER model accuracy

we trained a custom NER model from blank:en and got about 73% validation accuracy, but when testing, the model seems to have learned some patterns incorrectly:

  1. when the text is "135dea28e4f8 distribute frankly simon simon england", it predicts "135dea28e4f8" as an entity, but after removing the last few words, leaving only "135dea28e4f8 distribute", it no longer predicts "135dea28e4f8" as an entity (reproduced with the snippet below)
  2. in our annotations, whether a word is an entity sometimes depends on its semantic context: the same word is labelled as an entity in some sentences and not in others
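
For reference, this is roughly how we compare the two cases (the model path below is just a placeholder):

```python
import spacy

# Load our trained NER model (placeholder path) and print the
# predicted entities for the long and the shortened text.
nlp = spacy.load("./your_model")

texts = [
    "135dea28e4f8 distribute frankly simon simon england",
    "135dea28e4f8 distribute",
]
for text in texts:
    doc = nlp(text)
    print(text, "->", [(ent.text, ent.label_) for ent in doc.ents])
```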

Is there anything we can do to improve the performance?

Hi! If you have examples of concepts and patterns that your model currently gets wrong, a great way to improve it is to include examples like this (correctly annotated) in the training data. For instance, sentences prefixed with a random ID where the number is consistently annotated as an entity (or not, depending on the behaviour you're looking for). Similarly, if you have ambiguous concepts ("apple" the company vs. "apple" the fruit), you can include more examples of those in different contexts.
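
In spaCy v2's simple training format, such examples could look roughly like this – the "ID" label and the third sentence are just made up for illustration:

```python
# Hypothetical training examples: the leading hex ID is annotated as an
# entity consistently, in both long and short contexts, and there's also
# an example without the ID where no entity is annotated.
TRAIN_DATA = [
    (
        "135dea28e4f8 distribute frankly simon simon england",
        {"entities": [(0, 12, "ID")]},  # characters 0-12 cover the ID
    ),
    (
        "135dea28e4f8 distribute",
        {"entities": [(0, 12, "ID")]},  # same annotation in the short context
    ),
    (
        "distribute frankly simon simon england",
        {"entities": []},  # no ID, so no entity
    ),
]
```

The important part is that the annotations are consistent, so the model can't pick up on an accidental correlation with the sentence length or the surrounding words.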

Are you using any pretrained embeddings, like word vectors? This can often give you a significant boost in accuracy, because your model starts off with at least some concept of the words in your data. For a quick experiment, try using en_core_web_lg or en_vectors_web_lg as the base model instead of blank:en, and see how that improves your results.
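
If you're training with spaCy v2's Python API (rather than a recipe), swapping in the base model could look roughly like this – the label name, iteration count and hyperparameters are just placeholders:

```python
import random

import spacy
from spacy.util import minibatch, compounding

# Same format as the examples above; "ID" is a made-up label.
TRAIN_DATA = [
    ("135dea28e4f8 distribute frankly simon simon england",
     {"entities": [(0, 12, "ID")]}),
    ("135dea28e4f8 distribute",
     {"entities": [(0, 12, "ID")]}),
]

# Load a model that ships with pretrained word vectors instead of
# starting from a completely blank pipeline.
nlp = spacy.load("en_core_web_lg")

# Use a fresh NER component so it only learns your custom labels.
if "ner" in nlp.pipe_names:
    nlp.remove_pipe("ner")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner, last=True)
for _, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        ner.add_label(label)

# Only update the NER weights and keep the other components frozen.
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for itn in range(20):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for batch in minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.3, losses=losses)
        print(itn, losses)

nlp.to_disk("./custom_ner_model")
```

The disable_pipes block means only the new NER component gets updated, while the word vectors that ship with en_core_web_lg stay available in the vocab for it to use as features.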

(Later, you could also use data-to-spacy to export your annotations and test spacy-nightly. It's still a pre-release and you'll have to use a separate Python environment for it, but you'll be able to experiment with initialising your model with transformer embeddings, which could improve your results even further. The #1 thing to focus on IMO is still the data, though – if you know what the model is getting wrong, there's a great opportunity here to give it more examples it can learn from.)
