ner.batch-train with new data on existing model


I have trained the model “de_core_news_sm” with two entities (ORG and MONEY) in the dataset “ds_1” and as a result I got a new model “de_core_news_sm_blp”. Now I have more data and I want to improve “de_core_news_sm_blp”. What is the best way to do so?

My process:

  1. ner.teach on the new data with the “de_core_news_sm_blp” model, saving the annotations to “ds_1”
  2. ner.batch-train on the expanded dataset “ds_1”, again starting from the base model “de_core_news_sm”

What would be an alternative or better approach?


Hi! Your workflow sounds good. I think that’s pretty much exactly what I would have suggested :+1: Training from the same “base model” on the full dataset is definitely good, because it lets you avoid unpredictable side effects from making lots of small updates to the same weights.

If you want to be extra safe, you could consider using a different dataset for the new annotations and then merging the two sets once you’re ready to train. It’s always easy to merge two datasets into one later, but it’s more annoying to separate a single dataset and remove examples if you’ve made a mistake.
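
To make the "merge later" idea concrete, here is a minimal sketch in plain Python. It assumes each dataset has been exported as a list of dicts with `text`, `spans` and `answer` keys; the records, the second dataset `ds_2` and the helper `merge_examples` are hypothetical and only for illustration. In practice you would more likely merge through Prodigy's own database tooling (for example a `db-merge` command, if your version has one) than by hand.

```python
import json


def merge_examples(*datasets):
    """Merge several lists of annotation dicts, dropping exact duplicates."""
    merged = []
    seen = set()
    for examples in datasets:
        for eg in examples:
            # Serialize with sorted keys so identical records map to the same key.
            key = json.dumps(eg, sort_keys=True)
            if key not in seen:
                seen.add(key)
                merged.append(eg)
    return merged


# Hypothetical records; the field names are an assumption about the export format.
ds_1 = [
    {
        "text": "Siemens zahlt 5 Mio. Euro.",
        "spans": [{"start": 0, "end": 7, "label": "ORG"}],
        "answer": "accept",
    },
]
ds_2 = [
    # Exact duplicate of the record in ds_1, so it is skipped when merging.
    {
        "text": "Siemens zahlt 5 Mio. Euro.",
        "spans": [{"start": 0, "end": 7, "label": "ORG"}],
        "answer": "accept",
    },
    {
        "text": "Siemens zahlt 5 Mio. Euro.",
        "spans": [{"start": 14, "end": 25, "label": "MONEY"}],
        "answer": "accept",
    },
]

merged = merge_examples(ds_1, ds_2)
print(len(merged))
```

Because the two source lists are never modified, a bad batch of new annotations can be dropped simply by leaving its dataset out of the merge.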