Train curve accuracy getting worse

Hi Ines,

Thanks! Sorry, I do have one more follow-up question. After running the ner.train-curve recipe, I noticed that my model actually gets WORSE with more data. Why do you think that would be the case?

Sincerely,
Sidd

(I hope it’s okay that I moved this question to a new topic – I think it makes more sense this way and also makes it easier for others to find.)

How many examples do you have? And how much worse is it getting? Also, is the behaviour consistent, i.e. do you see roughly similar results if you run the train-curve recipe several times?
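For example, you could re-run it a few times with the same settings and compare the curves. This is just a sketch: “your_dataset” is a placeholder, and the exact flags may differ slightly depending on your Prodigy version.

```
# Run the train-curve recipe a few times to check whether the dip
# is consistent or just noise from the random eval split.
# "your_dataset" is a placeholder; --n-samples and --eval-split may
# vary slightly between Prodigy versions.
for i in 1 2 3; do
    prodigy ner.train-curve your_dataset en_core_web_sm --n-samples 4 --eval-split 0.2
done
```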

If you’re working with small numbers of examples, it’s more common to see fluctuations here. If the dataset is really small (like, under 500 examples), the results are kind of unpredictable and very difficult to reason about.

If accuracy is decreasing or not increasing with more examples, it’s usually an indication that your data just isn’t very suitable for teaching the model what it’s supposed to learn. It could also mean that the label scheme you’ve used isn’t suitable and that the model isn’t able to learn the distinction efficiently. This can happen if the local context around the entity doesn’t have enough clues, or if the labels are too specific. (See Matt’s talk for a more in-depth look at designing label schemes.)

Hey Ines,

No problem at all! I had ~8000 examples, but I didn’t use a separate evaluation set and just used the eval split to hold out a portion of the training set for evaluation. The accuracy went down from 68% at 25% of the data to 54% at 50%, and then recovered. This only occurred with the en_core_web_sm model; it did not happen for the md and lg models. And when I ran train-curve on the sm model again, accuracy did not decrease with more data.

That said, as you mentioned, the accuracy isn’t really increasing with more data either. Are there any strategies I can use to figure out what the issue with the annotations might be?

I do have another question that’s a bit of a tangent (maybe it deserves another post, but I’ll let you make that call). I noticed that the reported accuracy scores for these spaCy models are approximately 85%. If we’re getting a score lower than that (batch-train is giving me 63-64%) and are training on top of a pre-existing model, does that mean catastrophic forgetting is impacting the model? Another score-related question: why am I getting different scores from batch-train and train-curve? (I’m getting 63-64% from the former and 72-73% from the latter.)

Thanks,
Sidd

Hi Ines,

This might be a stupid question, but I’m assuming the model we use for annotation has to be the same model we use for training, right? So far, I’ve annotated with the sm model and have tried training with md and lg. If we want to try new models (md and lg) for training, we’d need to create new annotation sets with those models used for annotation, correct?

Thanks,
Sidd

Hi Sidd,

You definitely don’t need to create new annotations to train a new model! In fact, after you’ve collected a bit of data, you might find it’s better to start from a blank model. The reason is that the pre-trained model might start out with entities that aren’t relevant to your problem, which can get in the way. Try starting with the en_vectors_web_lg model and see how you go.
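For example (just a sketch, with a placeholder dataset name, and flag names that may vary slightly between Prodigy versions):

```
# Train on top of a model that only ships with word vectors, so there are
# no pre-trained entity types that could interfere with your label scheme.
# "your_dataset" is a placeholder for your annotation set.
prodigy ner.batch-train your_dataset en_vectors_web_lg \
    --output ./trained_model --n-iter 10 --eval-split 0.2
```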

I also think you’ll benefit from creating a stable evaluation set that you’ve vetted carefully. You want the annotations to be complete and correct, and you want to make sure there’s no overlap between your evaluation data and your training data. A stable evaluation set will make it easier to compare your results as you continue your work.
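One way to set that up (again just a sketch with placeholder dataset names; check the exact arguments for your Prodigy version):

```
# Export the annotations, split off a held-out portion and load it into its
# own dataset, so every run is evaluated on exactly the same examples.
# In practice you'd want to shuffle first and make sure the same texts
# don't end up in both files.
prodigy db-out your_dataset > all_annotations.jsonl
head -n 1000 all_annotations.jsonl > eval.jsonl      # held-out evaluation examples
tail -n +1001 all_annotations.jsonl > train.jsonl    # remaining training examples
prodigy db-in your_eval_set eval.jsonl
prodigy db-in your_train_set train.jsonl

# Then point batch-train at the dedicated evaluation dataset via --eval-id
# instead of using --eval-split.
prodigy ner.batch-train your_train_set en_vectors_web_lg \
    --output ./trained_model --eval-id your_eval_set
```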

I also wouldn’t necessarily worry about the difference between the accuracy you’re seeing and the accuracy reported for the stock model. Your annotations are different from the ones the model was initially trained on, and if you used ner.teach, the sorting algorithm will have preferred difficult examples, so your data is biased towards harder cases.

Hey Matthew,

Thanks for your response! For sure, that’s good to know. I’ll try creating a model from scratch.

With regard to the evaluation set, is it okay if I simply import the JSON file that was created for the evaluation set by the eval split in batch-train into a dataset and use that as my separate evaluation set? Or would you recommend creating a new dataset with brand new annotations?

Lastly, is there a target accuracy that I should be aiming for?

Thanks,
Sidd