Model accuracy not improving with new gold data

I am trying to improve NER accuracy with manually corrected gold data. After including the second set, the model’s accuracy is actually not improving. In fact, combining the second set reduces the accuracy from 60 to 54. I ran ner.batch-train with the --no-missing flag because the data is fully labelled gold data.
Am I missing something?

Your workflow sounds good. Did you double-check that your data doesn’t contain any accidental conflicting annotations?
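If it helps, here’s a rough sketch of how you could scan a dataset for surface forms that were given different labels in different examples. This assumes Prodigy-style JSONL records with `text` and `spans` keys; the sample texts are made up:

```python
from collections import defaultdict

def find_conflicts(examples):
    """Flag surface forms annotated with different labels across examples.

    Assumes Prodigy-style records:
    {"text": ..., "spans": [{"start": ..., "end": ..., "label": ...}]}.
    """
    seen = defaultdict(set)
    for eg in examples:
        for span in eg.get("spans", []):
            surface = eg["text"][span["start"]:span["end"]]
            seen[surface].add(span["label"])
    # Keep only surface forms that received more than one distinct label.
    return {surface: labels for surface, labels in seen.items() if len(labels) > 1}

examples = [
    {"text": "Shell drilled a new well", "spans": [{"start": 0, "end": 5, "label": "COMPANY"}]},
    {"text": "The Shell formation is deep", "spans": [{"start": 4, "end": 9, "label": "FORMATION"}]},
]
print(find_conflicts(examples))  # {'Shell': {'COMPANY', 'FORMATION'}}
```

A clash found this way isn’t automatically a mistake — the same word can legitimately be different things in different contexts — but it’s a quick way to surface candidates for review.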

And what exactly are you trying to train? It’s always possible that the model struggles to learn and doesn’t converge – for example, if there aren’t enough clues in the local context or if the label scheme is fuzzy.

I don’t think they have any conflicting labels. (By conflict I mean accidentally tagging a word as one label when it occurs in the clear context of another label — basically wrong labelling.) The first set was prepared with ner.manual and the second set with ner.teach, with the base model in the loop. This is very specific text in the oil and gas domain, and it was tagged by a domain expert with a lot of experience, so I do not think it has conflicts.

For the second set, are you still training with --no-missing? If you created the data with ner.teach, I think that would be problematic.

Also, are you evaluating on the same evaluation data each time, or are you using the data splitting in ner.batch-train? If you’re using a random split and you’re training on different data, the evaluation would be referring to different texts, so figures can vary.
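One way to avoid the random-split problem, if you’re splitting the data yourself rather than relying on the recipe, is to partition deterministically on a hash of the text. This is just a sketch, not Prodigy’s own behaviour:

```python
import hashlib

def stable_split(examples, eval_pct=20):
    """Assign each example to train or eval based on a hash of its text,
    so the same text always lands in the same partition, even as the
    dataset grows between annotation rounds."""
    train, evaluation = [], []
    for eg in examples:
        bucket = int(hashlib.md5(eg["text"].encode("utf8")).hexdigest(), 16) % 100
        (evaluation if bucket < eval_pct else train).append(eg)
    return train, evaluation
```

Because the assignment depends only on the text, re-running the split after adding new data never moves an old example from train to eval or vice versa.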

Thank you for the explanation @honnibal. Yes, I am using the data splitting in ner.batch-train. But I have a dedicated test set now; I will check on it and get back with the numbers.

Yes, I used --no-missing with the second set created by ner.teach. But how does --no-missing cause problems if I created the second-round data with ner.teach? I used a custom recipe that converts the output to gold data.

OK, with the dedicated gold test set, I see a slight improvement. These are the numbers I get with the previous set and with the added set. Both are evaluated on the same gold test data.

Round 1: Training on manual + some gold combined

	With --no-missing:
		Precision : 70.77
		Recall    : 58.12
		F Score   : 63.83
	Without --no-missing:
		Precision : 59.00
		Recall    : 57.70
		F Score   : 58.30

Round 2: Training on Round 1 + new gold data

	With --no-missing:
		Precision : 67.37
		Recall    : 61.36
		F Score   : 64.22
	Without --no-missing:
		Precision : 58.09
		Recall    : 62.51
		F Score   : 60.22

Precision has decreased while recall and F-score have increased. That means the model is finding more of the true entities, at the cost of picking up a few more false positives. But I think the gain is pretty low for the added set: the first model had 1,800 annotated comments for training and the second had 3,200. I expected significant learning from 1,400 extra annotated texts.
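As a sanity check, the F-scores above are just the harmonic mean of precision and recall, so the reported figures can be reproduced (up to rounding):

```python
def f_score(precision, recall):
    # F1: harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

print(f_score(70.77, 58.12))  # Round 1 with --no-missing, ~63.8
print(f_score(67.37, 61.36))  # Round 2 with --no-missing, ~64.2
```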

Ah, okay. If you’ve got a workflow that takes the ner.teach output and ensures that you’ve created complete and correct annotations from it, then it’s right to use --no-missing.
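For what it’s worth, that merge step might look roughly like this — a hypothetical sketch, assuming Prodigy-style records with an `answer` field; your custom recipe may well differ:

```python
def teach_to_gold(examples):
    """Very rough sketch of turning binary ner.teach decisions into gold data.

    ner.teach asks about one span at a time, so a single text can yield
    several records. Here we merge all *accepted* spans per text and drop
    rejected ones."""
    merged = {}
    for eg in examples:
        rec = merged.setdefault(eg["text"], {"text": eg["text"], "spans": []})
        if eg["answer"] == "accept":
            rec["spans"].extend(eg.get("spans", []))
    return list(merged.values())
```

The key risk is completeness: if the teach session never asked about an entity, it simply ends up missing from the “gold” record, and training with --no-missing will then teach the model that it is not an entity.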

It’s tough to say why the new data is confusing the model. One possibility is that you’re simply getting unlucky. You’ve got three sets, right? Evaluation, Train1, Train2. If the texts in Train1 happen to be more similar to the texts in Evaluation, then training on Train1+Train2 could make the accuracy lower.

I think you should probably just dig into the training sets and see if you can figure out what’s wrong. The ner.print-dataset command is useful for this, because it lets you page through the examples in a terminal. Look especially for mistaken annotations in Train2, or perhaps differences of opinion about annotations. Also double-check that there’s no accidental overlap of texts between Evaluation and Train1. You might also train on only Train2, without Train1, and see what sort of model you get. If you look at the errors of that model, you might have a clearer picture of what’s going wrong.
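The overlap check in particular is cheap to script (again assuming records with a `text` key):

```python
def leaked_texts(train_examples, eval_examples):
    """Return evaluation texts that also appear in the training data.
    Any hits inflate the evaluation scores."""
    train_texts = {eg["text"] for eg in train_examples}
    return sorted({eg["text"] for eg in eval_examples} & train_texts)

train = [{"text": "alpha"}, {"text": "beta"}]
evalset = [{"text": "beta"}, {"text": "gamma"}]
print(leaked_texts(train, evalset))  # ['beta']
```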

Yeah, probably I should look into the data, and also separate two evaluation sets from both and cross-verify the numbers. I will get back with those numbers. Also, so far the model looks like it’s saturating at ~70 precision. How do I change the set of features the algorithm considers? 80+ would be a good place to have the model.

There’s not really a good way to add features at the moment. Fortunately it should also be fairly unnecessary. What you should do instead is try fiddling with the hyper-parameters. You can set the learning rate via the learn_rate environment variable, and you can change the dropout and batch size with arguments to the Prodigy recipe.

You can find some more thoughts about spaCy’s default hyper-parameters here:

You might also find the ner.train-curve recipe useful. This lets you see how the accuracy improves with more data. Another thing to try: use the training data as the evaluation set, and see which training examples the model persistently gets wrong. Have a look at those examples, and consider removing them from the training set. Remember that the optimization objective is to reach 0 loss on all of the training data you give it. When you think about it, it’s not surprising that some training examples might result in worse generalisation.
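The “persistently wrong training examples” idea can be sketched like this, treating the trained model as a black box that predicts `(start, end, label)` spans — `predict` here is a stand-in callable, not a real Prodigy or spaCy API:

```python
def persistent_errors(predict, train_examples):
    """Compare the model's predictions on its own training data against
    the gold spans; examples it still gets wrong are review candidates."""
    errors = []
    for eg in train_examples:
        gold = {(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])}
        pred = set(predict(eg["text"]))
        if gold != pred:
            errors.append({"text": eg["text"],
                           "missed": gold - pred,      # gold spans not predicted
                           "spurious": pred - gold})   # predicted spans not in gold
    return errors
```

Sorting the output by how many spans were missed gives a quick shortlist of examples to inspect or drop.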
