So I've been experimenting a bit with splitting annotation data into separate datasets according to their labels, so that I can reuse the data later depending on which labels I need. Let's say I have two labels, A and B, and I saved their annotations (for the same set of texts, X) into two different datasets. Then I used the solution proposed here:
to merge them, trained a model on this new dataset AB and it worked very well for both entities.
Now I have a similar type of text, Y, which also contains entities A and B. I did the same thing: I created the annotations separately, merged them into AB2, and trained a model on it, which works well for text type Y. Since the dataset for text type Y is rather small, I then tried merging the annotations for the two text types (AB + AB2), and was surprised that the model trained on AB alone (so only on text type X) performed better on text type Y than the combined model trained on AB + AB2 (so including annotations from both text types). Is there any reason for that? Did I merge the datasets the wrong way? Or was I just unlucky with the data I tested on? I used ner.manual for annotation and --no-missing when training with ner.batch-train.
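In case the details matter, the merge amounts to something like this simplified sketch, keyed on the shared text (Prodigy itself keys examples on input hashes, which is more robust; the example records below are made up):

```python
def merge_annotations(*datasets):
    """Combine spans from datasets that annotate the same texts.

    Simplified: keys on the raw text. Prodigy's own merging works on
    _input_hash values instead.
    """
    merged = {}
    for dataset in datasets:
        for eg in dataset:
            entry = merged.setdefault(eg["text"], {"text": eg["text"], "spans": []})
            entry["spans"].extend(eg.get("spans", []))
    return list(merged.values())

# Made-up examples: labels A and B annotated separately on the same text.
ds_a = [{"text": "Acme hired Ann.", "spans": [{"start": 0, "end": 4, "label": "A"}]}]
ds_b = [{"text": "Acme hired Ann.", "spans": [{"start": 11, "end": 14, "label": "B"}]}]

merged = merge_annotations(ds_a, ds_b)
print(merged)  # one example carrying both the A span and the B span
```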
It’s kind of hard to speculate, to be honest! If you’re evaluating on different datasets in the two experiments, that could explain a lot, as maybe the evaluation got harder. It’s best to make sure you’ve got a stable evaluation set when you’re making these comparisons.
If you’re working with a stable evaluation set, one thing you might want to check is the general variance of the model’s accuracy on your task. Try changing the random seeds inside the ner.batch_train recipe to different values, and see how much that changes the accuracy. On small datasets, results sometimes differ a lot just through random chance, due to the initialisation and the order of iteration over the examples. If you have high variance, it’s important to know about it, because it tells you there might not be any interesting explanation for a difference in accuracy between two experiments. High variance can also tell you that you’d benefit from tuning your hyper-parameters more carefully. It might help to reduce the learning rate. If you’re not already using pretrained word vectors, adding them can also help reduce variance.
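To make that concrete, here’s a rough sketch of quantifying run-to-run variance once you’ve re-run training with a few different seeds (the accuracy numbers here are invented — substitute your own results):

```python
import statistics

# Accuracy from repeated ner.batch-train runs, one per random seed
# (hypothetical numbers).
accuracies = [0.81, 0.78, 0.84, 0.79, 0.83]

mean = statistics.mean(accuracies)
stdev = statistics.stdev(accuracies)
print(f"mean={mean:.3f} stdev={stdev:.3f}")

# Rough rule of thumb: a gap between two experiments that sits within
# ~2 standard deviations of run-to-run noise may not mean anything.
gap = 0.03
if gap < 2 * stdev:
    print(f"a gap of {gap} is within noise for this setup")
else:
    print(f"a gap of {gap} looks like a real difference")
```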
If your variance is low, it’s possible something more interesting is going on (although even then it’s not certain). The mystery might not be easy to solve. I often find there’s something puzzling in my results that I can’t immediately explain, but in the end I have to move on before I get to the bottom of it.
One thing you might want to check and rule out is whether the annotation standards shifted slightly between the two annotation sessions. It’s easy to slip into a slightly different policy about what counts as an entity of a particular type.
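One rough, automatic check for this is to compare per-label span counts and average span lengths between the two datasets — if label A’s spans are systematically longer in one set, the annotation policy has probably drifted. A sketch over Prodigy-style records (the example data is made up):

```python
from collections import Counter

def span_stats(examples):
    """Per label: (number of spans, average span length in characters)."""
    counts, total_len = Counter(), Counter()
    for eg in examples:
        for span in eg.get("spans", []):
            counts[span["label"]] += 1
            total_len[span["label"]] += span["end"] - span["start"]
    return {label: (n, total_len[label] / n) for label, n in counts.items()}

# Made-up records in Prodigy's span format, one per text type.
dataset_x = [{"text": "...", "spans": [{"start": 0, "end": 4, "label": "A"},
                                       {"start": 10, "end": 13, "label": "B"}]}]
dataset_y = [{"text": "...", "spans": [{"start": 0, "end": 12, "label": "A"}]}]

print(span_stats(dataset_x))
print(span_stats(dataset_y))  # A spans are much longer here -- possible drift
```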