Prodigy model not learning, spaCy model ~90% F1 score

(Einar Bui Magnusson) #1

I’m trying to demonstrate that I can get reasonably similar results by training a NER model with Prodigy as with spaCy, but failing. I have several thousand examples fully annotated in the spaCy format, with several additional entity types beyond the pre-trained ones. My spaCy training loop is pretty straightforward: I use a compounding batch size (4 to 16) and train for just ~5 epochs, and my accuracy is around 90% (this is an artificially generated dataset, so it may be a bit homogeneous). I took the same data and converted it to Prodigy’s format:

   "text":"who are W Chen and Jane M Doe",

and generated an equal amount of “reject” data by randomising all the entity labels. Feeding this dataset to prodigy ner.batch-train does not work very well. The “before” accuracy is ~0.25, and right after the first epoch it’s quite close to 0.5, and stays there. I’m therefore guessing it just predicts “accept” or “reject” for everything. Note that this is for using the --no-missing flag, if I don’t use it then the accuracy jumps to ~0.6-0.7 and then moves towards 0.5.
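For concreteness, the conversion and the reject generation can be sketched like this (the helper name `make_reject`, the label set, and the exact character offsets are just illustrative):

```python
import copy
import random

# A fully annotated example in Prodigy's task format: character offsets
# into "text", plus a top-level answer.
accept_task = {
    "text": "who are W Chen and Jane M Doe",
    "spans": [
        {"start": 8, "end": 14, "label": "PERSON"},   # "W Chen"
        {"start": 19, "end": 29, "label": "PERSON"},  # "Jane M Doe"
    ],
    "answer": "accept",
}

def make_reject(task, labels, rng=random):
    """Build a 'reject' example by randomising every span's label
    (helper name made up for this sketch)."""
    reject = copy.deepcopy(task)
    for span in reject["spans"]:
        wrong = [label for label in labels if label != span["label"]]
        span["label"] = rng.choice(wrong)
    reject["answer"] = "reject"
    return reject
```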

I then saw a post (which I can’t currently find again), where @honnibal mentioned that the "answer":"accept" field should be added to any entity spans, so I tried that. But that resulted in 0.0 accuracy, so I think that must have been wrong.

Any idea what could be going wrong? I’ve tried reducing the learning rate by an order of magnitude, tried changing batch size. Happy to give more detailed information about my data/problem on request.

(Matthew Honnibal) #2

Hmm. Could you provide the Prodigy command you’re running on the command line?

One possibility I’m thinking is, maybe you’re starting off with a pretrained model in spaCy? If so, you’d be updating the existing weights, and the model would start off predicting the other entities. This can make it harder to learn the new task you’re providing.

This might not be the problem, but it’s the first thing that comes to mind.

(Einar Bui Magnusson) #3

Yes, I’m using the pretrained en_core_web_sm model in both the spaCy and Prodigy settings. The Prodigy command:

prodigy ner.batch-train my_dataset en_core_web_sm --n-iter 5 --batch-size 32 --no-missing -o model_output

You’re suggesting I start out with a blank model in Prodigy instead? Worth a try… I just thought it was generally better to take advantage of transfer learning by using the pretrained models. In my spaCy training loop I actually do some pseudo-rehearsal with more general news text; maybe that reduces the need to start with a pre-trained model? (I would of course still “annotate” the rehearsal data with a pre-trained model first.)
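The rehearsal step itself is mostly just data mixing. A minimal sketch of how the model-annotated news examples get combined with the task data (the function name and ratio are illustrative, not my actual code):

```python
import random

def mix_with_rehearsal(new_examples, rehearsal_examples, ratio=0.25,
                       rng=random):
    """Return a shuffled training set in which roughly `ratio` of the
    items are rehearsal examples (old model predictions treated as gold).
    Purely illustrative; in practice each example is a (text, annotations)
    pair."""
    n_rehearsal = round(len(new_examples) * ratio / (1.0 - ratio))
    mixed = list(new_examples) + rng.sample(list(rehearsal_examples),
                                            n_rehearsal)
    rng.shuffle(mixed)
    return mixed
```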

(Einar Bui Magnusson) #4

@honnibal, I can confirm that starting with a blank model in Prodigy does not improve things: the accuracy stalls at ~0.35.

(Matthew Honnibal) #5

Is your batch size too high? 32 is quite a lot higher than the batch size you’re using in spaCy (4 to 16). Also, try starting with en_vectors_web_lg so that you still have the word vectors.

Neural networks can be pretty delicate sometimes: seemingly small differences can determine whether the model converges well or doesn’t. It could be that tricks like the pseudo-rehearsal you’re doing in the spaCy code are very important.

Ultimately Prodigy can’t provide an all-in-one, infallible solution for training an NER model. The ner.batch-train recipe does have pretty reasonable defaults that make it easy to experiment. But there will always be datasets where you can get better results by implementing a slightly different training loop, including some of the tricks we use in spacy train, such as varying the batch size during training.
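For instance, the compounding batch-size trick can be reproduced in a custom loop with a small generator. This is a sketch of what `spacy.util.compounding` does, not the library code itself (and the factor 1.5 is exaggerated for readability):

```python
from itertools import islice

def compounding(start, stop, compound):
    """Yield an infinite series that compounds from `start` toward `stop`
    and then stays there -- a sketch of spaCy's `spacy.util.compounding`."""
    curr = float(start)
    while True:
        yield min(curr, stop)
        curr *= compound

# Draw one batch size per update; spaCy's train command uses a much
# gentler factor (e.g. 1.001) so the size grows over many batches.
sizes = compounding(4.0, 16.0, 1.5)
print(list(islice(sizes, 5)))  # [4.0, 6.0, 9.0, 13.5, 16.0]
```

In a spaCy v2-style loop you would then feed the generator to the batcher, e.g. `minibatch(train_data, size=compounding(4.0, 32.0, 1.001))`.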

(Einar Bui Magnusson) #6

Ah, in my comparison testing I’m running my spaCy loop without rehearsal data, and it still converges well. I’m a little all over the place trying to get the Prodigy training to converge… I’ve tried various batch sizes as well, and restricting the data to previously known entity labels. I’ll keep brute-forcing it a bit longer; I understand that it’s hard to diagnose without access to the data.

(Matthew Honnibal) #7

If the spaCy loop is converging, there’s nothing wrong with exporting the data and training with spaCy. Here’s a bit more context around the batch sizing, in case it’s helpful.

spaCy’s NER and parsing models use an imitation learning objective. The models are transition-based: the parser initialises a state object with a stack, a buffer, and a data structure to hold the partial analysis. What the model is actually predicting are actions that manipulate the state. For the NER model, we have an action which extends the current entity, an action which begins a new entity, an action which creates a single-word entity, etc.
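To make that concrete, here’s a toy version of how a sentence’s gold entity spans map to a sequence of those actions. The action names follow the B/I/L/U/O scheme spaCy’s NER uses (Begin, In, Last, Unit, Out); the function itself is just an illustration, not spaCy’s implementation:

```python
def span_actions(n_tokens, spans):
    """Map gold spans (token-level start, exclusive end, label) to the
    transition actions that would build them -- an illustrative toy."""
    actions = []
    i = 0
    for start, end, label in sorted(spans):
        actions += ["O"] * (start - i)          # OUT: token is not an entity
        if end - start == 1:
            actions.append(f"U-{label}")        # UNIT: single-word entity
        else:
            actions.append(f"B-{label}")        # BEGIN a new entity
            actions += [f"I-{label}"] * (end - start - 2)  # IN: extend it
            actions.append(f"L-{label}")        # LAST: close the entity
        i = end
    actions += ["O"] * (n_tokens - i)
    return actions

# Tokens: who are W Chen and Jane M Doe
print(span_actions(8, [(2, 4, "PERSON"), (5, 8, "PERSON")]))
# ['O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'L-PERSON']
```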

The imitation learning objective comes in because during training, we’re updating the state based on the previous prediction, and then asking “Given the gold-standard, what action can we take next that will let us get to an equally good final outcome?”. After you’ve begun an incorrect new entity, if the next word doesn’t start a new entity, the objective is indifferent to whether you extend this incorrect entity, or end it immediately. That flexibility lets us learn better solutions, because we don’t constrain the objective with decisions we’re indifferent to.

The other big advantage of the imitation learning is that we see states that result from incorrect previous predictions. The simpler way of training these transition-based models doesn’t see those error states, so once the model makes a mistake at runtime, you end up in a state unlike what you’ve seen during training.

However, there can be a difficulty at the start of training. When the model begins training, it’s predicting actions more or less at random, so we’re training from a not-very-interesting part of the state space. We want to escape from this situation quickly, so we want to start making updates as quickly as possible. This motivates a small batch size towards the start of training. However, the small batch size might not be optimal once we’ve gotten past this starting problem. That’s why spaCy supports this increasing batch size scheme, even though it’s not really popular for other neural network models.

The more technical/abstract way to say all this is: because we’re using imitation learning, the objective function we’re learning is non-stationary. Because the objective is non-stationary, the learning dynamics are a bit different. This means the batch sizing can matter in different ways from many neural network models. I understand that GANs and reinforcement learning models have similar considerations, although I haven’t worked with either of those directly.