Prodigy model not learning, spaCy model ~90% F1 score

I’m trying to demonstrate that I can get reasonably similar results by training an NER model with Prodigy as with spaCy, but failing. I have several thousand examples fully annotated in the spaCy format, with several additional entity types beyond the pre-trained ones. My spaCy training loop is pretty straightforward: I use a compounding batch size (4→16), train for just ~5 epochs, and my accuracy is around 90% (this is an artificially generated dataset, so it may be a bit homogeneous). I took the same data and converted it to Prodigy’s format:

   "text":"who are W Chen and Jane M Doe",

and generated an equal amount of “reject” data by randomising all the entity labels. Feeding this dataset to prodigy ner.batch-train does not work very well. The “before” accuracy is ~0.25; right after the first epoch it’s quite close to 0.5, and it stays there. I’m therefore guessing it just predicts “accept” or “reject” for everything. Note that this is with the --no-missing flag; if I don’t use it, the accuracy jumps to ~0.6-0.7 and then moves towards 0.5.
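For reference, here’s a minimal sketch of the kind of conversion I did. The helper names and the spaCy-style `(text, {"entities": [...]})` input shape are my own; the output fields (`text`, `spans` with `start`/`end`/`label`, and `answer`) follow Prodigy’s JSONL annotation format:

```python
import copy
import random

def spacy_to_prodigy(text, annotations):
    """Convert one spaCy-style training example to a Prodigy-style dict."""
    return {
        "text": text,
        "spans": [
            {"start": start, "end": end, "label": label}
            for start, end, label in annotations["entities"]
        ],
        "answer": "accept",
    }

def make_reject(example, labels):
    """Create a 'reject' example by randomising the entity labels."""
    reject = copy.deepcopy(example)
    for span in reject["spans"]:
        # Pick a different label, so the annotation is genuinely incorrect.
        wrong = [label for label in labels if label != span["label"]]
        span["label"] = random.choice(wrong)
    reject["answer"] = "reject"
    return reject
```

Each resulting dict would then be written out as one JSON object per line and imported with prodigy db-in.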

I then saw a post (which I can’t currently find again), where @honnibal mentioned that the "answer":"accept" field should be added to any entity spans, so I tried that. But that resulted in 0.0 accuracy, so I think that must have been wrong.

Any idea what could be going wrong? I’ve tried reducing the learning rate by an order of magnitude, tried changing batch size. Happy to give more detailed information about my data/problem on request.

Hmm. Could you provide the Prodigy command you’re running on the command line?

One possibility is that you’re starting off with a pretrained model in spaCy. If so, you’d be updating the existing weights, and the model would start off predicting the other entities. This can make it harder to learn the new task you’re providing.

This might not be the problem, but it’s the first thing that comes to mind.

Yes, I’m using the pretrained en_core_web_sm model in both the spaCy and the Prodigy setting. The Prodigy command:

prodigy ner.batch-train my_dataset en_core_web_sm --n-iter 5 --batch-size 32 --no-missing -o model_output

You’re suggesting I start out with a blank model in Prodigy instead? Worth a try… I just thought it was generally better to take advantage of transfer learning by using the pretrained models. In my spaCy training loop I actually do some pseudo-rehearsal with more general news text; maybe that reduces the need to start out with a pre-trained model? (I would of course still “annotate” the rehearsal data with a pre-trained model first.)

@honnibal, I can confirm that starting with a blank model in Prodigy does not improve things; the accuracy stalls at ~0.35.

Is your batch size too high? 32 is quite a lot higher than the batch size you’re using in spaCy (4 to 16). Also, try starting with en_vectors_web_lg so that you still have the word vectors.

Neural networks can be pretty delicate sometimes: seemingly small differences can determine whether the model converges well or doesn’t. It could be that tricks like the pseudo-rehearsal you’re doing in the spaCy code are very important.

Ultimately Prodigy can’t provide an all-in-one, infallible solution to training an NER model. The ner.batch-train recipe does have pretty reasonable defaults that make it easy to experiment. But there will always be datasets where you can get better results by implementing a slightly different training loop, including some of the tricks we use in the spacy train recipe, such as varying the batch size during training.

Ah, in my comparison testing I’m running my spaCy loop without rehearsal data, and it still converges well. I’m a little all over the place trying to get the Prodigy training to converge… I’ve tried various batch sizes as well, and also tried restricting to previously known entity labels only. I’ll keep brute-forcing it a bit longer; I understand that it’s hard to diagnose without access to the data.

If the spaCy loop is converging, there’s nothing wrong with exporting the data and training with spaCy. Here’s a bit more context around the batch sizing, in case it’s helpful.

spaCy’s NER and parsing model uses an imitation learning objective. The models are transition-based: the parser initialises a state object with a stack, a buffer, and a data structure to hold the partial analysis. What the model is actually predicting are actions that manipulate the state. For the NER model, we have an action which extends a current entity, an action which begins a new entity, an action which creates a single-word entity, etc.

The imitation learning objective comes in because during training, we’re updating the state based on the previous prediction, and then asking “Given the gold-standard, what action can we take next that will let us get to an equally good final outcome?”. After you’ve begun an incorrect new entity, if the next word doesn’t start a new entity, the objective is indifferent to whether you extend this incorrect entity, or end it immediately. That flexibility lets us learn better solutions, because we don’t constrain the objective with decisions we’re indifferent to.

The other big advantage of the imitation learning is that we see states that result from incorrect previous predictions. The simpler way of training these transition-based models doesn’t see those error states, so once the model makes a mistake at runtime, you end up in a state unlike what you’ve seen during training.

However, there can be a difficulty at the start of training. When the model begins training, we’re predicting states sort of at random. So we’re training from a not-very-interesting part of the state space. We want to escape from this situation quickly, so we want to start making updates as quickly as possible. This motivates a small batch size towards the start of training. However, the small batch size might not be optimal for learning once we’ve gotten over this starting problem. That’s why spaCy supports this increasing batch size solution, even though it’s not really popular for other neural network models.
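spaCy exposes this as spacy.util.compounding; here’s a self-contained sketch of the same idea (the compound rate of 1.5 is exaggerated just to show the growth; in practice you’d use something much closer to 1.0, e.g. 1.001):

```python
def compounding(start, stop, compound):
    """Infinite series of batch sizes: start, start*compound, start*compound**2, ...
    capped at stop. Small batches early in training, larger ones later
    (mirrors the idea behind spacy.util.compounding)."""
    size = float(start)
    while True:
        yield min(size, stop)
        size *= compound

# Batch sizes growing 4 -> 16 over training, as in the spaCy loop described above.
sizes = compounding(4.0, 16.0, 1.5)
first_six = [next(sizes) for _ in range(6)]
```

You’d draw a fresh batch size from the generator at each step and use it to slice the shuffled training data into minibatches.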

The more technical/abstract way to say all this is: because we’re using imitation learning, the objective function we’re learning is non-stationary. Because the objective is non-stationary, the learning dynamics are a bit different. This means the batch sizing can matter in different ways from many neural network models. I understand that GANs and reinforcement learning models have similar considerations, although I haven’t worked with either of those directly.

Thanks for the explanation @honnibal - I might need to think/read more about this to fully grasp it though.

I know that Prodigy differs in the objective at the highest level, i.e. predicts accept/reject only, but do you benefit from starting with a small batch size and gradually increasing it with Prodigy as well?

I’m seeing a little bit of progress getting my Prodigy model to learn from my old training data, but only without the --no-missing flag. Getting ~80% accuracy after a few epochs, seems to stall after that but I haven’t tried using my full data yet (Prodigy trains quite a bit slower than spaCy in my experience?). Can probably improve that by playing around.

But I still can’t get anywhere with --no-missing. @honnibal, would it be possible to get a working example script based on the example Reddit data on GitHub, which I assume is “complete”? I tried importing it and training on it with --no-missing, but no luck - accuracy stays put at 0.0.

Are you starting from en_core_web_sm? That won’t work with --no-missing, as you’re starting with a model with lots of entity types, and then not showing it any examples of those.

My bad - the Reddit dataset loaded improperly, so I had to write a Python script to load it. Looks like it’s working much better now. (Do you know roughly what parameters to use to maximise accuracy on this dataset?)

Could you maybe elaborate on the point about starting from a blank model specifically when using --no-missing? I get that a pretrained model with no overlap of entities with the training data isn’t a very good starting point, but how does it differ with and without --no-missing?

PS: I ran the training with both a blank model and the en_core_web_sm, works all right in the latter case as well.

Ah, I think I misread your initial message. If you start from a model with entities like ORG, PERSON, DATE etc. and then train a new label like TECH without the other entity types being represented in the data, starting from a non-blank model wouldn’t work. But in your case it should be fine.