Difference in quality in make-gold vs trained model's annotations (and others)


My team has been using Prodigy to train an NER model and we have a couple of questions. For context, we've trained the model (15 entity types) from scratch: first using ner.manual to annotate a set of data, then training a model from these annotations, then moving on to an annotation phase using ner.make-gold. We then retrained that model on the larger set of annotated data and used it to automatically annotate the examples that were not manually labelled. We were wondering:

  1. There seems to be a significant difference in quality between the automatic annotations and what the model suggests during the ner.make-gold phase. The model seems to miss entities that would typically be suggested when annotating with make-gold. Any idea why this is the case?

  2. If we use a trained model to annotate a new set of data, are the labels going to be consistent across runs? That is, if we used the same model to annotate the same set of data, would we expect the labels to be the same every time?

  3. If we wanted to do transfer learning using the pre-trained models, would it be possible to load a pre-trained model and retrain the last layer with a new set of entities? Alternatively, would it be possible to get the entire probability distribution from the last layer and feed that into another network?




Hi Jen,

First of all, your workflow sounds good; those sound like all the right steps. Would you mind sharing roughly how long it took to annotate data for the 15 entity types in total, how much text you've annotated, and what sort of accuracy you're seeing?

On to your questions:

On the quality difference: it might be helpful to look at the source of the prodigy/recipes/ner.py file to see how the ner.make-gold recipe is implemented. There you should see that the function simply runs nlp.pipe() over the data, using the spaCy model you provide. So there really shouldn't be a difference between this and what you would get at runtime; you should be able to run exactly the same function. Perhaps it's the sentence splitting that ner.make-gold applies by default? I can't see much else that might be different. I'd be interested to hear the resolution.
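One way to test the sentence-splitting theory is to run the same model over each text twice, once as a whole and once in segments, and diff the entities. Here's a minimal sketch of that comparison. The annotate callable and the (offset, text) segment format are assumptions for illustration, not part of Prodigy. With spaCy, annotate could be something like lambda text: [(e.start_char, e.end_char, e.label_) for e in nlp(text).ents].

```python
def compare_whole_vs_segments(annotate, text, segments):
    """Diff the entities found on the whole text vs. on its segments.

    annotate: hypothetical callable, text -> list of (start_char, end_char, label)
    segments: list of (char_offset, segment_text) pairs taken from `text`
    """
    whole = set(map(tuple, annotate(text)))

    split = set()
    for offset, segment in segments:
        # Shift segment-relative offsets back into whole-text coordinates.
        for start, end, label in annotate(segment):
            split.add((start + offset, end + offset, label))

    return {
        "only_whole": sorted(whole - split),
        "only_segments": sorted(split - whole),
    }
```

If only_whole or only_segments is regularly non-empty on your data, the segmentation is probably what's changing the predictions.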

On consistency: yes, unless you use dropout or something like that, the predictions should be deterministic. Even stochastic elements like dropout can be controlled by calling random.seed() and numpy.random.seed() in your runtime. If you're seeing unstable predictions, this may indicate a bug.
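If you want to check this on your side, a quick test is to annotate the same texts a few times with a fixed seed and compare the results. This is only a sketch: annotate is again a hypothetical callable wrapping your model (text -> list of (start, end, label)), and if your pipeline uses numpy you'd seed numpy.random at the same spot.

```python
import random


def is_deterministic(annotate, texts, runs=3, seed=0):
    """Return True if repeated runs over `texts` give identical entities."""
    reference = None
    for _ in range(runs):
        # Reset the seed before every run; also call numpy.random.seed(seed)
        # here if your pipeline relies on numpy's random state.
        random.seed(seed)
        result = [sorted(map(tuple, annotate(text))) for text in texts]
        if reference is None:
            reference = result
        elif result != reference:
            return False
    return True
```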

On transfer learning: the implementation does support adding new entity types at runtime, by simply resizing the last layer. I have mixed feelings about whether this is a good idea. I think it may be better to retrain the whole model if all the data is available.

Doing ensembling may work better, although it's worth noting some details about how the model works. The best reference is this video: https://www.youtube.com/watch?v=sqDHBH9IjRU

The main detail is that the model is predicting a distribution over actions at each word, because it follows a transition-based architecture. Basically we have a little state machine, and a set of actions we might take. It's a bit like reinforcement learning, except that we're able to derive the optimal next action given the current state and the objective, so the inference is much easier (this is called "imitation learning").
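To make the state machine idea a bit more concrete, here's a toy sketch. It is a simplified illustration and not spaCy's actual implementation: I'm assuming one action per token from a BILUO-style set (Begin, In, Last, Unit, Out), and the function names are made up. The point is that the set of valid actions depends on the history, which is why the model predicts actions and not independent per-token labels.

```python
def valid_actions(open_label, labels):
    """Actions allowed in the current state.

    open_label is the label of the entity currently being built, or None.
    """
    if open_label is None:
        return ["O"] + [f"{a}-{label}" for label in labels for a in ("B", "U")]
    # Inside an entity we can only continue it or close it.
    return [f"I-{open_label}", f"L-{open_label}"]


def run_actions(actions, labels):
    """Apply one action per token and return (start, end, label) entities."""
    ents = []
    open_label, start = None, None
    for i, action in enumerate(actions):
        if action not in valid_actions(open_label, labels):
            raise ValueError(f"invalid action {action!r} at token {i}")
        if action == "O":
            continue
        move, label = action.split("-", 1)
        if move == "U":      # single-token entity
            ents.append((i, i + 1, label))
        elif move == "B":    # open a multi-token entity
            open_label, start = label, i
        elif move == "L":    # close the open entity
            ents.append((start, i + 1, label))
            open_label, start = None, None
        # move == "I": stay inside the open entity
    if open_label is not None:
        raise ValueError("sequence ended inside an entity")
    return ents
```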

Because the objective is decide-the-next-action, it might not be so easy to do the ensembling you want. Your outer model will have to support the dynamic loss calculation, because the cost of an action depends on previous history.

Another way to do ensembling would be to get a beam of possible entities using beam search. You can do this with:

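# Assumes nlp is your loaded spaCy model and texts is a list of strings
docs = list(nlp.pipe(texts, disable=['ner']))
ner = nlp.get_pipe('ner')
beams = ner.beam_parse(docs, beam_width=8)  # Makes 8-best analyses
for doc, beam in zip(docs, beams):
    parses = ner.moves.get_beam_parses(beam)  # list of (score, ents) tuples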

The parses variable will hold a list of (score, ents) tuples, where ents is a list of (start, end, label) tuples. Your outer model could work as a reranker, deciding which of these proposed parses is the best one, based on additional features you define.
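As a rough sketch of what the reranking step could look like: the function below combines the beam score with an extra score you compute yourself. The extra_score callable and the weight are assumptions for illustration; in practice the extra score would come from whatever outer model or features you define.

```python
def rerank(parses, extra_score, weight=1.0):
    """Re-order beam parses using an additional scoring function.

    parses: list of (score, ents) tuples, ents being (start, end, label) tuples
    extra_score: hypothetical callable, ents -> float (higher is better)
    """
    rescored = [
        (score + weight * extra_score(ents), ents)
        for score, ents in parses
    ]
    # Best combined score first.
    return sorted(rescored, key=lambda item: item[0], reverse=True)
```

For example, an extra_score that penalises implausibly long entities would push parses containing them down the list, and rerank(parses, extra_score)[0][1] would give you the entities of the new top parse.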

Before doing reranking experiments, you should always perform a couple of small sanity checks to verify that the experiment is worthwhile. The first sanity check is to ask what accuracy you'd get on your validation data if you had a perfect reranker that always picked the best parse in your beam. The second sanity check is to figure out how often the first-ranked parse is actually the most accurate one in the beam. If the first-ranked parse is the best one, say, 98% of the time, there's not much value in a new reranker. Finally, you want to check how the answers to the first two questions change as the beam size increases. This will tell you what beam size to focus on. For instance, you may find that your best-case accuracy barely improves after a beam width of 4, but the top 4 parses are often misranked. This is a good scenario for a reranking model.
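Those checks could be sketched roughly like this. I'm assuming entity-level F-score as the accuracy measure and gold annotations in the same (start, end, label) format as the parses; both are choices made for the example, so swap in whatever metric you actually evaluate with.

```python
def f_score(pred, gold):
    """Entity-level F1 between two collections of (start, end, label)."""
    pred, gold = set(map(tuple, pred)), set(map(tuple, gold))
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def beam_sanity_checks(beams, golds, k):
    """beams: per document, a list of (score, ents); golds: per document, gold ents."""
    top1_total = oracle_total = top1_is_best = 0.0
    for parses, gold in zip(beams, golds):
        # Keep the k highest-scoring parses, best first.
        top_k = sorted(parses, key=lambda p: p[0], reverse=True)[:k]
        scores = [f_score(ents, gold) for _, ents in top_k]
        top1_total += scores[0]
        oracle_total += max(scores)              # what a perfect reranker would get
        top1_is_best += scores[0] == max(scores)
    n = len(golds)
    return {
        "top1_f": top1_total / n,
        "oracle_f": oracle_total / n,
        "top1_is_best": top1_is_best / n,
    }
```

Running this for k = 1, 2, 4, 8 and so on shows where oracle_f flattens out and how large the gap to top1_f is at that width.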