Hi Jen,
First of all --- your workflow sounds good; those are all the right steps. Would you mind sharing how long it took to annotate data for the 15 entity types in total (roughly), how much text you've annotated, and what sort of accuracy you're seeing?
On to your questions:
It might be helpful to look at the source of the prodigy/recipes/ner.py file, to see the implementation of the ner.make-gold recipe. Here you should see that the function simply runs nlp.pipe() over the data, using the spaCy model you provide it. So there really shouldn't be a difference between this and what you would get at runtime: you should be able to run exactly the same function. Perhaps it's the sentence splitting being used during ner.make-gold by default? I can't see much else that might be different. I'd be interested to hear the resolution.
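For comparison, here's a minimal sketch of that same check at runtime (the model name and texts are just placeholders for whatever you're actually using). If the entities differ from what ner.make-gold showed you, try passing in pre-split sentences as well, to rule the sentence splitting in or out:

import spacy

nlp = spacy.load("your_model")  # the same model you pass to ner.make-gold

texts = ["One of the texts you annotated.", "Another one."]
for doc in nlp.pipe(texts):
    print([(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents])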
Yes, unless you use dropout or something like that, the predictions should be deterministic. Even the stochastic elements like dropout can be controlled by setting the random.seed() and numpy.random.seed() values in your runtime. If you're seeing unstable predictions, this may indicate a bug.
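For instance, something like this at the top of your runtime script (the seed value itself is arbitrary, it just has to be fixed):

import random
import numpy

random.seed(0)        # Python-level randomness
numpy.random.seed(0)  # numpy-level randomness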
The implementation does support adding new entity types at runtime, by simply resizing the last layer. I have mixed feelings about whether this is a good idea. I think it may be better to retrain the whole model, if all the data is available.
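If you do want to resize at runtime, a rough sketch looks something like this (the label, example text and training loop are made up for illustration, and the exact training calls, e.g. resume_training(), depend a bit on your spaCy version):

import random
import spacy

nlp = spacy.load("en_core_web_sm")   # or your own model
ner = nlp.get_pipe("ner")
ner.add_label("ANIMAL")              # resizes the output layer for the new type

optimizer = nlp.resume_training()    # keep the existing weights
examples = [("I saw a platypus today.", {"entities": [(8, 16, "ANIMAL")]})]
for _ in range(10):
    random.shuffle(examples)
    losses = {}
    for text, annotations in examples:
        nlp.update([text], [annotations], sgd=optimizer, drop=0.2, losses=losses)

You'd still want plenty of examples of the old entity types mixed in, otherwise the model will drift towards the new label.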
Doing ensembling may work better, although it's worth noting some details about how the model works. The best reference is this video: https://www.youtube.com/watch?v=sqDHBH9IjRU
The main detail is that the model is predicting a distribution over actions at each word, because it follows a transition-based architecture. Basically we have a little state machine, and a set of actions we might take. It's a bit like reinforcement learning, except that we're able to derive the optimal next action given the current state and the objective, so the inference is much easier (this is called "imitation learning").
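If it helps to see the shape of that, here's a tiny self-contained toy (not spaCy's actual code; the real model uses BILUO actions, a dynamic oracle, etc.) showing the decide-one-action-per-word loop with a dummy scorer:

ACTIONS = ["O", "U-PERSON", "U-GPE"]

def greedy_decode(words, score_actions):
    entities = []
    for i in range(len(words)):
        scores = score_actions(words, i)       # a score for each possible action
        action = max(ACTIONS, key=scores.get)  # greedily take the best action
        if action.startswith("U-"):            # here, a single-word entity
            entities.append((i, i + 1, action[2:]))
    return entities

def score_actions(words, i):
    # Dummy stand-in for the neural network's action scores
    if words[i] == "Jen":
        return {"O": 0.1, "U-PERSON": 0.8, "U-GPE": 0.1}
    if words[i] == "Berlin":
        return {"O": 0.2, "U-PERSON": 0.1, "U-GPE": 0.7}
    return {"O": 0.9, "U-PERSON": 0.05, "U-GPE": 0.05}

print(greedy_decode(["Jen", "visited", "Berlin"], score_actions))
# [(0, 1, 'PERSON'), (2, 3, 'GPE')]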
Because the objective is decide-the-next-action, it might not be so easy to do the ensembling you want. Your outer model will have to support the dynamic loss calculation, because the cost of an action depends on previous history.
Another way to do ensembling would be to get a beam of possible entities, using beam search. You can do this with:

# docs is a list of Doc objects, e.g. from nlp.pipe(texts)
ner = nlp.get_pipe('ner')
beams = ner.beam_parse(docs, beam_width=8)  # Makes 8-best analyses
for beam in beams:
    parses = ner.moves.get_beam_parses(beam)
The parses variable will hold a list of (score, ents) tuples, where ents is a list of (start, end, label) offsets. Your outer model could work as a reranker, deciding which of these proposed parses is the best one, based on additional features you define.
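A hedged sketch of what that reranker could look like (the features and weights here are invented placeholders, not anything spaCy gives you):

def rerank(parses, weights):
    # parses: list of (model_score, ents) tuples from ner.moves.get_beam_parses()
    def candidate_score(model_score, ents):
        # Placeholder features: substitute whatever signals you actually have,
        # e.g. gazetteer matches, document-level consistency, etc.
        feats = {
            "model_score": model_score,
            "n_ents": len(ents),
            "n_long_ents": sum(1 for start, end, label in ents if end - start > 3),
        }
        return sum(weights.get(name, 0.0) * value for name, value in feats.items())

    return max(parses, key=lambda p: candidate_score(p[0], p[1]))[1]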
Before doing reranking experiments, you should always perform a couple of small sanity checks to verify that the experiment is worthwhile. The first sanity check is to ask what accuracy you'd get if you had a perfect reranker that always picked the best parse in your beam on your validation data. The second sanity check is to figure out how often the model's first-ranked parse is actually the best one in the beam. If the first-ranked parse is the best one, say, 98% of the time, there's not much value in a new reranker. Finally, you want to check how the answers to the first two questions change as the beam size increases. This will tell you what beam size to focus on. For instance, you may find that your best-case accuracy barely improves after a beam width of 4, but the top 4 parses are often misranked. This is a good scenario for a reranking model.
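If it's useful, here's a rough sketch of how those checks might look in code, assuming you've already collected the gold entities and the beam candidates per document, and using entity-level F1 as the per-parse quality measure (swap in whatever metric you actually care about):

def f1(pred_ents, gold_ents):
    pred, gold = set(pred_ents), set(gold_ents)
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def beam_sanity_checks(beam_parses, gold):
    # beam_parses: per doc, a list of (model_score, ents) candidates
    # gold: per doc, the gold (start, end, label) tuples
    oracle, first, first_is_best = 0.0, 0.0, 0
    for candidates, gold_ents in zip(beam_parses, gold):
        candidates = sorted(candidates, key=lambda c: c[0], reverse=True)
        quality = [f1(ents, gold_ents) for _, ents in candidates]
        oracle += max(quality)        # perfect-reranker upper bound
        first += quality[0]           # what you get without reranking
        first_is_best += quality[0] >= max(quality)
    n = len(gold)
    return oracle / n, first / n, first_is_best / n

Run that for a few beam widths (say 2, 4, 8, 16) and compare how the oracle score and the first-is-best rate move as the beam grows.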