The answer to this question is a little bit involved. The short answer is this:
```python
from collections import defaultdict

import spacy

# Number of alternate analyses to consider. More is slower, and not
# necessarily better -- you need to experiment on your problem.
beam_width = 16
# This clips solutions at each step. We multiply the score of the top-ranked
# action by this value, and use the result as a threshold. This prevents the
# parser from exploring options that look very unlikely, saving a bit of
# efficiency. Accuracy may also improve, because we've trained on the greedy
# objective.
beam_density = 0.0001

nlp = spacy.load('en_core_web_sm')
# `texts` is assumed to be your list of input strings.
docs = list(nlp.pipe(texts, disable=['ner']))
beams = nlp.entity.beam_parse(docs, beam_width=beam_width, beam_density=beam_density)

for doc, beam in zip(docs, beams):
    entity_scores = defaultdict(float)
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score
```
Here’s the longer explanation. First, spaCy implements two different objectives for named entity parsing:
The greedy imitation learning objective. This objective asks, “Which of the available actions will introduce no new errors if I perform it from this state?” For instance, if we’ve gotten the first word of an entity wrong, and the next word is also not inside an entity, we’re agnostic about whether the fake entity continues or closes immediately. This makes life easier for the model, because the correct next action isn’t defined by whether the current state is itself correct. There’s some notion of “sunk cost”, basically. This greedy imitation learning objective maximises the expected F1 score, but doesn’t do well at giving per-token probabilities: once an entity has begun, we might be very confident that we should continue it, even if opening it in the first place was unlikely to be right. So we can’t get good probabilities out of the transition scores produced by the greedy model.
The global beam-search objective. Instead of optimising the individual transition decisions, the global model asks whether the final parse is correct. To optimise this objective, we build the set of top-k most likely incorrect parses, and top-k most likely correct parses. We assume that the model assigns 0 weight to all parses outside these sets, and use the two sets to estimate the gradient of the loss. We then backprop through all the intermediate states. The beam search allows probabilities over entities to be estimated, because you have multiple analyses. The probability of some entity is then simply the sum of the scores of the parses containing it, normalised by the total score assigned to all parses in the beam.
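That last step, turning beam scores into entity probabilities, can be sketched in plain Python. The parse scores and spans below are made up for illustration, and the snippet doesn't depend on spaCy at all; it just shows the sum-and-normalise arithmetic described above:

```python
from collections import defaultdict

# Hypothetical beam output: (score, entities) pairs, where each entity is a
# (start, end, label) tuple. Scores and spans are invented for illustration.
beam_parses = [
    (0.5, [(0, 2, 'ORG'), (5, 6, 'PERSON')]),
    (0.3, [(0, 2, 'ORG')]),
    (0.2, [(0, 1, 'ORG'), (5, 6, 'PERSON')]),
]

# Sum the scores of every parse that contains each entity.
entity_scores = defaultdict(float)
for score, ents in beam_parses:
    for start, end, label in ents:
        entity_scores[(start, end, label)] += score

# Normalise by the total score assigned to all parses in the beam.
total = sum(score for score, _ in beam_parses)
entity_probs = {ent: s / total for ent, s in entity_scores.items()}

print(round(entity_probs[(0, 2, 'ORG')], 3))     # 0.8
print(round(entity_probs[(5, 6, 'PERSON')], 3))  # 0.7
```

So the entity (0, 2, 'ORG') appears in parses worth 0.5 + 0.3 of the beam's total mass of 1.0, giving it probability 0.8, while the competing span (0, 1, 'ORG') only gets 0.2.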
You can use beam decoding with weights optimised using the greedy procedure. However, without doing any beam updates, the probabilities likely won’t be well calibrated, so the scores may or may not be useful for your application. In Prodigy, we start out with pretrained models that have usually been optimised with the greedy procedure. During annotation, the model is updated using the beam objective, correcting the initial bias.