Displaying a confidence score next to a user-defined entity

Hi,

I came across one of your posts about generating scores for entities using beam search. I have a couple of questions on this topic:

It gives me a score based on the “sum of the scores of the parses containing it” for named entity types defined by the out-of-the-box spaCy model, such as “LOC”, “PER” etc.
I built a Prodigy model to identify labels such as “LOCATION”, which marks a field name in some location.

My question is: when I run the beam code, what exactly do 103, 345 and 1109 mean? Are they sums of parse scores or confidence scores? Is there a way I can get a confidence score to rank all the “LOCATION” labels?

103 Trym LOCATION
345 Trym LOCATION
1109 Trym LOCATION
0.9997078684409141 []
0.0001524225082572684 [(0, 1, 'CARDINAL')]
9.968089809345492e-05 [(0, -1, 'DATE')]
3.133614358889278e-05 [(0, -1, 'CARDINAL')]
8.692009146304466e-06 [(0, -1, 'PERSON')]
0.999980689646118 []
1.7380253187471615e-05 [(0, -1, 'DATE')]
9.215215413671059e-07 [(0, -1, 'PERSON')]
7.847656971651813e-07 [(0, -1, 'CARDINAL')]
1.0250886506604742e-07 [(0, -1, 'MONEY')]
9.080580460904744e-08 [(0, -1, 'GPE')]
2.876767626301381e-08 [(0, -1, 'ORG')]
1.4568741310754125e-09 [(0, -1, 'NORP')]
2.742360438007307e-10 [(0, -1, 'FAC')]
0.999980689646118 []

Can you tell me how to get a confidence score for user-defined labels?

Thank you

Could you link the comment? I want to be sure which bit of code is being used.

The scores here are probably “logits” – negative log likelihoods. They might be unnormalised, in which case they’re really just scores, not probabilities.

Sure, here's the link.

Also, here is the code:

text = content
doc = nlp2(text)
for ent in doc.ents:
    print (ent.start_char, ent.text, ent.label_)
    docs = list(nlp.pipe(list(text), disable=['ner']))
(beams, somethingelse) = nlp.entity.beam_parse(docs, beam_width=16, beam_density=0.0001)

print (content)

for beam in beams:
	for score, ents in nlp.entity.moves.get_beam_parses(beam):
		print (score, ents)

		entity_scores = defaultdict(float)
		for start, end, label in ents:
			# print ("here")
			entity_scores[(start, end, label)] += score
	print ('entity_scores', entity_scores)

The output:
0.9997078684409141 []
0.0001524225082572684 [(0, 1, 'CARDINAL')]
9.968089809345492e-05 [(0, -1, 'DATE')]
3.133614358889278e-05 [(0, -1, 'CARDINAL')]
8.692009146304466e-06 [(0, -1, 'PERSON')]
entity_scores defaultdict(<class 'float'>, {(0, -1, 'PERSON'): 8.692009146304466e-06})
0.999980689646118 []
1.7380253187471615e-05 [(0, -1, 'DATE')]
9.215215413671059e-07 [(0, -1, 'PERSON')]
7.847656971651813e-07 [(0, -1, 'CARDINAL')]
1.0250886506604742e-07 [(0, -1, 'MONEY')]
9.080580460904744e-08 [(0, -1, 'GPE')]
2.876767626301381e-08 [(0, -1, 'ORG')]
1.4568741310754125e-09 [(0, -1, 'NORP')]
2.742360438007307e-10 [(0, -1, 'FAC')]
entity_scores defaultdict(<class 'float'>, {(0, -1, 'FAC'): 2.742360438007307e-10})

So basically, it shows scores for the built-in entity types but not for user-defined ones.
Is there a way to include user-defined entities like “LOCATION”?

Thank you

You can see the code for this method here: https://github.com/explosion/spaCy/blob/master/spacy/syntax/ner.pyx#L122

As you can see, it doesn’t handle entities you’ve added specially. So if you’re not seeing any entities of your types, that suggests it’s not assigning any probability to those entities in the beam. You’ll likely have to train more.
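As a quick sanity check, you could also confirm that your custom label actually made it into the trained model before looking for it in the beam parses. A minimal sketch, assuming spaCy v2 and a hypothetical model path:

import spacy

nlp = spacy.load("/path/to/your_prodigy_model")  # hypothetical path to your trained model
# The NER component's known labels; "LOCATION" should appear here if it was trained in.
print(nlp.entity.labels)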

Oh okay. I used ner.teach on the data until I had annotated 97% of the samples, then trained with ner.batch-train, which gave an accuracy of 79%.

Is there any way I can quantify the confidence with which the not-so-well-trained model is predicting the labels? Basically an estimated rank/score/confidence to show the entity label predicted by the entity recognizer.

Thank you

Hmm. Actually that code looks weird. Try this:

text = content
doc = nlp.make_doc(text)
(beams, somethingelse) = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)
for score, ents in nlp.entity.moves.get_beam_parses(beams[0]):
    print (score, ents)
    entity_scores = defaultdict(float)
    for start, end, label in ents:
        # print ("here")
        entity_scores[(start, end, label)] += score
        print ('entity_scores', entity_scores)

Hi Matthew,

Thank you for sharing this piece of code. I had a question about the interpretation of these scores. For example, the following is the output for one of the documents:

Code:

text = content
doc = nlp.make_doc(text)
(beams, somethingelse) = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)
for score, ents in nlp.entity.moves.get_beam_parses(beams[0]):
    print (score, ents)
    entity_scores = defaultdict(float)
    for start, end, label in ents:
        # print ("here")
        entity_scores[(start, end, label)] += score
        print ('entity_scores', entity_scores)

for (start, end, label), value in entity_scores.items():
    if label == 'LOCATION':
        print(start, tokens[start], value)

Output:
[(902, 'HUMBLY', 0.16623281600085096), (999, 'Hinton', 0.16623281600085096), (1627, 'Horndean', 0.16623281600085096), (2067, 'Horndean', 0.16623281600085096), (2712, 'Set', 0.16623281600085096), (3548, 'Horndean', 0.16623281600085096)]

In one of the support board posts, you had mentioned that “The probability of some entity is then simply the sum of the scores of the parses containing it, normalised by the total score assigned to all parses in the beam.”
My question is: why are the confidence scores of every label the same? Are they normalized based on the beam width (16 in this case)? Is there a way to rank the confidence of each LOCATION before the score is averaged out?

Thank you so much. I’ve been looking at the GitHub pages to understand the concepts in depth. Appreciate all the help!

Hi @Jashmi1,

Sorry I didn’t see this sooner. The reason you’ll see the same score coming up a lot is because there are only a few candidates in the beam. So if two entities occur in the same two parses of the beam, they’ll receive the same score. It’s far from a perfect way to estimate probabilities, but it’s the best solution we have in spaCy at the moment.
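For reference, the normalisation quoted above (the sum of the scores of the parses containing an entity, divided by the total score of all parses in the beam) could look roughly like this, reusing the beams object from the earlier snippets:

from collections import defaultdict

entity_scores = defaultdict(float)
total_score = 0.0
for score, ents in nlp.entity.moves.get_beam_parses(beams[0]):
    total_score += score  # total score assigned to all parses in the beam
    for start, end, label in ents:
        entity_scores[(start, end, label)] += score  # sum over parses containing this entity

# Normalise so the values behave like probabilities
entity_probs = {key: value / total_score for key, value in entity_scores.items()}
print(entity_probs)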

Hey Matt, the spans produced by your code do not match the extracted entity spans (please see the output below). I'm seeing the same thing in other examples floating around. How are we supposed to match these confidence scores to named entities?

Output:

Trump says he's answered Mueller's Russia inquiry questions – live
0.5167307507127545 [(0, 1, 'PERSON'), (5, 6, 'PERSON'), (7, 8, 'GPE')]
0.11855046319244189 [(0, 1, 'PERSON'), (5, 6, 'GPE'), (7, 8, 'GPE')]
0.09203954275761955 [(0, 1, 'PERSON'), (5, 6, 'ORG'), (7, 8, 'GPE')]
0.09147111404381013 [(0, 1, 'PERSON'), (7, 8, 'GPE')]
0.05509786664232669 [(0, 1, 'ORG'), (5, 6, 'GPE'), (7, 8, 'GPE')]
0.04277657223867647 [(0, 1, 'ORG'), (5, 6, 'ORG'), (7, 8, 'GPE')]
0.04251206327276357 [(0, 1, 'ORG'), (7, 8, 'GPE')]
0.021270337832401842 [(0, 1, 'WORK_OF_ART'), (5, 6, 'PERSON'), (7, 8, 'GPE')]
0.010583704059039257 [(0, 1, 'PERSON'), (5, 6, 'PRODUCT'), (7, 8, 'GPE')]
0.0064581979019569715 [(5, 6, 'PERSON'), (7, 8, 'GPE')]
0.0014173903772879402 [(0, 1, 'PERSON'), (5, 7, 'ORG'), (7, 8, 'GPE')]
0.0006587455507824188 [(0, 1, 'ORG'), (5, 7, 'ORG'), (7, 8, 'GPE')]
0.00029578261045175785 [(0, 1, 'PERSON'), (5, 6, 'PERSON'), (7, 8, 'NORP')]
0.0001374688076868621 [(0, 1, 'ORG'), (5, 6, 'PERSON'), (7, 8, 'NORP')]

Alexander Zverev reaches ATP Finals semis then reminds Lendl who is boss
0.8122696184745521 [(0, 2, 'PERSON'), (8, 9, 'PERSON')]
0.16061426268417747 [(0, 2, 'PERSON'), (8, 9, 'GPE')]
0.009975386597010165 [(0, 2, 'PERSON'), (8, 9, 'ORG')]
0.0058712557256443 [(0, 2, 'PERSON')]
0.0033844387968237043 [(0, 2, 'PERSON'), (8, 9, 'EVENT')]
0.003130046443562457 [(0, 2, 'PERSON'), (8, 9, 'LOC')]
0.0022775503442237464 [(0, 2, 'PERSON'), (8, 9, 'DATE')]
0.0007166247398203144 [(0, 2, 'PERSON'), (8, 9, 'WORK_OF_ART')]
0.0005546103011753504 [(0, 2, 'PERSON'), (8, 9, 'TIME')]
0.0004147189385241843 [(0, 2, 'PERSON'), (3, 4, 'ORG'), (8, 9, 'PERSON')]
0.0003660859211275216 [(0, 2, 'PERSON'), (3, 4, 'ORG'), (8, 9, 'GPE')]
0.00020208086291369524 [(0, 2, 'PERSON'), (8, 9, 'ORDINAL')]
0.00012847589420599448 [(0, 2, 'PERSON'), (8, 9, 'NORP')]
9.484427623898815e-05 [(0, 2, 'PERSON'), (8, 9, 'CARDINAL')]

Britain's worst landlord to take nine years to pay off string of fines
1.0 [(0, 1, 'GPE'), (6, 8, 'DATE')]

Tom Watson: people's vote more likely given weakness of May's position
0.930427843387749 [(0, 2, 'PERSON'), (11, 12, 'DATE')]
0.033269318831955434 [(0, 2, 'PERSON')]
0.01146200992116456 [(0, 2, 'PERSON'), (11, 12, 'PERSON')]
0.010313748169603341 [(0, 2, 'PERSON'), (11, 12, 'ORG')]
0.006974692242323361 [(0, 2, 'PERSON'), (11, 12, 'EVENT')]
0.004421763914022536 [(0, 2, 'PERSON'), (11, 12, 'WORK_OF_ART')]
0.0014732389947636328 [(0, 2, 'PERSON'), (11, 12, 'GPE')]
0.0009557775256411386 [(0, 2, 'PERSON'), (11, 12, 'TIME')]
0.0004951325909690101 [(0, 2, 'PERSON'), (11, 12, 'FAC')]
0.00020647442180795264 [(0, 2, 'PERSON'), (11, 12, 'LAW')]

Any guidance on this would be extremely helpful. Perhaps you know what I'm missing, @ines?

What about the spans doesn't match? The indices refer to token indices (which you can convert to character offsets), and each line represents the confidence of the given parse of the sentence.
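For example, converting one of those keys to character offsets and the entity text might look like this (a small sketch, assuming the doc and entity_scores from the snippets above):

for (start, end, label), score in entity_scores.items():
    span = doc[start:end]  # token indices -> Span
    print(span.start_char, span.end_char, span.text, label, score)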

Hello Ines, Thank you for your message (and sorry I didn't respond sooner - I hadn't seen your message until just now)!

I think I understand! Knowing that those indices refer to token spans rather than character spans was key. So thank you for clarifying :slight_smile:

With that in mind, I believe there is a mistake in @honnibal's original code. Matt said an entity's confidence score is supposed to be the sum of its scores across all parses in the beam, divided by the total score of all parses. His code, as originally written, says:

text = content
doc = nlp.make_doc(text)
(beams, somethingelse) = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)
for score, ents in nlp.entity.moves.get_beam_parses(beams[0]):
    print (score, ents)
    entity_scores = defaultdict(float) #Move this line...
    for start, end, label in ents:
        entity_scores[(start, end, label)] += score
        print ('entity_scores', entity_scores)

but this resets entity_scores on every iteration of the beam-parse loop, overwriting the running summation. Instead, I think he meant to write:

text = content
doc = nlp.make_doc(text)
(beams, somethingelse) = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)
entity_scores = defaultdict(float) #This line is moved above the for loop
for score, ents in nlp.entity.moves.get_beam_parses(beams[0]):
    print (score, ents)
    for start, end, label in ents:
        entity_scores[(start, end, label)] += score
        print ('entity_scores', entity_scores)

This way, the scores for each entity are summed across all parses in the beam.

I think the essence of the question "How can I see spaCy's confidence scores?" is that we want to judge whether to accept or ignore a recognized entity. To that end, below I am including code that does exactly that, rejecting any entity with less than 90% confidence:

from collections import defaultdict
import spacy
nlp = spacy.load('en_core_web_sm')

examples = [
    "Trump says he's answered Mueller's Russia inquiry questions \u2013 live",
    "Alexander Zverev reaches ATP Finals semis then reminds Lendl who is boss",
    "Britain's worst landlord to take nine years to pay off string of fines",
    "Tom Watson: people's vote more likely given weakness of May's position",
]

def filter_low_confidence_entities(entities, cutoff = 0.8):
  return {key: value for key, value in entities.items() if value > cutoff}

for text in examples:
  doc = nlp.make_doc(text)
  beams = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)
  entity_scores = defaultdict(float)
  total_score = 0
  print(text)
  for score, ents in nlp.entity.moves.get_beam_parses(beams[0]):
    total_score += score
    for start, end, label in ents:
      entity_scores[(start, end, label)] += score

  normalized_beam_score = {dict_key: dict_value/total_score for dict_key, dict_value in entity_scores.items()}
  high_confidence_entities = filter_low_confidence_entities(normalized_beam_score, 0.9)
  high_confidence_entity_texts = {key: doc[int(key[0]): int(key[1])] for key, value in high_confidence_entities.items()}

  print(' All entities with their normalized beam score:', normalized_beam_score)
  print(' Entities with over 90% confidence:', high_confidence_entities)
  print(' Text of entities with over 90% confidence:', high_confidence_entity_texts)
  print()

This generates the output:

Trump says he's answered Mueller's Russia inquiry questions – live
 All entities with their normalized beam score: {(0, 1, 'PERSON'): 0.8310895450340546, (5, 6, 'PERSON'): 0.544892847991814, (7, 8, 'GPE'): 0.9995667492150351, (5, 6, 'GPE'): 0.17364807605704594, (5, 6, 'ORG'): 0.1348159179697731, (0, 1, 'ORG'): 0.14118210418236288, (0, 1, 'WORK_OF_ART'): 0.021270195958141717, (5, 6, 'PRODUCT'): 0.010583714212205777, (5, 7, 'ORG'): 0.0020761379197505566, (7, 8, 'NORP'): 0.00043325078496490415}
 Entities with over 90% confidence: {(7, 8, 'GPE'): 0.9995667492150351}
 Text of entities with over 90% confidence: {(7, 8, 'GPE'): Russia}

Alexander Zverev reaches ATP Finals semis then reminds Lendl who is boss
 All entities with their normalized beam score: {(0, 2, 'PERSON'): 1.0, (8, 9, 'PERSON'): 0.8126843355674371, (8, 9, 'GPE'): 0.16098034886646048, (8, 9, 'ORG'): 0.009975386613193055, (8, 9, 'EVENT'): 0.0033844388023142186, (8, 9, 'LOC'): 0.0031300464486402756, (8, 9, 'DATE'): 0.0022775503479185756, (8, 9, 'WORK_OF_ART'): 0.0007166247409828817, (8, 9, 'TIME'): 0.0005546103020750846, (3, 4, 'ORG'): 0.0007808016968760548, (8, 9, 'ORDINAL'): 0.00020208240500204164, (8, 9, 'NORP'): 0.0001284758944144186, (8, 9, 'CARDINAL'): 9.48442763928523e-05}
 Entities with over 90% confidence: {(0, 2, 'PERSON'): 1.0}
 Text of entities with over 90% confidence: {(0, 2, 'PERSON'): Alexander Zverev}

Britain's worst landlord to take nine years to pay off string of fines
 All entities with their normalized beam score: {(0, 1, 'GPE'): 1.0, (6, 8, 'DATE'): 1.0}
 Entities with over 90% confidence: {(0, 1, 'GPE'): 1.0, (6, 8, 'DATE'): 1.0}
 Text of entities with over 90% confidence: {(0, 1, 'GPE'): Britain, (6, 8, 'DATE'): nine years}

Tom Watson: people's vote more likely given weakness of May's position
 All entities with their normalized beam score: {(0, 2, 'PERSON'): 1.0, (11, 12, 'DATE'): 0.9304278463190857, (11, 12, 'PERSON'): 0.011462009957275915, (11, 12, 'ORG'): 0.010313748202097068, (11, 12, 'EVENT'): 0.0069746922642973065, (11, 12, 'WORK_OF_ART'): 0.004421763927953416, (11, 12, 'GPE'): 0.0014732389994051099, (11, 12, 'TIME'): 0.00095577752865234, (11, 12, 'FAC'): 0.0004951325925289379, (11, 12, 'LAW'): 0.00020647127193283314}
 Entities with over 90% confidence: {(0, 2, 'PERSON'): 1.0, (11, 12, 'DATE'): 0.9304278463190857}
 Text of entities with over 90% confidence: {(0, 2, 'PERSON'): Tom Watson, (11, 12, 'DATE'): May}

Interestingly, this eliminates a lot of entity candidates. And there's even one false positive in here, where May was recognized as a date rather than part of a person's name (I believe Theresa "May"). Do you or your team have any suggestions for:

  1. eliminating/reducing these false positives?
  2. boosting the confidence of other recognized entities?
  3. a strategy for when (or whether) to accept entity candidates at different confidence thresholds?

Please let me know if something looks off but otherwise thanks again for your help! And for the amazing libraries your team makes!! :slight_smile:

Hi Andrew,

Thanks, yes that correction looks right to me!

I think the main thing we want to improve here is the calibration of the scores: we want the scores to be such that if spaCy assigns a probability of 0.9, then 90% of those predictions are correct.

Without doing the beam training, there's probably not much chance of the scores being well-calibrated. The models are by default trained with a greedy objective that tries to minimise the number of mistakes in the parse. The "probabilities" from a beam search over a model trained with this greedy objective represent how confident the model was that the next action is the least-worst one to take given the current state. The model can be in a bad state and very confident that the next action is the least-worst thing to do; that action will receive a high score. In contrast, you could have another candidate that's in a good state, but where the model is unsure which action to take next, so each of those continuations receives a lower score. So the greedy objective doesn't train the beam to do the right sort of thing.

What you can do to correct this is run the greedy model over some text to generate messily-labelled training data, and then run some iterations of training with a beam width above 1. This should help, but the calibration of the scores still won't be wonderful, which is why the procedures around this aren't so smooth. We haven't paved the path around these things because I'm not that confident in the results.
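Roughly, that procedure might look something like this sketch (assuming spaCy v2.x, a hypothetical model path and a hypothetical list of raw texts; the beam_update_prob setting here is an assumption for forcing beam updates):

import random
import spacy

nlp = spacy.load("/path/to/greedy_model")   # hypothetical: your greedy-trained model
raw_texts = ["..."]                         # hypothetical: unlabelled texts from your domain

# 1) Run the greedy model to generate messily-labelled (silver) training data.
silver = []
for text in raw_texts:
    doc = nlp(text)
    ents = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    silver.append((text, {"entities": ents}))

# 2) Switch the NER component to a beam width above 1 and keep training.
nlp.entity.cfg["beam_width"] = 16
nlp.entity.cfg["beam_density"] = 0.0001
nlp.entity.cfg["beam_update_prob"] = 1.0    # assumption: always use the beam update

optimizer = nlp.resume_training()
for i in range(5):
    random.shuffle(silver)
    losses = {}
    for text, annotations in silver:
        nlp.update([text], [annotations], sgd=optimizer, drop=0.2, losses=losses)
    print(i, losses)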

I'm actually working on improving the beam parsing for v3 right now, including efficiency improvements. The configuration management in v3 also makes it much easier to expose the various options.

Hey Matt, thank you for this very nice elaboration :slight_smile: So if I'm correctly connecting what you're saying with what I just learned about beam search, you're saying that spaCy models are trained with a beam width of 1 (hence maximally greedy). However, we can set the beam width to any number larger than 1 (the larger this value, the closer to breadth-first search, and hence the less greedy, it becomes). The whole point of a greedy search is to reduce compute and training time, so the higher the beam_width, the longer training takes (which perhaps explains why your default is set to 1).

With that said, I think I followed/implemented your recommendations correctly. I already had an NER model trained with greedy search (the default method) on a set of 30k+ labelled records, so yesterday I retrained it with the beam, using your suggestion to set:

nlp.entity.cfg['beam_width'] = 16
nlp.entity.cfg['beam_density'] = 0.0001

After about 300 iterations of this retraining, I applied the NER model to return only the entity with the highest beam-search normalized confidence score for each of the 30k records. Then I compared those least-worst guesses to the 30k known labels. The result appears dismal: shown below as Parser 6, only 47% of the entities with the highest beam confidence score actually matched their respective (correct) label.

[image: accuracy comparison table for the different parsers]

In contrast, Parser 3, whose model is the same as the one used for Parser 6 minus the extra beam training, generates entity guesses with an accuracy of 92.8%. Ha!

This gives me the mixed feeling of being so close and yet so far from my goal. Parser 3 is 93% accurate, but it doesn't tell me which 93% of my 30k records are correct, so I can't really use it in production on unlabeled data. Parser 6, on the other hand, does (almost) give me a threshold for rejecting entity guesses (based on its normalized beam score), but on average that best guess is right only 47% of the time. So this is also unusable in production :confused:

Does that make sense? Considering your public position on this method of "confidence" extraction (that it's not super great), my question is: does my outcome here track with your experience, or do you think I've done something wrong?

In the event I did everything right and this approach is a dead end, I suppose my last recourse is to ask whether you can provide an estimate of when v3 may be released, so that I can use that version to get the confidence scores we seek. Thank you in advance.

Are you able to comment on my last reply @honnibal? Thank you in advance :slight_smile:

Hi Andrew,
Sorry for the delay replying, but please don't bump the thread like this next time.

Anyway, to answer your question:

This is almost right, but not quite. I'm not sure the distinction is really relevant, but for the record: the greedy model actually has a different objective from the beam model. The beam model assigns probability distributions over whole parses, while the greedy model attempts to predict which actions will be optimal given the current state. A beam-objective model run with a beam width of 1 has poorly defined behaviour once it's made a few mistakes, as its objective doesn't encourage the parser to "make the best of a bad situation" the way the greedy objective does. The greedy objective therefore has a structural advantage when the search width is narrow.

Well, 93% accuracy is a pretty decent score. It's hard to know how well calibrated the probabilities will be in any case. You'll still be in the position of not "knowing" which of your examples are correct, even if you get confidence scores attached, as you don't know in which cases the confidence scores themselves will be wrong.

It's sad that the beam hasn't worked better for you, but the beam training is admittedly rather fiddly: there are more hyper-parameters to juggle, and I'm not sure why it performs poorly on some problems. I think I've made some fixes in v3 that should help.

Even once v3 is released, you still might be in the same position though. So I think you'll be best off trying to implement some rules or heuristics that can filter out problematic errors from your data. As I said, even once you have scores, you won't really know whether they're correct.
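For instance, a rule-based filter for the "May" false positive discussed above might look something like this (a hypothetical sketch; KNOWN_SURNAMES is an assumption, not anything built into spaCy):

# Drop DATE predictions whose text is also a surname you know appears in your data.
KNOWN_SURNAMES = {"May"}  # hypothetical list; extend for your domain

def filter_entities(doc):
    return [ent for ent in doc.ents
            if not (ent.label_ == "DATE" and ent.text in KNOWN_SURNAMES)]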


Hi @honnibal, is there an updated link I can find for this? I am trying to generate scores for my detected entities; the current link is broken.

Hi @thalish ,

There's been progress on this for spaCy v3, so we'll have updated workflows for this shortly after Prodigy v1.10 is released. I don't think it's a good investment of time to go back over the previous approaches, because they will be deprecated soon and they weren't that effective anyway.

@honnibal Thanks! Is there any particular page I should track so I don't miss the updates regarding this?

Is there a follow-up to this? I'm trying to understand the confidence scores for each annotation of my custom model output (e.g., MICROSOFT is a Company with 95% confidence. NATHAN is a PERSON with 99% confidence) so I can "ignore" assigned labels that are under a certain threshold from being presented in the output.

As of now, all I see is a discussion of a spaCy solution, SpanCategorizer. If I've already trained an NER model with the old Prodigy (three separate non-overlapping labels), is it possible to reuse the .jsonl file for this purpose?
Thanks!

Edit: I also see this: Add SpanCategorizer component by honnibal · Pull Request #6747 · explosion/spaCy · GitHub but it doesn't provide much in the way of implementation guidance.

Edit 2: I noticed that the newest Prodigy (with spaCy 3+ support) has span categorization: https://prodi.gy/docs/span-categorization. There is a new train --spancat recipe; I'm going to explore whether I can reuse my existing NER .jsonl files. Will report back here. Sorry for this monologue :slight_smile:
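If the spancat route works out, the per-span confidence scores should be directly accessible on the doc. Something like this sketch (assuming spaCy v3, a hypothetical trained pipeline path, and the default "sc" spans key):

import spacy

nlp = spacy.load("./my_spancat_model")      # hypothetical: a pipeline trained with --spancat
doc = nlp("NATHAN works at MICROSOFT.")
spans = doc.spans["sc"]                     # default spans key for the span categorizer
for span, score in zip(spans, spans.attrs["scores"]):
    print(span.text, span.label_, score)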
