After some experimentation, I think I figured out how to use EntityRecognizer.
>>> import spacy
>>> nlp = spacy.load("en_core_web_lg")
>>> from prodigy.models.ner import EntityRecognizer
>>> r = EntityRecognizer(nlp, label=["PERSON", "GPE"])
>>> text = "Henry ford was born in Michigan."
>>> list(r([{"text":text, "_input_hash":hash(text)}]))
[(0.7973351944856613,
  {'_input_hash': 1567318883,
   '_task_hash': 2143133895,
   'meta': {'score': 0.7973351944856613},
   'spans': [{'end': 31,
              'input_hash': -6212844767141395898,
              'label': 'GPE',
              'rank': 0,
              'score': 0.7973351944856613,
              'source': 'core_web_lg',
              'start': 23,
              'text': 'Michigan'}],
   'text': 'Henry ford was born in Michigan.'}),
 (0.7223527741995048,
  {'_input_hash': 1567318883,
   '_task_hash': -1467116620,
   'meta': {'score': 0.7223527741995048},
   'spans': [{'end': 5,
              'input_hash': -6212844767141395898,
              'label': 'PERSON',
              'rank': 0,
              'score': 0.7223527741995048,
              'source': 'core_web_lg',
              'start': 0,
              'text': 'Henry'}],
   'text': 'Henry ford was born in Michigan.'}),
 (0.005112707458942434,
  {'_input_hash': 1567318883,
   '_task_hash': -2128614395,
   'meta': {'score': 0.005112707458942434},
   'spans': [{'end': 5,
              'input_hash': -6212844767141395898,
              'label': 'GPE',
              'rank': 4,
              'score': 0.005112707458942434,
              'source': 'core_web_lg',
              'start': 0,
              'text': 'Henry'}],
   'text': 'Henry ford was born in Michigan.'}),
 (0.2479421936178019,
  {'_input_hash': 1567318883,
   '_task_hash': -304378400,
   'meta': {'score': 0.2479421936178019},
   'spans': [{'end': 10,
              'input_hash': -6212844767141395898,
              'label': 'PERSON',
              'rank': 1,
              'score': 0.2479421936178019,
              'source': 'core_web_lg',
              'start': 0,
              'text': 'Henry ford'}],
   'text': 'Henry ford was born in Michigan.'})]
Calling the EntityRecognizer instance fails if _input_hash is not present in each input task. I assume _input_hash is a way for Prodigy to avoid doing redundant work on identical text. Is this correct? If so, wouldn’t it be better to have hash(text) be the default value when the key is missing?
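To make the suggestion concrete, here is a minimal sketch of the kind of default I have in mind. The helper name with_input_hash is hypothetical and not part of Prodigy. It uses zlib.crc32 rather than the built-in hash(), since hash() on strings is salted per interpreter run (unless PYTHONHASHSEED is fixed) and so would not identify the same text across sessions.

```python
import zlib


def with_input_hash(task):
    """Return a copy of task that has an _input_hash, adding one if missing.

    Hypothetical helper for illustration; not part of Prodigy's API.
    """
    task = dict(task)  # copy so the caller's dict is not mutated
    if "_input_hash" not in task:
        # crc32 gives the same value for the same text in every process,
        # unlike the built-in hash() on str.
        task["_input_hash"] = zlib.crc32(task["text"].encode("utf8"))
    return task


text = "Henry ford was born in Michigan."
stream = [with_input_hash({"text": text})]
```

The resulting stream could then be passed to the recognizer in place of the hand-built list above.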
Also, the reason I’m using EntityRecognizer is so that I can get entity predictions along with confidence scores. (I’m using these to draw a ROC curve of f-scores.) If I can help it, I’d prefer the package I’m writing to depend only on spacy and not on prodigy. Is there a way to get named entity confidence scores directly from the spacy API?
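For context, a rough sketch of how scored output like the above could be thresholded to get one set of predicted spans per cutoff on such a curve. The helper spans_above is hypothetical, and the data is a trimmed copy of the output above with scores rounded to three places.

```python
def spans_above(scored_tasks, threshold):
    """Collect (start, end, label) for every span scoring at least threshold."""
    kept = set()
    for score, task in scored_tasks:
        if score >= threshold:
            for span in task["spans"]:
                kept.add((span["start"], span["end"], span["label"]))
    return kept


# Trimmed copy of the recognizer output shown above (scores rounded).
scored = [
    (0.797, {"spans": [{"start": 23, "end": 31, "label": "GPE"}]}),
    (0.722, {"spans": [{"start": 0, "end": 5, "label": "PERSON"}]}),
    (0.005, {"spans": [{"start": 0, "end": 5, "label": "GPE"}]}),
    (0.248, {"spans": [{"start": 0, "end": 10, "label": "PERSON"}]}),
]

# Each threshold yields a different prediction set to compare against gold spans.
for threshold in (0.1, 0.5, 0.75):
    print(threshold, sorted(spans_above(scored, threshold)))
```

Anything that yields (score, spans) pairs in this shape would work here, which is why I only need the scores and not the rest of Prodigy.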