Should _input_hash be required on the input to EntityRecognizer?

ner

(W.P. McNeill) #1

After some experimentation I think I figured out how to use EntityRecognizer.

>>> import spacy
>>> nlp = spacy.load("en_core_web_lg")
>>> from prodigy.models.ner import EntityRecognizer
>>> r = EntityRecognizer(nlp, label=["PERSON", "GPE"])
>>> text = "Henry ford was born in Michigan."
>>> list(r([{"text":text, "_input_hash":hash(text)}]))
[(0.7973351944856613,
  {'_input_hash': 1567318883,
   '_task_hash': 2143133895,
   'meta': {'score': 0.7973351944856613},
   'spans': [{'end': 31,
     'input_hash': -6212844767141395898,
     'label': 'GPE',
     'rank': 0,
     'score': 0.7973351944856613,
     'source': 'core_web_lg',
     'start': 23,
     'text': 'Michigan'}],
   'text': 'Henry ford was born in Michigan.'}),
 (0.7223527741995048,
  {'_input_hash': 1567318883,
   '_task_hash': -1467116620,
   'meta': {'score': 0.7223527741995048},
   'spans': [{'end': 5,
     'input_hash': -6212844767141395898,
     'label': 'PERSON',
     'rank': 0,
     'score': 0.7223527741995048,
     'source': 'core_web_lg',
     'start': 0,
     'text': 'Henry'}],
   'text': 'Henry ford was born in Michigan.'}),
 (0.005112707458942434,
  {'_input_hash': 1567318883,
   '_task_hash': -2128614395,
   'meta': {'score': 0.005112707458942434},
   'spans': [{'end': 5,
     'input_hash': -6212844767141395898,
     'label': 'GPE',
     'rank': 4,
     'score': 0.005112707458942434,
     'source': 'core_web_lg',
     'start': 0,
     'text': 'Henry'}],
   'text': 'Henry ford was born in Michigan.'}),
 (0.2479421936178019,
  {'_input_hash': 1567318883,
   '_task_hash': -304378400,
   'meta': {'score': 0.2479421936178019},
   'spans': [{'end': 10,
     'input_hash': -6212844767141395898,
     'label': 'PERSON',
     'rank': 1,
     'score': 0.2479421936178019,
     'source': 'core_web_lg',
     'start': 0,
     'text': 'Henry ford'}],
   'text': 'Henry ford was born in Michigan.'})]

Calling an EntityRecognizer instance fails if _input_hash is not present in the input examples. I assume _input_hash is a way for Prodigy to avoid doing redundant work on identical text. Is this correct? If so, wouldn't it be better to have hash(text) be the default value?
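
To illustrate what I mean, a thin wrapper could fill in the default before handing examples to the model. (ensure_input_hash is my own hypothetical helper, not part of Prodigy, and plain hash() is just a stand-in for whatever hashing scheme Prodigy actually uses.)

```python
def ensure_input_hash(examples):
    """Default _input_hash to hash(text) when the caller didn't set it."""
    for eg in examples:
        # setdefault leaves an explicitly provided _input_hash untouched
        eg.setdefault("_input_hash", hash(eg["text"]))
        yield eg

stream = [{"text": "Henry ford was born in Michigan."}]
prepared = list(ensure_input_hash(stream))
assert prepared[0]["_input_hash"] == hash(prepared[0]["text"])

# An existing hash is preserved rather than overwritten
custom = list(ensure_input_hash([{"text": "x", "_input_hash": 123}]))
assert custom[0]["_input_hash"] == 123
```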

Also, the reason I’m using EntityRecognizer is so that I can get entity predictions along with confidence scores. (I’m using these to draw a ROC curve of f-scores.) If I can help it, I’d prefer the package I’m writing to only have dependencies on spacy and not prodigy. Is there a way to get named entity confidence scores directly from the spacy API?


(Ines Montani) #2

Prodigy’s EntityRecognizer model was developed specifically for Prodigy, so it’s also a little stricter about the input it expects.

The input hash is generated from the input data, e.g. the text or the image, and lets Prodigy distinguish between tasks with the same input but potentially different labels or spans. Additionally, Prodigy generates a task hash based on the input hash and the features you’re annotating, e.g. the spans or labels. This lets Prodigy recognise whether two questions are exactly the same. You can also use the set_hashes helper to take care of the hashing for you:

from prodigy import set_hashes

examples = [set_hashes(eg) for eg in examples]
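
To make the input-hash/task-hash distinction concrete, here is a rough sketch of the idea in plain Python. This is not Prodigy’s actual implementation (the real hashes are computed internally and use a different scheme); it only shows how hashing over different key subsets separates “same input” from “same question”.

```python
import hashlib
import json

def stable_hash(data: dict, keys: tuple) -> int:
    # Hash only the listed keys, so examples that agree on those
    # keys get the same hash regardless of their other fields.
    subset = {k: data[k] for k in keys if k in data}
    payload = json.dumps(subset, sort_keys=True).encode("utf8")
    return int(hashlib.md5(payload).hexdigest()[:8], 16)

eg1 = {"text": "Henry ford was born in Michigan.",
       "spans": [{"start": 23, "end": 31, "label": "GPE"}]}
eg2 = {"text": "Henry ford was born in Michigan.",
       "spans": [{"start": 0, "end": 5, "label": "PERSON"}]}

# Same "input hash": the underlying text is identical ...
assert stable_hash(eg1, ("text",)) == stable_hash(eg2, ("text",))
# ... but different "task hashes": the questions being asked differ.
assert stable_hash(eg1, ("text", "spans")) != stable_hash(eg2, ("text", "spans"))
```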

You can also set the additional keyword arguments input_keys and task_keys, both sequences of the keys you want to take into account when hashing, for example input_keys=('text', 'custom_text'). The full docs are available in the PRODIGY_README.html.

Yes, but this is a little more complex. @honnibal wrote a more detailed reply on this here: