Should _input_hash be required on the input to EntityRecognizer?

wpm · January 9, 2018, 11:48pm

After some experimentation I think I figured out how to use EntityRecognizer.

>>> import spacy
>>> nlp = spacy.load("en_core_web_lg")
>>> from prodigy.models.ner import EntityRecognizer
>>> r = EntityRecognizer(nlp, label=["PERSON", "GPE"])
>>> text = "Henry ford was born in Michigan."
>>> list(r([{"text":text, "_input_hash":hash(text)}]))
[(0.7973351944856613,
  {'_input_hash': 1567318883,
   '_task_hash': 2143133895,
   'meta': {'score': 0.7973351944856613},
   'spans': [{'end': 31,
     'input_hash': -6212844767141395898,
     'label': 'GPE',
     'rank': 0,
     'score': 0.7973351944856613,
     'source': 'core_web_lg',
     'start': 23,
     'text': 'Michigan'}],
   'text': 'Henry ford was born in Michigan.'}),
 (0.7223527741995048,
  {'_input_hash': 1567318883,
   '_task_hash': -1467116620,
   'meta': {'score': 0.7223527741995048},
   'spans': [{'end': 5,
     'input_hash': -6212844767141395898,
     'label': 'PERSON',
     'rank': 0,
     'score': 0.7223527741995048,
     'source': 'core_web_lg',
     'start': 0,
     'text': 'Henry'}],
   'text': 'Henry ford was born in Michigan.'}),
 (0.005112707458942434,
  {'_input_hash': 1567318883,
   '_task_hash': -2128614395,
   'meta': {'score': 0.005112707458942434},
   'spans': [{'end': 5,
     'input_hash': -6212844767141395898,
     'label': 'GPE',
     'rank': 4,
     'score': 0.005112707458942434,
     'source': 'core_web_lg',
     'start': 0,
     'text': 'Henry'}],
   'text': 'Henry ford was born in Michigan.'}),
 (0.2479421936178019,
  {'_input_hash': 1567318883,
   '_task_hash': -304378400,
   'meta': {'score': 0.2479421936178019},
   'spans': [{'end': 10,
     'input_hash': -6212844767141395898,
     'label': 'PERSON',
     'rank': 1,
     'score': 0.2479421936178019,
     'source': 'core_web_lg',
     'start': 0,
     'text': 'Henry ford'}],
   'text': 'Henry ford was born in Michigan.'})]

The call to EntityRecognizer's default function fails if _input_hash is not in the input. I assume _input_hash is a way for Prodigy to avoid doing redundant work on identical text. Is this correct? If so, wouldn’t it be better to have hash(text) be the default value?

Also, the reason I’m using EntityRecognizer is so that I can get entity predictions along with confidence scores. (I’m using these to draw a ROC curve of f-scores.) If I can help it, I’d prefer the package I’m writing to only have dependencies on spacy and not prodigy. Is there a way to get named entity confidence scores directly from the spacy API?

ines · January 9, 2018, 11:54pm

Prodigy's EntityRecognizer model was developed specifically for Prodigy, so it's also a little more specific in terms of the input it expects.

The input hash is generated from the input data, e.g. the text or the image, and lets Prodigy tell when two tasks have the same input (but potentially different labels or spans). Additionally, Prodigy generates a task hash based on the input hash and the features you're annotating, e.g. the spans, labels etc. This lets you distinguish between exact questions. So the hashes do need to be set on the input, but you can use the set_hashes helper to take care of the hashing for you:

from prodigy import set_hashes

examples = [set_hashes(eg) for eg in examples]

You can also set the additional keyword arguments input_keys and task_keys, both lists of the keys you want to take into account when hashing. For example, input_keys=('text', 'custom_text'). The full docs are available in the PRODIGY_README.html.
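To make the difference between the two hashes more concrete, here is a minimal sketch using only the standard library. It is not Prodigy's actual implementation: the names stable_hash and sketch_set_hashes, the default key tuples and the MD5-based digest are all assumptions made for illustration. It uses a digest rather than Python's built-in hash(), because hash() on strings varies between interpreter runs by default.

```python
import hashlib
import json


def stable_hash(value):
    """Return a signed 32-bit hash that stays the same across Python sessions."""
    payload = json.dumps(value, sort_keys=True).encode("utf8")
    return int.from_bytes(hashlib.md5(payload).digest()[:4], "big", signed=True)


def sketch_set_hashes(task, input_keys=("text",), task_keys=("spans", "label")):
    """Hypothetical stand-in for prodigy.set_hashes (NOT Prodigy's real code)."""
    task = dict(task)  # copy so the caller's dict is left untouched
    # The input hash only looks at the raw input, e.g. the text.
    input_values = [task[key] for key in input_keys if key in task]
    task["_input_hash"] = stable_hash(input_values)
    # The task hash combines the input hash with what is being annotated.
    task_values = [task[key] for key in task_keys if key in task]
    task["_task_hash"] = stable_hash([task["_input_hash"], task_values])
    return task


text = "Henry ford was born in Michigan."
gpe = sketch_set_hashes({"text": text, "spans": [{"start": 23, "end": 31, "label": "GPE"}]})
person = sketch_set_hashes({"text": text, "spans": [{"start": 0, "end": 5, "label": "PERSON"}]})

print(gpe["_input_hash"] == person["_input_hash"])  # same text, same input hash
print(gpe["_task_hash"] == person["_task_hash"])    # different spans, different question
```

In this sketch the two tasks share an input hash because the text is identical, while their task hashes differ because each asks about a different span.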

As for getting entity confidence scores directly from the spaCy API: yes, but this is a little more complex. @honnibal wrote a more detailed reply on this here:
