These hashes are used for deduplication: they let Prodigy tell whether two examples are entirely different, are different questions about the same input (e.g. the same text), or are the same question about the same input.
Here are details about each hash:
- _input_hash (int): Hash representing the input that annotations are collected on, e.g. the text, image or html. Examples with the same text will receive the same input hash.
- _task_hash (int): Hash representing the "question" about the input, i.e. the label, spans or options. Examples with the same text but different label suggestions or options will receive the same input hash, but different task hashes.
Prodigy uses these behind the scenes to handle deduplication, so you can usually ignore them (though they can be helpful for debugging down the road).
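For illustration, here's a minimal sketch of how the two hashes relate (this assumes Prodigy is installed; set_hashes is Prodigy's helper for assigning hashes to a task dict, and the example texts/labels are made up):

# Minimal sketch: same input -> same _input_hash, different "question" -> different _task_hash
from prodigy import set_hashes

eg1 = set_hashes({"text": "Berlin is nice.", "label": "TRAVEL"})
eg2 = set_hashes({"text": "Berlin is nice.", "label": "POLITICS"})

assert eg1["_input_hash"] == eg2["_input_hash"]  # identical text
assert eg1["_task_hash"] != eg2["_task_hash"]    # different question about it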
Thanks so much for your answer. I apologise for my late response.
1.) Hashes - understood.
2.) Unclear - I still don't understand why some of the spans have additional keys (text, source and _input_hash) whereas for others these do not appear (see the attached JSON snippet in my original question). There we can see that the spans at indices 1 and 2 have the additional key-value pairs (text, source, and _input_hash) while all other spans only have start, end, token_start, token_end and label.
Why is it that some spans have additional information? Is that arbitrary?
Good point! What recipe did you use to create those annotations, specifically the "COUNTRY" and "JOB_TITLE" spans?
I suspect you used a correct recipe (either ner.correct or spans.correct). The highlighted examples (with the extra keys for text, source, and _input_hash) show the normal behavior for model-suggested annotations from a correct recipe.
The other spans (the ones without the extra keys) were likely created with the same recipe, but are "manual" annotations (i.e., ones you highlighted yourself) rather than model suggestions, since en_core_web_lg doesn't have your custom entities ("COUNTRY" and "JOB_TITLE").
Said differently, the extra keys (text, source, and _input_hash) are added when a span comes from model-assisted correction.
If you had a trained NER model that covered all of your entity types, so that every span was a model suggestion, then every span would have all of these keys.
One caveat: spans can lack these extra fields even for entity types that are in your model (e.g., ORG), because you may have created those entities manually rather than accepting model suggestions.
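To make that concrete, here's roughly what the two kinds of spans look like side by side (the offsets, hash, and source values below are hypothetical):

# Span produced as a model suggestion by a correct recipe (extra keys present)
model_suggested_span = {
    "start": 0, "end": 5, "token_start": 0, "token_end": 0,
    "label": "ORG",
    "text": "Apple",             # the highlighted text
    "source": "en_core_web_lg",  # where the suggestion came from
    "_input_hash": 123456789,
}

# Span highlighted manually for a custom label (only the core keys)
manual_span = {
    "start": 10, "end": 17, "token_start": 2, "token_end": 2,
    "label": "COUNTRY",
}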
These extra keys aren't required for training, so for training purposes the difference is arbitrary.
However, the data does identify the source of the suggestion (e.g., the model). Having this info also marks a span as model-suggested ("gold" annotation), which reflects added confidence that the model would select this label (i.e., the data and model are consistent). In that sense, these keys tell you the annotation carries more weight than a purely manual one.
Just curious: have you trained a model by updating the original (e.g., using --base-model en_core_web_lg) so it has both your custom entities and the fine-tuned entities from en_core_web_lg?
If so, since you're mixing old and new entity types, make sure to account for potential catastrophic forgetting:
I hope this answers your question. Let us know if you have any others!
Just so I can double-check, can you run this:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_lg")
text = "[Provide example text from similar behavior]"
doc = nlp(text)

# Visualize which entities the stock pipeline predicts for this text
displacy.serve(doc, style="ent")
If you're still seeing your custom entities, could you try disabling the other pipeline components so that only ner runs?
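For example, something like this (a sketch; the disabled component names assume the standard en_core_web_lg v3 pipeline, so check nlp.pipe_names on your install):

import spacy

# Keep only what NER needs; the names below are assumptions based on the
# default en_core_web_lg pipeline.
nlp = spacy.load(
    "en_core_web_lg",
    disable=["tagger", "parser", "attribute_ruler", "lemmatizer"],
)
print(nlp.pipe_names)  # expect something like ['tok2vec', 'ner']

doc = nlp("[Provide example text from similar behavior]")
print([(ent.text, ent.label_) for ent in doc.ents])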