Annotation JSON

Hi,

I'm getting a weird JSON back. For some of the spans there are additional keys (text, source and _input_hash) whereas for others these do not appear.

Regarding the _input_hash, I'm not sure what value I could gain from it and why it is exactly the same _input_hash as the one under the meta key.

Regarding the _task_hash, I see it can be repeated. I'm also not so sure how to gain value from it.

Regarding the timestamp, how exactly is it interpreted? What time units does it show?

I'm attaching a snippet of an annotation JSON.

Thank you!

hi @NNN!

Have you seen the Prodigy documentation on _input_hash and _task_hash?

These hashes are used for deduplication: they identify whether two examples are entirely different, different questions about the same input (e.g., the same text), or the same question about the same input.

Here are details about each hash:

| Hash | Type | Description |
|---|---|---|
| _input_hash | int | Hash representing the input that annotations are collected on, e.g. the text, image or HTML. Examples with the same text will receive the same input hash. |
| _task_hash | int | Hash representing the "question" about the input, i.e. the label, spans or options. Examples with the same text but different label suggestions or options will receive the same input hash, but different task hashes. |

Prodigy uses these behind the scenes to handle deduplication, so you can generally ignore them (though they can be helpful for tracking examples down the road).
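For example, if you export your annotations with prodigy db-out, a minimal sketch like this could deduplicate them by _task_hash (the file name "annotations.jsonl" is just a placeholder for your export):

```python
import json

# Minimal sketch: deduplicate exported annotations by _task_hash.
# "annotations.jsonl" is a placeholder path for a `prodigy db-out` export.
seen_tasks = set()
unique_examples = []

with open("annotations.jsonl", encoding="utf8") as f:
    for line in f:
        example = json.loads(line)
        task_hash = example["_task_hash"]
        if task_hash not in seen_tasks:  # skip repeated questions
            seen_tasks.add(task_hash)
            unique_examples.append(example)

print(f"Kept {len(unique_examples)} unique tasks")
```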

The timestamp is a Unix timestamp (seconds since the epoch). Python's standard library can convert it to a human-readable format.
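For example, a minimal sketch using Python's datetime module (the timestamp value below is arbitrary, not from your data):

```python
from datetime import datetime, timezone

# Convert a Unix timestamp (seconds since epoch) to a readable datetime.
ts = 1672531200  # arbitrary example value
print(datetime.fromtimestamp(ts, tz=timezone.utc))  # 2023-01-01 00:00:00+00:00
```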

Let me know if this answers your questions or if you have any further questions!

Hi Ryan,

Thanks so much for your answer. I apologise for my late response.

1.) Hashes - understood.

2.) Unclear - I still don't understand why some of the spans have additional keys (text, source and _input_hash) whereas others do not (see the attached JSON snippet in my original question). There we can see that the spans at indices 1 and 2 have additional key-values (text, source, and _input_hash) while all other spans only have start, end, token_start, token_end and label.

Why is it that some spans have additional information? Is that arbitrary?

Thanks!!

hi @nnn!

Thanks for your follow up!

Good point! What recipe did you use to create those annotations, specifically the "COUNTRY" and "JOB_TITLE" spans?

I suspect you used a correct recipe (either ner.correct or spans.correct). The highlighted examples (with the extra keys for text, source, and _input_hash) reflect the normal behavior for model-suggested annotations from a correct recipe.

Perhaps the other spans (the ones without the extra keys) were created with the same recipe, but are "manual" annotations (i.e., you highlighted them yourself) and weren't model suggested, since en_core_web_lg doesn't have the custom entities ("COUNTRY" and "JOB_TITLE").

Said differently, the extra keys (text, source, and _input_hash) are added when a span was suggested by the model during model-assisted correction.

If you had a trained NER model that covered all your entity types and every span came from a model suggestion, then every span would have all the keys/data.
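To illustrate the difference (the values below are made up, not taken from your file), a model-suggested span and a manually drawn span might look roughly like this:

```python
# Hypothetical example values, just to illustrate the shape of the two span types.

# Model-suggested span (from a correct recipe): carries the extra keys.
model_span = {
    "start": 10, "end": 17, "token_start": 2, "token_end": 3,
    "label": "ORG",
    "text": "Acme Co",            # the highlighted text
    "source": "en_core_web_lg",   # which model suggested it
    "_input_hash": 123456789,     # hash of the input it belongs to
}

# Manually highlighted span: only the positional keys and the label.
manual_span = {
    "start": 30, "end": 37, "token_start": 6, "token_end": 7,
    "label": "COUNTRY",
}
```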

One caveat: it is possible to not have these extra fields even for entity types that were in your model (e.g., an ORG), because some of those entities may have been created manually rather than suggested by the model.

These extra keys aren't required for training, so for training purposes the difference is effectively arbitrary.

However, the data does identify the source of the suggestion (e.g., the model). Having this info also distinguishes a span as model suggested ("gold" annotations), because it reflects added confidence that the model would select this label (i.e., the data and model are consistent). So in that way, having this data can tell you these annotations carry more weight than purely manual annotations.
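If you ever want to use that distinction, a minimal sketch (assuming, as above, that the presence of the source key marks a model-suggested span; "annotations.jsonl" is a placeholder for your export) could separate the two groups:

```python
import json

# Minimal sketch: count model-suggested vs. manually drawn spans in an export.
model_suggested, manual = 0, 0

with open("annotations.jsonl", encoding="utf8") as f:
    for line in f:
        example = json.loads(line)
        for span in example.get("spans", []):
            if "source" in span:   # extra keys -> model-suggested span
                model_suggested += 1
            else:                  # only start/end/label -> manual span
                manual += 1

print(f"model-suggested: {model_suggested}, manual: {manual}")
```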

Just curious: have you trained a model by updating the original (e.g., passing --base-model en_core_web_lg) with both your custom entities and the fine-tuned entities from en_core_web_lg?

If so, since you're mixing old and new entity types, make sure to account for potential catastrophic forgetting:

I hope this answers your question and let us know if you have other questions!