Annotation JSON

Hi,

I'm getting some weird JSON back. For some of the spans there are additional keys (text, source and _input_hash), whereas for others these do not appear.

Regarding the _input_hash, I'm not sure what value I could gain from it, and why it's exactly the same _input_hash as the one under the meta key.

Regarding the _task_hash, I see it can be repeated. I'm also not so sure how to gain value from it.

Regarding the timestamp, how exactly is it interpreted? What time units does it show?

I'm attaching a snippet of an annotation JSON.

Thank you!

hi @NNN!

Have you seen the Prodigy documentation on _input_hash and _task_hash?

These hashes are used for deduplication, i.e. to identify whether two examples are entirely different, different questions about the same input (e.g. the same text), or the same question about the same input.

Here are details about each hash:

_input_hash (int): Hash representing the input that annotations are collected on, e.g. the text, image or html. Examples with the same text will receive the same input hash.

_task_hash (int): Hash representing the "question" about the input, i.e. the label, spans or options. Examples with the same text but different label suggestions or options will receive the same input hash, but different task hashes.

Prodigy uses these behind the scenes to handle deduplication, so you can usually ignore them (although they can be helpful for tracking examples down the road).
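For example, once you've exported your annotations (e.g. with prodigy db-out) to a JSONL file, you can use the hashes yourself to spot duplicates. A minimal sketch (the filename is just a placeholder):

import json
from collections import Counter

# Load an exported JSONL dataset (filename is hypothetical).
with open("annotations.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]

# Same _input_hash = same underlying input (e.g. text);
# same _task_hash = same question about that input.
input_counts = Counter(eg["_input_hash"] for eg in examples)
task_counts = Counter(eg["_task_hash"] for eg in examples)

print(sum(1 for n in input_counts.values() if n > 1), "inputs annotated more than once")
print(sum(1 for n in task_counts.values() if n > 1), "identical tasks annotated more than once")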

The timestamp is a Unix timestamp, i.e. seconds since the epoch. Python's standard library can convert it to a human-readable format.
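For example (the ts value here is just a placeholder; plug in the _timestamp from your annotation):

from datetime import datetime, timezone

# Prodigy's _timestamp is a Unix timestamp in seconds since 1970-01-01 UTC.
ts = 1672531200  # example value; use the _timestamp from your own annotation
print(datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())
# 2023-01-01T00:00:00+00:00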

Let me know if this answers your questions or if you have any further questions!


Hi Ryan,

Thanks so much for your answer. I apologise for my late response.

1.) Hashes - understood.

2.) Unclear - I still don't understand why for some of the spans there are additional keys (text, source and _input_hash) whereas for others these do not appear (see the attached JSON snippet in my original question). There we can see that the spans at indices 1 and 2 have additional key-values (text, source, and _input_hash) while all other spans only have start, end, token_start, token_end and label.

Why is it that some spans have additional information? Is that arbitrary?

Thanks!!

hi @nnn!

Thanks for your follow up!

Good point! What recipe did you use to create those annotations, specifically the "COUNTRY" and "JOB_TITLE" spans?

I suspect you used a correct recipe (either ner.correct or spans.correct). The highlighted spans (with the extra keys for text, source, and _input_hash) show the normal behavior for model-suggested annotations from a correct recipe.

The other spans (the ones without the extra keys) were probably created with the same recipe, but are "manual" annotations (i.e., spans you highlighted yourself) that weren't model suggested, since en_core_web_lg doesn't have your custom entities ("COUNTRY" and "JOB_TITLE").

Said differently, the extra keys (text, source, and _input_hash) are added when a span comes from model-assisted correction.

If you had a trained NER model that covered all of your entity types, and every span came from a model suggestion, then every span would have all of these keys/data.

One caveat: spans can still be missing these extra fields even for entity types that are in your model (e.g., ORG), because you may have created those entities manually rather than accepting model suggestions.

These extra keys aren't required for training, so for training purposes the difference doesn't matter.

However, the extra keys do identify the source of the suggestion (e.g., the model). They also distinguish a span as model suggested ("gold" annotations), which reflects added confidence that the model would select this label (i.e., the data and model are consistent). So in that way, these spans can carry more weight than purely manual annotations.
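If that distinction is ever useful downstream, you can separate the two kinds of spans in your exported data by checking for those extra keys. A minimal sketch (assumes eg is one exported example loaded as a dict):

# Split an example's spans into model-suggested vs. manually highlighted,
# based on the extra "source" key discussed above.
def split_spans(eg):
    model_spans = [s for s in eg.get("spans", []) if "source" in s]
    manual_spans = [s for s in eg.get("spans", []) if "source" not in s]
    return model_spans, manual_spans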

Just curious, have you trained a model by updating the original (e.g., using --base-model en_core_web_lg) with both your custom entities and the fine-tuned entities from en_core_web_lg?

If so, since you're mixing old and new entity types, make sure to account for potential catastrophic forgetting.

I hope this answers your question and let us know if you have other questions!


Hi Ryan,

Thanks for your reply.

I used the ner.correct recipe and added custom labels.
Thanks so much for the clear explanation - understood.

We're using en_core_web_lg for the predictions in Prodigy, but later converting the output to BIO format to train a model with Flair.

One strange thing though: we're getting Prodigy predictions for our custom entities. Is that normal? Could en_core_web_lg predict entity types it wasn't trained on?

Many thanks for everything! :pray:t3:

Sorry, I don't fully understand.

Is it that the en_core_web_lg ner model predicts your custom ner labels?

That shouldn't happen. You can view the labels in your ner component by running:

import spacy

nlp = spacy.load("en_core_web_lg")
# Print the entity labels the ner component was trained to predict.
print(nlp.get_pipe("ner").labels)
# ('CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART')

These are the only labels this component can predict.

Let me know if I misunderstood your problem.

Hi Ryan,

Yes, or at least so it seems :slight_smile: (snippet attached)

The only built-in labels we're using from en_core_web_lg are PERSON, ORG and PRODUCT; the rest are custom.

Thank you

Just to double-check, can you run this:

import spacy

nlp = spacy.load("en_core_web_lg")
text = "[Provide example text from similar behavior]"
doc = nlp(text)
# Serve a visualization of the predicted entities in your browser.
spacy.displacy.serve(doc, style="ent")

If you're still seeing your custom entities, could you try disabling the other components so that only ner runs?
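Something like this should work (a sketch; tok2vec is kept in case ner listens to it, and you can check nlp.pipe_names for the exact component names in your installed version):

import spacy

nlp = spacy.load("en_core_web_lg")
# Disable everything except tok2vec and ner for this check.
nlp.select_pipes(enable=["tok2vec", "ner"])

text = "[Provide example text from similar behavior]"
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])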