hi @nnn!
Thanks for your follow up!
Good point! What recipe did you use to create those annotations, specifically the "COUNTRY" and "JOB_TITLE" spans?
I suspect you used a `correct` recipe (either `ner.correct` or `spans.correct`). The highlighted examples (with the extra keys for `text`, `source`, and `input_hash`) are the normal behavior for model-suggested annotations from a `correct` recipe.
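For illustration, here's roughly what the two kinds of spans can look like in the saved data. The values below are made up for this sketch; only the key names `text`, `source`, and `input_hash` come from your example:

```python
# Hypothetical spans from one annotated example (values invented for illustration).

# Model-suggested span from a `correct` recipe: carries the extra keys.
model_suggested_span = {
    "start": 10,
    "end": 17,
    "label": "ORG",
    "text": "Acme AG",           # extra key: the span's surface text
    "source": "en_core_web_lg",  # extra key: the model that suggested it
    "input_hash": 123456789,     # extra key: hash of the input it belongs to
}

# Manually highlighted span for a custom label: only the basic keys.
manual_span = {
    "start": 42,
    "end": 49,
    "label": "COUNTRY",
}
```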
Perhaps the other spans (the ones without the extra keys) were created with the same recipe but are "manual" annotations (i.e., you highlighted them yourself) and weren't model suggested, since `en_core_web_lg` doesn't have the custom entities (`"COUNTRY"` and `"JOB_TITLE"`).
Said differently, the extra keys (`text`, `source`, and `input_hash`) are created when a span is annotated via model-assisted correction.
If you had a trained NER model that covered all of the entity types and the spans were model suggestions, then you would have all the keys/data.
One caveat: it's possible to be missing these extra fields even for entity types that were in your model (e.g., an `ORG`), because there could be entities that you created manually and that weren't model suggestions.
These extra keys aren't required for training, so for training purposes their presence is arbitrary.
However, the data does identify the source of the suggestion (e.g., the model). Having this info also distinguishes a span as model suggested ("gold" annotations), because it reflects added confidence that the model would select this label (i.e., the data and model are consistent). So in that way, having this data tells you those annotations carry more weight than purely manual ones.
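If it helps, a quick way to see that split in an exported dataset is to group spans by whether they carry the extra keys. This is just a sketch: it assumes a JSONL export (e.g., from `prodigy db-out`) and uses the presence of the `source` key as the marker:

```python
import json

# Rough sketch: count model-suggested vs. manual spans in a JSONL export,
# e.g. `prodigy db-out your_dataset > annotations.jsonl`.
# Assumes the "source" key marks a model-suggested span.
model_suggested = 0
manual = 0
with open("annotations.jsonl", encoding="utf8") as f:
    for line in f:
        example = json.loads(line)
        for span in example.get("spans", []):
            if "source" in span:
                model_suggested += 1
            else:
                manual += 1

print(f"Model-suggested spans: {model_suggested}")
print(f"Manual spans: {manual}")
```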
Just curious, have you trained a model by updating the original (e.g., `--base-model en_core_web_lg` to update) with both your custom entities and fine-tuned entities from `en_core_web_lg`?
If so, since you're mixing old and new entity types, make sure to account for potential catastrophic forgetting (one common mitigation is sketched below).
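One common mitigation is to mix in "revision" examples labeled by the original model so the old entity types stay represented during training. A rough sketch, where `raw_texts` is a placeholder for your own representative, unannotated sentences:

```python
import spacy

# Pseudo-rehearsal sketch: have the original pipeline label raw text with the
# entity types you want to keep (ORG, PERSON, GPE, ...), then mix these examples
# back in with your new COUNTRY/JOB_TITLE annotations before training.
nlp = spacy.load("en_core_web_lg")
raw_texts = [
    "Apple hired Tim Cook as CEO.",   # placeholder examples
    "She moved to Berlin in 2019.",
]

revision_examples = []
for doc in nlp.pipe(raw_texts):
    revision_examples.append({
        "text": doc.text,
        "spans": [
            {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
            for ent in doc.ents
        ],
    })

# These dicts can then be imported into a dataset (e.g. with `prodigy db-in`)
# and trained on alongside your corrected annotations.
```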
I hope this answers your question. Let us know if you have any other questions!