I'm wondering how I should annotate correctly when using the ner.correct recipe. We're using the pre-trained en_core_web_trf model with some known NER labels (such as PERSON, PRODUCT, ORG) while also adding a few custom entities of our own (e.g., ADDRESS).
Prodigy's guidelines suggest you should reject partial NER classifications and be strict with it. So, if my sentence was "The new iPhone X is expensive", but only "iPhone" was marked by the model as PRODUCT, I should be strict and hit reject. I'm wondering, is it also possible to simply change the marking in the UI such that it includes "X" inside and then hit accept? Would it be the same?
How about sentences where a span is given the wrong entity label? For example, suppose "Siri" was labeled as PERSON instead of PRODUCT. Should I reject, or should I remove the PERSON label, mark it as PRODUCT in the UI, and then accept? What would be the difference?
Additionally, how should I treat sentences with more than one named entity where some are correct and others are not? Should I accept/reject or change it myself in the UI?
For all cases you've mentioned, it is advisable to correct the mistake first and hit ACCEPT. The only time we should hit REJECT is when we cannot verify if the entities are correct or not (maybe the tokenization is weird, maybe the data is corrupted, etc.).
After correcting your samples, you can train a model using prodigy train.
If you hit REJECT on a sample that can still be corrected, then you are losing valuable data for model training.
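To make the "correct first, then accept" advice concrete, here is a rough sketch of what the saved annotation could look like for the iPhone example. It assumes the usual shape of a Prodigy NER task (a "text", a list of "spans" with character offsets and a "label", and an "answer"); real records also carry token information and hashes, which are left out here. The helper make_span is hypothetical and only exists for this illustration.

```python
# Simplified sketch of an NER task before and after manual correction.
# make_span is a hypothetical helper, not part of Prodigy.

def make_span(text, phrase, label):
    """Build a span dict with character offsets for the first match of phrase."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}

text = "The new iPhone X is expensive"

# What the model suggested: only "iPhone" is highlighted.
suggested = {"text": text, "spans": [make_span(text, "iPhone", "PRODUCT")]}

# What you save after extending the span in the UI and hitting accept.
corrected = {
    "text": text,
    "spans": [make_span(text, "iPhone X", "PRODUCT")],
    "answer": "accept",
}

span = corrected["spans"][0]
print(text[span["start"]:span["end"]], span["label"])
```

The accepted record carries the full, corrected span, so it can be used as a gold-standard example. Rejecting the same sentence would only record that something was wrong with it, without saying what the right answer is.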
Yes, that's correct – in this case, you would remove the label ORG and accept the example. The annotation you're creating here will then tell the model during training that this example contains no entities, which is what you want.
(Unless this example contains broken markup or is an example that you don't want to include because it's not representative. You can then hit reject or ignore.)
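As a rough illustration of why an accepted example with no spans is still useful, here is a small sketch. It assumes the usual task shape ("text", "spans", "answer") and that only accepted examples are treated as gold-standard data for manually corrected annotations. The sentences and the filtering code are made up for this illustration and are not Prodigy's actual implementation.

```python
# Hypothetical annotations, as they might look after a correction session.
annotations = [
    # The wrong ORG label was removed, then the example was accepted:
    # empty spans means "this text contains no entities".
    {"text": "It was a quiet morning.", "spans": [], "answer": "accept"},
    # Corrected and accepted example with one entity.
    {"text": "Siri answered.", "spans": [{"start": 0, "end": 4, "label": "PRODUCT"}], "answer": "accept"},
    # Broken markup, so it was rejected and carries no usable labels.
    {"text": "<div>Sir", "spans": [], "answer": "reject"},
]

# Keep only accepted examples as training data (assumption, see above).
usable = [eg for eg in annotations if eg["answer"] == "accept"]

# Accepted examples without spans act as "no entities here" examples.
no_entity_examples = [eg for eg in usable if not eg["spans"]]

print(len(usable), "usable,", len(no_entity_examples), "without entities")
```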