Do you have any details on the evaluation script used for NER models in Prodigy? Is it a standard confusion matrix, or do you have a more elaborate script?
Is it possible to use a custom evaluation script? I need it in order to compare results obtained using Prodigy with the models reported in the scientific articles I'm following.
It depends on whether you’re evaluating based on binary annotations, or based on the fully-specified manual annotations.
If you’re evaluating the binary annotations, the accuracy score is based on how many of the accepted entities the model got right, and how many of its predicted entities are inconsistent with the annotations (either because they cross a correct entity, or because they match a rejected entity). There will also be some predicted entities that can’t be evaluated either way.
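To make that concrete, here's a rough sketch of that kind of binary scoring logic. The function name, the span representation as `(start, end, label)` character-offset tuples, and the way the counts are combined into a single accuracy figure are my own illustration, not Prodigy's internal code:

```python
def binary_ner_score(predicted, accepted, rejected):
    """Score predicted spans against binary accept/reject annotations.

    All arguments are sets of (start, end, label) character-offset tuples.
    Predictions that neither match nor conflict with any annotation are
    left out of the score, since they can't be evaluated.
    """
    def crosses(span, others):
        # Overlaps an annotated span without matching it exactly
        return any(
            span != other and span[0] < other[1] and other[0] < span[1]
            for other in others
        )

    correct = len(predicted & accepted)
    wrong = sum(
        1 for span in predicted
        if span not in accepted
        and (span in rejected or crosses(span, accepted))
    )
    missed = len(accepted - predicted)
    # One possible way to combine the counts into a single figure;
    # the exact formula Prodigy uses internally may differ.
    total = correct + wrong + missed
    accuracy = correct / total if total else 0.0
    return {"correct": correct, "wrong": wrong,
            "missed": missed, "accuracy": accuracy}
```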
If you’re evaluating the manual annotations, then yes, the evaluation follows the standard precision, recall and F-measure metrics. Of course, it’ll probably be difficult to compare the scores directly against the scientific literature, as you’ll be using a different dataset. You can find accuracy figures for spaCy’s NER (which is what we use in Prodigy), evaluated with a standard methodology, here: https://spacy.io/usage/facts-figures#ner-accuracy-ontonotes5
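If you do want to plug in your own evaluation, a common approach is to export the manual annotations (e.g. with `prodigy db-out`), run your trained model over the same texts with spaCy, and compute exact-match precision, recall and F1 over the entity spans yourself. Here's a minimal sketch, assuming a Prodigy-style JSONL export with `"text"` and `"spans"` fields; the model name and file path are hypothetical placeholders:

```python
import json

import spacy


def load_gold(path):
    """Load texts and gold spans from a Prodigy-style JSONL export."""
    texts, gold = [], []
    with open(path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            texts.append(eg["text"])
            gold.append(
                {(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])}
            )
    return texts, gold


def evaluate(nlp, texts, gold):
    """Micro-averaged precision, recall and F1 over exact-match entity spans."""
    tp = fp = fn = 0
    for doc, gold_spans in zip(nlp.pipe(texts), gold):
        pred = {(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents}
        tp += len(pred & gold_spans)
        fp += len(pred - gold_spans)
        fn += len(gold_spans - pred)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": f}


if __name__ == "__main__":
    nlp = spacy.load("my_trained_model")            # hypothetical model path
    texts, gold = load_gold("eval_annotations.jsonl")  # hypothetical export path
    print(evaluate(nlp, texts, gold))
```

This computes the same exact-match, micro-averaged span F-score that most NER papers report (CoNLL-style evaluation), so at least the metric itself will be comparable, even if the dataset isn't.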