Hi! If you're training from manually created annotations, the evaluation all happens within spaCy and doesn't depend on Prodigy. spaCy uses a very standard NER evaluation. If you're working with spaCy v2.x, you can view the code here:
For spaCy v3.x, it's here:
If you want to do a comparative evaluation, you can also just run both models over your evaluation data and calculate the accuracy however you want, as long as you compute it the same way for both models (see the sketch below).
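For instance, here's a minimal, model-agnostic sketch of an exact-match entity evaluation. The example format and the predict callables are hypothetical placeholders for however you load your evaluation data and run each model:

```python
# Minimal sketch of a consistent, model-agnostic NER evaluation.
# Assumes each example is (text, gold_entities) and that you can get
# (start_char, end_char, label) tuples out of both models -- the helper
# names below are placeholders for your own code.

def score_entities(examples, predict):
    """Exact-match precision/recall/F1 over (start, end, label) spans."""
    tp = fp = fn = 0
    for text, gold in examples:
        gold_spans = set(gold)            # {(start, end, label), ...}
        pred_spans = set(predict(text))   # same tuple format
        tp += len(gold_spans & pred_spans)
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"p": precision, "r": recall, "f": f1}

# Apply the *same* function to both models so the numbers are comparable:
# score_entities(eval_examples, predict_with_spacy)
# score_entities(eval_examples, predict_with_other_model)
```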
One thing to keep in mind here: if you're using a non-spaCy model with a tokenizer that doesn't preserve the original text, this may impact your evaluation. It probably also makes sense to train with spaCy v3 directly (you can use prodigy data-to-spacy and spacy convert to convert your annotations), so you can train a transformer-based pipeline that's more directly comparable to another model initialised with transformer weights. Otherwise, your evaluation might not be very meaningful.
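As a rough sketch of that workflow, assuming a recent Prodigy version, an NER dataset called my_ner_dataset (a placeholder name), and an English pipeline; the exact flags can vary slightly between versions:

```bash
# Export your Prodigy annotations to spaCy's binary .spacy format,
# holding out a portion for evaluation
prodigy data-to-spacy ./corpus --ner my_ner_dataset --eval-split 0.2

# Create a transformer-based NER config (--gpu selects a transformer
# pipeline) and train with spaCy v3
python -m spacy init config config.cfg --lang en --pipeline ner --optimize accuracy --gpu
python -m spacy train config.cfg --output ./output \
    --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --gpu-id 0

# Evaluate the trained pipeline on the held-out data
python -m spacy evaluate ./output/model-best ./corpus/dev.spacy
```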