Gold annotation, Test/Eval set for already trained model

Hi Guys,

In step one, I trained a model from scratch on gold annotations (almost 6,500 of them). Training went fine, and I now have a final model based on my annotations. In step two, I took a new piece of text, created gold annotations for it and saved them. I want to test the Prodigy model trained in step one against these new gold annotations to see how accurate the model is. Do you have a way or recipe for doing this?


If you want to use the new gold standard evaluation set to evaluate during training, you can pass it in as the --eval-id argument to ner.batch-train.

If you only want to evaluate an already trained model, you could use a custom recipe like this:

The version above takes the name of the dataset containing your evaluation examples and the model you trained on your training examples, then outputs the results.
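For reference, here is a minimal sketch of that kind of evaluation. It is not the recipe itself: the names `span_key`, `evaluate` and `predict` are made up for illustration, and it assumes each evaluation example is a dict with "text" and "spans", where every span has "start", "end" and "label". Only exact matches (same offsets and same label) count as correct.

```python
def span_key(span):
    """Reduce a span dict to a hashable (start, end, label) tuple."""
    return (span["start"], span["end"], span["label"])


def evaluate(examples, predict):
    """Score predicted entity spans against gold spans with exact matching.

    examples: iterable of dicts with "text" and "spans" (assumed layout).
    predict:  hypothetical callable, text -> list of span dicts.
    """
    tp = fp = fn = 0
    for eg in examples:
        gold = {span_key(s) for s in eg.get("spans", [])}
        guess = {span_key(s) for s in predict(eg["text"])}
        tp += len(gold & guess)   # predicted and in the gold set
        fp += len(guess - gold)   # predicted but not in the gold set
        fn += len(gold - guess)   # in the gold set but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f_score": f_score}
```

With a spaCy model, `predict` could be a small wrapper that runs `nlp(text)` and turns each entity in `doc.ents` into a dict from `ent.start_char`, `ent.end_char` and `ent.label_`. The examples could be loaded with `connect().get_dataset(name)` from `prodigy.components.db`, probably keeping only those whose answer is "accept".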

Thanks Ines for your reply. I passed the --eval-id argument to ner.batch-train and it worked during training. I will try the custom recipe today.

I have a question regarding --eval-id: is there any way to print out a list of the missed and wrong entities when we have an already evaluated test set? For example, a simple CSV with the correct matches in the first column, the wrong predictions in the second, and the missed entities in the last. Do we need to write our own recipe for that, or could we modify ner.batch-train to print out the CSV for us?


We had a built-in recipe that did something similar to that during an early beta, but we had too many NER recipes so we consolidated things to avoid confusion.

If you want to get a quick readable summary, you might find the prodigy.components.printers.pretty_print_ner function useful. If you mark the spans with an answer key, whose value should be one of "accept", "reject" or "ignore", the spans will be coloured by correctness. I would set the correct predictions to "accept" and the false predictions to "reject". You could also list the false negatives at the end of the text (these might overlap with the predicted annotations, so you can’t easily show them in-line).
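The marking step could look roughly like the sketch below. The helper name `mark_predictions` is hypothetical, and the span layout ("start", "end", "label") is an assumption. The exact input that pretty_print_ner expects is not shown here, so check it before passing the result along.

```python
def mark_predictions(gold_spans, predicted_spans):
    """Tag each predicted span as correct or not, and collect missed gold spans.

    Returns (marked, missed): marked is a copy of the predicted spans with an
    added "answer" key, missed is the list of gold spans the model did not find.
    """
    def key(span):
        return (span["start"], span["end"], span["label"])

    gold_keys = {key(s) for s in gold_spans}
    predicted_keys = {key(s) for s in predicted_spans}
    # Correct predictions get "accept", false predictions get "reject".
    marked = [
        dict(s, answer="accept" if key(s) in gold_keys else "reject")
        for s in predicted_spans
    ]
    # False negatives are kept separately, since they may overlap with
    # predicted spans and cannot always be shown in-line.
    missed = [s for s in gold_spans if key(s) not in predicted_keys]
    return marked, missed
```

The original span dicts are copied rather than modified, so the gold data stays untouched if you run the comparison more than once.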

The loop to run the model and compare against the gold-standard should be pretty simple. You can have a look at my sample code for calculating precision/recall/F-score evaluation figures in this thread for reference: Recall and Precision (TN, TP, FN, FP)
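As a rough illustration of such a loop, the sketch below writes the correct, wrong and missed entities to CSV. Everything in it is an assumption rather than a built-in: `write_error_report` and `predict` are hypothetical names, and the examples are assumed to be dicts with "text" and "spans". It writes one row per entity with a status column instead of three side-by-side columns, since the three lists usually have different lengths.

```python
import csv
import io


def write_error_report(examples, predict, out):
    """Write one CSV row per entity span: status, label, span text, full text.

    examples: iterable of dicts with "text" and "spans" (assumed layout).
    predict:  hypothetical callable, text -> list of span dicts.
    out:      any writable text file object.
    """
    writer = csv.writer(out)
    writer.writerow(["status", "label", "span_text", "text"])
    for eg in examples:
        text = eg["text"]
        gold = {(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])}
        guess = {(s["start"], s["end"], s["label"]) for s in predict(text)}
        rows = (
            [("correct", k) for k in sorted(gold & guess)]
            + [("wrong", k) for k in sorted(guess - gold)]
            + [("missed", k) for k in sorted(gold - guess)]
        )
        for status, (start, end, label) in rows:
            writer.writerow([status, label, text[start:end], text])


# Usage: write to an in-memory buffer; a real file opened with
# open(path, "w", newline="") works the same way.
buffer = io.StringIO()
write_error_report([], lambda text: [], buffer)
```

Filtering the resulting file on the status column then gives the three lists separately.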