Detailed evaluation of NER model trained from Prodigy annotations

Hi everyone,
I have been using Prodigy for a couple of weeks now and finding it extremely useful and intuitive. The ability to train spaCy models directly is also a really nice feature.
I am, however, a little stuck in my process of annotating and validating model predictions. My workflow is:

  • Create a few terms, use sense2vec to enrich that list
  • Make a few annotations
  • Train model
  • Iterate with more annotations and further training.

However, I'm finding it quite hard to evaluate the model beyond the top-line training metrics. If this were a scikit-learn model, I would load the ground-truth labels, score the data with the model, and then explore the records where there is a mismatch. This usually helps me understand whether the model is actually picking up additional positive examples I didn't label.

I can't seem to find a straightforward way of doing a similar analysis in either Prodigy or spaCy. So my eyeball eval flow looks like:

  • Export the dataset with gold annotations (db-out)
  • Load the trained spaCy model
  • Iterate over the exported annotations, run each "text" through the model to get a Doc object, and extract the predicted entities
  • Extract the entities from the annotations
  • Load both the annotated and the predicted entities into a dataframe

Then I can inspect the results and see where the model is predicting additional valid examples, which examples it is finding hard to match, etc.
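The flow above could be sketched roughly as follows. This is a minimal sketch, not an official Prodigy or spaCy utility: the helper names are made up for illustration, and it assumes the exported JSONL uses the usual task fields ("text", "spans" with "start", "end" and "label", plus "answer"). `nlp` is any loaded spaCy pipeline, passed in by the caller.

```python
import json


def gold_spans(task):
    """Annotated spans of one exported task as (start, end, label) tuples."""
    return {(s["start"], s["end"], s["label"]) for s in task.get("spans") or []}


def predicted_spans(doc):
    """Entities of a processed Doc as (start, end, label) tuples."""
    return {(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents}


def compare_task(task, doc):
    """Return one row per span, marked as match, missed or extra."""
    text = task["text"]
    gold, pred = gold_spans(task), predicted_spans(doc)
    buckets = (
        ("match", gold & pred),   # annotated and predicted
        ("missed", gold - pred),  # annotated, but not predicted
        ("extra", pred - gold),   # predicted, but not annotated
    )
    rows = []
    for status, spans in buckets:
        for start, end, label in sorted(spans):
            rows.append({
                "status": status,
                "label": label,
                "span": text[start:end],
                "start": start,
                "end": end,
                "text": text,
            })
    return rows


def compare_export(jsonl_path, nlp):
    """Compare every accepted task in a JSONL export against the model."""
    rows = []
    with open(jsonl_path, encoding="utf8") as f:
        for line in f:
            if not line.strip():
                continue
            task = json.loads(line)
            # Assumption: only accepted answers count as gold annotations.
            if task.get("answer", "accept") != "accept":
                continue
            rows.extend(compare_task(task, nlp(task["text"])))
    return rows
```

The returned rows can be passed straight to `pandas.DataFrame(rows)` and filtered on `status`. Note that the comparison is exact-match only: a prediction with slightly different boundaries shows up once as "missed" and once as "extra".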

However, the above seems a rather involved way of carrying out the process. Maybe there's a better way?
Any help is really appreciated!

Thank you!

Hi! If your goal is to put together a dataframe and explore the predictions, a simpler solution would be to run data-to-spacy to export your annotations in spaCy's format. You can use this for training, and also for easy access to the annotations.

Under the hood, the .spacy files are just collections of Doc objects. So if you load them back in, you get a Doc object with doc.ents, just like you'd get from running a model over your text. You can then compare those entities to the predictions by one or more models on the same text, and store the results or differences in a dataframe.
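As a rough illustration of loading a .spacy file and comparing it to a model, here is a minimal sketch. The function names and the paths passed to them are hypothetical; `DocBin.from_disk` and `DocBin.get_docs` are the spaCy calls for reading the file back in. spaCy is imported inside the loader so that the comparison helpers work on any objects that expose `.ents`.

```python
from collections import Counter


def load_gold_and_predictions(model_path, docbin_path):
    """Load gold Docs from a .spacy file and run the model on the same texts."""
    # Imported lazily so the helpers below can be used without spaCy.
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.load(model_path)
    gold_docs = list(DocBin().from_disk(docbin_path).get_docs(nlp.vocab))
    pred_docs = list(nlp.pipe(doc.text for doc in gold_docs))
    return gold_docs, pred_docs


def ent_set(doc):
    """Entities of a Doc as (start_char, end_char, label) tuples."""
    return {(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents}


def label_summary(gold_docs, pred_docs):
    """Count exact matches, missed and extra entities per label."""
    counts = Counter()
    for gold_doc, pred_doc in zip(gold_docs, pred_docs):
        gold, pred = ent_set(gold_doc), ent_set(pred_doc)
        for _, _, label in gold & pred:
            counts[(label, "match")] += 1
        for _, _, label in gold - pred:
            counts[(label, "missed")] += 1
        for _, _, label in pred - gold:
            counts[(label, "extra")] += 1
    return counts
```

The counter keys are (label, status) pairs, so turning the result into a dataframe or a small per-label table is a one-liner. Labels with many "extra" entries are good candidates for checking whether the model found valid examples that were never annotated.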

If you prefer a more visual approach, you could also build a little annotation workflow that loads in your existing annotations and adds entries to the "spans" for the model's predictions, e.g. using different labels like model:ORG and data:ORG etc., maybe even with different custom colours for the different label types. If you use the spans_manual UI to render it, you can view multiple overlapping spans and view how the predictions compare to the original data. Even if you just skip through the results and don't actually annotate anything, it could be a nice way to visualize the results.

Hi Ines,
Thank you for your response, this is indeed very helpful. I was looking into the Example and Scorer classes from spaCy, and now that you mention that the data-to-spacy output is a collection of Docs, maybe I could indeed use them to build Examples.
Re the visual approach: this is a neat suggestion. Is there an example in the docs of how a custom workflow can be built? Do you mean a custom recipe?
Appreciate the help and thanks again for your helpful response :slight_smile:

Yes, exactly! You can read more about custom recipes here: I think this recipe should be a really good starting point and it already does most of what you want:

The only difference in your case would be that you want to use "view_id": "spans_manual" to support overlapping spans. And instead of resetting the spans = [], you'd add the predicted spans on top of the spans that are already annotated in the input data.

To distinguish the predicted spans, you could use a label like f"MODEL:{ent.label_}" (or maybe just M:, since that's shorter). If you want it to look fancier, you can also add some custom label colours for the different labels in your data, e.g. one version for MODEL:{label} and one for the regular label. So your annotated labels could be blue, and the predicted labels could be red, or something like that :slight_smile:
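A minimal sketch of the span-merging step, under a few assumptions: the helper names and colour values are made up for illustration, the span keys ("start", "end", "token_start", "token_end", "label") follow the format manual span interfaces usually expect, with an inclusive "token_end", and the tasks still need matching "tokens" added elsewhere in the recipe.

```python
import copy


def add_model_spans(task, doc, prefix="MODEL:"):
    """Return a copy of the task with the model's entities added as extra spans."""
    task = copy.deepcopy(task)
    # Keep the annotated spans instead of resetting them to [].
    spans = list(task.get("spans") or [])
    for ent in doc.ents:
        spans.append({
            "start": ent.start_char,
            "end": ent.end_char,
            "token_start": ent.start,
            "token_end": ent.end - 1,  # assumed inclusive, ent.end is exclusive
            "label": f"{prefix}{ent.label_}",
        })
    task["spans"] = spans
    return task


def label_colours(labels, data_colour="#4c8bf5", model_colour="#e5484d",
                  prefix="MODEL:"):
    """Map each label to one colour and its prefixed variant to another."""
    colours = {}
    for label in labels:
        colours[label] = data_colour
        colours[f"{prefix}{label}"] = model_colour
    return colours
```

Inside the recipe's task generator, `add_model_spans(eg, doc)` would replace the step that builds the spans from scratch. As far as I understand the config, the colour mapping can be passed as `"custom_theme": {"labels": label_colours(labels)}`, and the prefixed labels probably need to be included in the recipe's label list as well; both are worth checking against the Prodigy docs.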

Hi Ines,
Thank you for the help! I now have a recipe that indeed shows me both the predicted labels and my annotations :slight_smile:

The only outstanding point is the dataset parameter for the recipe. Since the recipe doesn't need to save new annotations, I was wondering if there's a way to boot up without a dataset. When I remove all references to it from the recipe, I get this exception, which seems to be coming from the library itself.

✘ Invalid components returned by recipe 'ner.model-evaluation'

dataset   field required

{'view_id': 'spans_manual', 'stream': <generator object make_tasks at 0x13dbc4ac0>, 'config': {'lang': 'en', 'labels': ['DATA_ENTRY']}}

Seems like a pain to have to have a dummy dataset ¯_(ツ)_/¯

Awesome, thanks for updating :tada:

If you don't want to save anything to a dataset, you can just set "dataset": False explicitly – sorry, this might be slightly under-documented at the moment because it's a fairly rare use case.
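For illustration, the components returned by such a recipe could look like the sketch below. The function name is hypothetical, and the other keys simply mirror the dictionary shown in the error message above.

```python
def make_components(stream, labels):
    """Build recipe components for a view-only session that saves nothing."""
    return {
        "dataset": False,  # explicitly disable saving to a dataset
        "view_id": "spans_manual",
        "stream": stream,
        "config": {"lang": "en", "labels": list(labels)},
    }
```

The key has to be present and set to False; leaving it out is what triggers the "field required" error.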

that works nicely! :tada:

Re documentation: agreed, it's not a common case. I'm wondering whether this inspection will actually make me want to add more cases where the model is identifying entities that I failed to label.
The dataset will be back :sunglasses:

Thanks for the help and awesome libraries!
