Evaluating Precision and Recall of NER

I’m asking something a little different. Like you, I find it easiest to write my own P/R code. (And it appears that Scorer and nlp.evaluate are utilities that calculate P/R from the spaCy data structures.) But additionally I want to calculate P/R at a given threshold t: the model I’m evaluating only counts as hypothesizing an entity if its confidence score for that entity is above t. The goal is to run this for a range of thresholds and plot an ROC-style curve of F-score against threshold.
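
Concretely, the per-threshold calculation I have in mind is something like this sketch (pr_at_threshold, gold_spans, and scored_spans are just names I made up, and exact span matching stands in for whatever matching criterion the real evaluation code uses):

def pr_at_threshold(gold_spans, scored_spans, t):
    # gold_spans: set of (start, end, label) tuples from the annotations.
    # scored_spans: dict mapping (start, end, label) -> model confidence.
    # Only spans the model is confident enough about count as hypotheses.
    predicted = {span for span, score in scored_spans.items() if score > t}
    tp = len(predicted & gold_spans)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score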

The part I can’t figure out is how to get the model to return scores for the entities it hypothesizes. I thought that was what the cats attribute was for, but that doesn’t behave the way I’d expect.

>>> nlp = spacy.load("en_core_web_lg")
>>> doc = nlp("This is America.")
>>> [entity.label_ for entity in doc.ents]
['GPE']
>>> doc.cats
{}
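
As far as I can tell, doc.cats is only populated by a textcat component, which is why it stays empty here; the NER pipe doesn’t seem to write its confidences anywhere on the Doc. The closest thing I’ve found is the beam-parse workaround people suggest for spaCy v2. A sketch (I’m not certain the beam_width/beam_density settings are right, they’re just the values I’ve seen used):

import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_lg")
# Run the pipeline without NER so the doc can be beam-parsed separately.
docs = list(nlp.pipe(["This is America."], disable=["ner"]))
beams = nlp.entity.beam_parse(docs, beam_width=16, beam_density=0.0001)

entity_scores = defaultdict(float)
for doc, beam in zip(docs, beams):
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        # ents is a list of (start_token, end_token, label) candidates;
        # summing the scores of the beams each one appears in gives a
        # per-entity confidence.
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

That gives me numbers that look like probabilities, but it feels roundabout, which is part of why I’m wondering about EntityRecognizer.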

The documentation and recipe code make it look like the EntityRecognizer is what I want. You initialize it with a model and it then returns entities and scores, but I’m not sure what to pass as input to EntityRecognizer.

That last evaluate function I wrote above does everything I want, except that it uses all the entities the model hypothesizes to calculate the score. I can’t figure out how to select just the subset of entities with a score > t.
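
So if per-entity scores like the entity_scores dict above are the right thing to use, the subset selection seems easy enough, and the curve is just a sweep (reusing the pr_at_threshold sketch from earlier; gold_spans is still a placeholder for my gold annotations):

thresholds = [i / 20 for i in range(21)]  # 0.0, 0.05, ..., 1.0
curve = [(t, *pr_at_threshold(gold_spans, entity_scores, t)) for t in thresholds]

What I can’t find is the supported way to get those per-entity scores out of the pipeline in the first place.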