Evaluation of rule-based matching


I have an NER model that includes a custom matcher component I created in spaCy. Before I start training a neural NER, I want to see how well a rule-based approach is doing. I created an evaluation dataset using ner.eval (which was awesomely easy!), but I’m having trouble finding a way to test my rule-based model against that data. After reading the docs and watching some videos, I understand how to evaluate a neural NER model, but I wasn’t able to find a simple way of evaluating a rule-based matcher. I’m wondering if I am missing a simple way to do that?

Below is some information about my use case, which might or might not provide some relevant context.

Part of my task is to look for mentions of specific performance metrics in corporate earnings reports and classify them according to whether they are explicitly defined according to Generally Accepted Accounting Principles (GAAP), explicitly defined as not following GAAP, or whether there is no reference to GAAP at all. That is probably confusing, so here is an example.

I want the word “earnings” in the sentence “On the GAAP basis, the earnings were $1 per share” to be assigned entity “earnings_gaap”, the word “earnings” in the sentence “On the non-GAAP basis, the earnings were $1 per share” to be assigned entity “earnings_non_gaap” and the word “earnings” in the sentence “The earnings were $1 per share” to be assigned entity “earnings_non_specified”.

The task appears to be well suited for rule-based matching. I have created a custom matcher in spaCy that finds mentions of “earnings” and then looks at the context before and after the mention for markers associated with GAAP / non-GAAP reporting. It seems to work reasonably well, but there are a lot of specific patterns in the data I need to account for.

I’m not quite sure what to do after creating an evaluation dataset with ner.eval. Evaluation in Prodigy seems to be tied to training a neural NER model (in ner.batch_train, for example), but I suppose there might be a way more suitable for my case. ner.compare might be the way to go, but I don’t quite understand how to get the required inputs.

Could you please point out some relevant resources? Thank you!

You’re right that we don’t actually have a recipe for this built in, which is an oversight. Still, there’s some value in writing the evaluation code for these things, as it means you can make sure you’re able to get all the detail you need.

One way to implement evaluation functions is using sets. You then have:

true_positives = guesses.intersection(truth)
false_positives = guesses - truth
false_negatives = truth - guesses

precision = len(true_positives) / len(guesses)
recall = len(true_positives) / len(truth)
fscore = 2 * ((precision * recall) / (precision + recall + 1e-100))

There’s a helper for this in spacy.scorer: https://github.com/explosion/spaCy/blob/master/spacy/scorer.py#L7

When you make your sets, make sure that you’re representing the spans by the start and end offsets with the label, instead of just the text. It’s not so relevant in your case, but it covers you if you do have inputs with multiple annotations that have the same text content. A tuple (start, end, label) will be hashable, so you can store it in a set.
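To see why the offsets matter, here is a minimal sketch in plain Python. The example text, the spans, and the helper name prf are made up for illustration; only the set arithmetic comes from the description above.

```python
def prf(guesses, truth):
    """Return (precision, recall, fscore) for two sets of hashable items."""
    true_positives = guesses & truth
    precision = len(true_positives) / len(guesses) if guesses else 0.0
    recall = len(true_positives) / len(truth) if truth else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


# Hypothetical text in which "earnings" occurs twice with different labels.
text = "GAAP earnings were $1 per share. The earnings rose."
first = text.index("earnings")
second = text.index("earnings", first + 1)
width = len("earnings")

truth = {
    (first, first + width, "earnings_gaap"),
    (second, second + width, "earnings_non_specified"),
}
# The matcher finds both mentions but mislabels the second one.
guesses = {
    (first, first + width, "earnings_gaap"),
    (second, second + width, "earnings_gaap"),
}

# Keyed by offsets, the wrong label on the second mention is counted.
by_offset = prf(guesses, truth)
print("by offset:", by_offset)

# Keyed by text, the two guesses collapse into one and the error disappears
# from precision.
text_guesses = {(text[s:e], label) for s, e, label in guesses}
text_truth = {(text[s:e], label) for s, e, label in truth}
by_text = prf(text_guesses, text_truth)
print("by text:", by_text)
```

With text-only keys the two guesses become a single set element, so the mislabelled mention no longer counts against precision. That is the failure mode the (start, end, label) tuples avoid.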

If you’re making the set over a whole dataset, you’ll also want to add in the input hash, to make sure you’re referring to the right examples. All up, it should be as easy as this:

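from prodigy.components.db import connect
from spacy.scorer import PRFScore

def get_annotations(examples):
    # Key each span by (input hash, start, end, label) so identical
    # offsets in different texts stay distinct.
    annotations = set()
    for eg in examples:
        for span in eg.get("spans", []):
            annotations.add((eg["_input_hash"], span["start"], span["end"], span["label"]))
    return annotations

DB = connect()
# The gold annotations are the truth; the matcher output holds the guesses.
# "gold_annotations" is a placeholder for your dataset name, and
# matcher_output is your matcher's predictions as a list of examples in the
# same format (with "_input_hash" and "spans").
truth = get_annotations(DB.get_dataset("gold_annotations"))
guesses = get_annotations(matcher_output)
scores = PRFScore()
scores.score_set(guesses, truth)
print(scores.precision, scores.recall, scores.fscore)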
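If you want more detail than a single overall score, one option is to break the numbers down by label. The following is only a sketch in plain Python: the function name per_label_scores and the sample tuples are invented for illustration, and it assumes each annotation is a tuple whose last element is the label, like the ones built by get_annotations above.

```python
from collections import defaultdict


def per_label_scores(guesses, truth):
    """Score each label separately; the label is the last tuple element."""
    by_label = defaultdict(lambda: (set(), set()))
    for item in guesses:
        by_label[item[-1]][0].add(item)
    for item in truth:
        by_label[item[-1]][1].add(item)
    results = {}
    for label, (guessed, gold) in sorted(by_label.items()):
        tp = len(guessed & gold)
        precision = tp / len(guessed) if guessed else 0.0
        recall = tp / len(gold) if gold else 0.0
        total = precision + recall
        fscore = 2 * precision * recall / total if total else 0.0
        results[label] = (precision, recall, fscore)
    return results


# Made-up tuples of the form (input_hash, start, end, label).
truth = {
    (101, 23, 31, "earnings_gaap"),
    (102, 27, 35, "earnings_non_gaap"),
    (103, 4, 12, "earnings_non_specified"),
}
guesses = {
    (101, 23, 31, "earnings_gaap"),
    (102, 27, 35, "earnings_gaap"),  # right span, wrong label
    (103, 4, 12, "earnings_non_specified"),
}

results = per_label_scores(guesses, truth)
for label, (p, r, f) in results.items():
    print(f"{label}: P={p:.2f} R={r:.2f} F={f:.2f}")
```

A mislabelled span shows up twice here: as a false positive for the label the matcher chose and as a false negative for the label it should have had, which makes it easier to see which patterns need work.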

Thank you very much for the great support and an amazing tool!

Note that truth should come from the gold annotations in the DB, and guesses from the matcher output.
