Hello,
I have an NER model that includes a custom matcher component I created in spaCy. Before I start training a neural NER model, I want to see how well a rule-based approach does. I created an evaluation dataset using ner.eval (which was awesomely easy!), but I’m having trouble finding a way to test my rule-based model against that data. After reading the docs and watching some videos, I understand how to evaluate a neural NER model, but I wasn’t able to find a simple way of evaluating a rule-based matcher. Am I missing a simple way to do that?
Below is some information about my use case, which might or might not provide some relevant context.
Part of my task is to look for mentions of specific performance metrics in corporate earnings reports and classify them according to whether they are explicitly defined according to Generally Accepted Accounting Principles (GAAP), explicitly defined as not following GAAP, or whether there is no reference to GAAP at all. That is probably confusing, so here is an example.
I want the word “earnings” in the sentence “On the GAAP basis, the earnings were $1 per share” to be assigned the entity “earnings_gaap”, the word “earnings” in the sentence “On the non-GAAP basis, the earnings were $1 per share” to be assigned the entity “earnings_non_gaap”, and the word “earnings” in the sentence “The earnings were $1 per share” to be assigned the entity “earnings_non_specified”.
The task appears to be well suited for rule-based matching. I have created a custom matcher in spaCy that finds mentions of “earnings” and then looks at the context before and after the mention for markers associated with GAAP / non-GAAP reporting. It seems to work reasonably well, but there are a lot of specific patterns in the data that I need to account for.
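A simplified plain-Python sketch of that context-window logic, for illustration only. This is not the actual spaCy component: the regex tokenizer, the five-token window, the marker strings, and the function name `label_earnings` are all assumptions made to keep the example self-contained.

```python
import re

# Hypothetical tokenizer: hyphenated words, dollar amounts/numbers, or any other symbol.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*|\$?\d+(?:\.\d+)?|\S")


def label_earnings(text, window=5):
    """Return (start, end, label) character spans for each mention of "earnings".

    The label depends on GAAP markers found within `window` tokens
    on either side of the mention (an assumed, simplified rule).
    """
    tokens = [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]
    spans = []
    for i, (tok, start, end) in enumerate(tokens):
        if tok.lower() != "earnings":
            continue
        # Lowercased tokens in the window around the mention.
        context = [t.lower() for t, _, _ in tokens[max(0, i - window): i + window + 1]]
        # Check the more specific marker first, since "non-gaap" also signals GAAP context.
        if "non-gaap" in context:
            label = "earnings_non_gaap"
        elif "gaap" in context:
            label = "earnings_gaap"
        else:
            label = "earnings_non_specified"
        spans.append((start, end, label))
    return spans


for sentence in [
    "On the GAAP basis, the earnings were $1 per share",
    "On the non-GAAP basis, the earnings were $1 per share",
    "The earnings were $1 per share",
]:
    print(label_earnings(sentence))
```

Returning character offsets together with a label keeps the output in the same general shape as entity spans, which makes it easier to compare against annotated data later.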
I’m not quite sure what to do after creating an evaluation dataset with ner.eval. Evaluation in Prodigy seems to be tied to training a neural NER model (in ner.batch_train, for example), but I suppose there might be a way that is more suitable for my case. ner.compare might be the way to go, but I don’t quite understand how to get the required inputs.
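The kind of comparison in question could be sketched roughly as follows. This is not a built-in Prodigy recipe: it assumes the evaluation set can be exported as records with a `text` field and a list of `spans` (each with `start`, `end`, and `label`), and the names `score_spans` and `predict` are hypothetical. Scoring is by exact match on offsets and label.

```python
def score_spans(examples, predict):
    """Compare predicted spans against gold spans with exact-match scoring.

    `examples`: iterable of dicts with "text" and "spans" (assumed record shape).
    `predict`: any callable mapping text -> iterable of (start, end, label) tuples,
               for example a wrapper around a rule-based matcher.
    """
    tp = fp = fn = 0
    for eg in examples:
        gold = {(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])}
        pred = set(predict(eg["text"]))
        tp += len(gold & pred)   # spans found with the right offsets and label
        fp += len(pred - gold)   # predicted spans not in the gold data
        fn += len(gold - pred)   # gold spans the rules missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Here `predict` would wrap the rule-based pipeline and return the character offsets and labels of the entities it finds, so the same function could later score a trained model for a like-for-like comparison.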
Could you please point out some relevant resources? Thank you!