Hi! Sorry if this was confusing – the idea of the ner.eval-ab recipe is that it lets you run a quick “live evaluation” with two models by comparing their output on the given input data.
So instead of having to create a gold-standard evaluation set from scratch, you can quickly click through a bunch of examples and already get an idea of how your models are performing. Because the feedback you give is binary (e.g. green or red), the evaluation process also lets you capture which analysis is better and which model’s output you or the annotator preferred overall. Even two models with similar accuracy scores can produce different parses – and one model’s analysis could be much better than the other’s, even if they both make the same number of mistakes in total.
tl;dr: Yes, the quickest way to use ner.eval-ab would probably be to extract the texts from your existing set and load them in as the input data (fourth argument), and use a new dataset to store the new AB annotations you create with the recipe. I’d recommend starting off with a few hundred AB annotations and repeating the process every once in a while as you update and train new models.
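In case it helps, here’s a minimal sketch of the extraction step – it assumes your existing examples are dicts with a `"text"` key and writes them out as JSONL, one `{"text": ...}` object per line, which is a format Prodigy can read as input data. The file name, dataset name and model names below are all placeholders:

```python
import json

def texts_to_jsonl(examples, path):
    # Write only the raw texts to a JSONL file, one JSON object per line,
    # dropping any existing annotations like "spans"
    with open(path, "w", encoding="utf-8") as f:
        for eg in examples:
            f.write(json.dumps({"text": eg["text"]}) + "\n")

# Hypothetical existing annotations (structure assumed)
existing = [
    {"text": "Apple is opening a store in London.", "spans": []},
    {"text": "She works at Google in Berlin.", "spans": []},
]

texts_to_jsonl(existing, "eval_input.jsonl")

# Then run the recipe with a *new* dataset for the AB annotations,
# something like (names are placeholders):
# prodigy ner.eval-ab eval_ab_dataset model_a model_b eval_input.jsonl
```

The main point is that the input file only needs the texts – the recipe generates the two analyses to compare itself, so you don’t need to carry over your existing span annotations.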