Eval AB confusing interface

Hi there,

I wanted to compare two models using the ner.eval-ab recipe, and I found the interface very confusing: it shows the same Accept (a) and Reject (x) buttons at the bottom, and I couldn't work out what they mean here. Since the button colors match the two 'answer' options you're asked to choose between, I didn't know whether their purpose is to pick between the green and red answers, or to accept/reject the best answer chosen by Prodigy.

prodigy ner.eval-ab [dataset] [before_model] [after_model] [source] [--api] [--loader] [--label] [--exclude] [--unsegmented]

If a color is assigned to a specific model:

  • is [before_model] the red one and [after_model] the green one - is this color scheme right? If yes, then a legend of the colors would be helpful, and maybe the Accept/Reject buttons could be changed to A/B.

If colors are not assigned to models, how do I know which model predicted what?

Sorry if this was confusing. The idea behind the interface is that it randomises A and B and lets you conduct a somewhat blind evaluation. So the fact that you don’t know which model is which is actually a feature :wink:

The concept is this: You get shown two outputs and have to decide which one is better, green or red. Or, put differently: You get shown a preferred output and a dispreferred output and have to decide whether you agree (green) or not (red). In any case, you'll be pressing the button whose color corresponds to the output you prefer.

Which output is chosen for A and B is randomised, to make it harder to accidentally bias the evaluation (for example, if you know which output was generated by the new fancy model you spent weeks working on, you might prefer it more often than you should). The A/B mapping is stored with the original task, so you'll always be able to resolve the annotations back to the models. After you quit the server, Prodigy will also resolve the mappings and tell you which model you preferred.
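To make the idea concrete, here's a simplified, stand-alone illustration of how such a mapping can work. This is not Prodigy's actual implementation - the make_ab_task / resolve_preferences names and the "answer" key are made up for this example:

    import random

    def make_ab_task(text, output_before, output_after):
        # Randomly decide which model's output is shown as A and which as B,
        # and store the mapping with the task so it can be resolved later.
        mapping = {"A": "before", "B": "after"}
        if random.random() < 0.5:
            mapping = {"A": "after", "B": "before"}
        outputs = {"before": output_before, "after": output_after}
        return {
            "input": {"text": text},
            "A": outputs[mapping["A"]],
            "B": outputs[mapping["B"]],
            "mapping": mapping,  # saved with the annotation, never shown in the UI
        }

    def resolve_preferences(annotated_tasks):
        # After annotation, map each A/B answer back to the model that produced it.
        counts = {"before": 0, "after": 0}
        for task in annotated_tasks:
            preferred = task["answer"]  # "A" or "B"
            counts[task["mapping"][preferred]] += 1
        return counts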

If you do need an evaluation process that shows you which output is which, you can always write your own recipe that uses the compare interface and doesn’t do any shuffling. You could also modify the ner.eval-ab recipe and remove that part – if you check out the code, you’ll see that it’s actually pretty straightforward.
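For example, a custom recipe without shuffling could look roughly like this. It's a minimal sketch, not a drop-in replacement: the recipe name, the plain-text entity formatting and the JSONL loader are assumptions, and you'd want to double-check the exact task format the compare interface expects against the docs:

    import spacy
    import prodigy
    from prodigy.components.loaders import JSONL

    def format_ents(doc):
        # Plain-text summary of the predicted entities - a simplification so the
        # example stays short; the built-in recipe renders the spans properly.
        return ", ".join(f"{ent.text} [{ent.label_}]" for ent in doc.ents) or "(no entities)"

    @prodigy.recipe("ner.compare-fixed")
    def ner_compare_fixed(dataset, before_model, after_model, source):
        nlp_before = spacy.load(before_model)
        nlp_after = spacy.load(after_model)

        def get_stream():
            for eg in JSONL(source):  # assumes a JSONL source with a "text" key
                yield {
                    "input": {"text": eg["text"]},
                    # No shuffling: "accept" (green) is always the after_model,
                    # "reject" (red) is always the before_model.
                    "accept": {"text": format_ents(nlp_after(eg["text"]))},
                    "reject": {"text": format_ents(nlp_before(eg["text"]))},
                }

        return {"dataset": dataset, "stream": get_stream(), "view_id": "compare"}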


Thank you @ines, now I understand it - the documentation does mention the bias aspect, but not as directly as you explained it here.