Workaround for inter-annotator agreement in currently available Prodigy?

We're considering purchasing a license for Prodigy, but we need to be able to calculate inter-annotator agreement to understand the complexity of our annotation task as we iterate on it.

I know this is coming for Prodigy Teams, but are there workarounds currently available so we don't have to build the functionality ourselves? This may be the difference between going this route and needing to go elsewhere. We'd prefer to stay here. We're all Python hackers!

Hi and yay for Python hacking! :nerd_face::snake:

The main thing you have to do yourself is the actual implementation of how you want to calculate the agreement. (Of course, there's no single easy answer here – it depends on what you need.) Everything else should already be there.

To collect annotations from multiple annotators, you can run separate instances with different named datasets, or use named multi-user sessions. You can then export the data and compare it to examples you already know the answers to, or use the review recipe (UI and data examples here) to re-annotate (a subset of) the data. When you stream in the data to annotate, you could also mix in examples you already know the answer to (and maybe add a key to the JSON record that makes it easy to filter them out later after they have been annotated).
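As a minimal sketch of that last idea, here's one way to mix known-answer examples into a stream inside a custom recipe. The file name, the mix-in ratio and the `is_gold` key are all made up for illustration – use whatever marker fits your setup:

```python
import copy
import random
import srsly  # Explosion's serialization helpers, installed alongside Prodigy

def mix_in_gold(stream, gold_path="gold_examples.jsonl", ratio=0.1):
    """Yield the original tasks and occasionally insert a known-answer example."""
    gold_examples = list(srsly.read_jsonl(gold_path))
    for eg in stream:
        yield eg
        if gold_examples and random.random() < ratio:
            gold = copy.deepcopy(random.choice(gold_examples))
            gold["is_gold"] = True  # custom key so you can filter these out later
            yield gold
```

You'd wrap your stream with a function like this before returning it from the recipe.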

At the end of it, you have multiple annotations on the same data (distinguishable by the hashes Prodigy assigns), the associated dataset names and/or session IDs telling you who annotated it, and the correct answer.
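For instance, if you've exported each annotator's dataset with `prodigy db-out`, a sketch like this (the file and annotator names are hypothetical) lines the answers up by task hash:

```python
from collections import defaultdict
import srsly

# Hypothetical exports, e.g. created with `prodigy db-out ner_alice > ner_alice.jsonl`
exports = {"alice": "ner_alice.jsonl", "bob": "ner_bob.jsonl"}

# Group answers by Prodigy's task hash: {task_hash: {annotator: example}}
by_task = defaultdict(dict)
for annotator, path in exports.items():
    for eg in srsly.read_jsonl(path):
        by_task[eg["_task_hash"]][annotator] = eg

# Keep only tasks that every annotator answered
complete = {h: egs for h, egs in by_task.items() if len(egs) == len(exports)}
print(f"{len(complete)} tasks annotated by everyone")
```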

You can then calculate who agreed with the correct answer, how well individual annotators agreed with each other and whether there are any outliers. For binary annotations, that's reasonably simple, because you only have two options: accept or reject. If the answer is different, that's a disagreement. For more fine-grained annotations, it's a little more complex – for instance, if you're collecting manual NER annotations, you may want to count true/false positives and negatives as well (just like you do when you evaluate a model).
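For the binary case, a small sketch (continuing from the `complete` dict above and assuming you've filtered out "ignore" answers) could compute raw agreement plus a chance-corrected score like Cohen's kappa via scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

tasks = sorted(complete)
alice = [complete[h]["alice"]["answer"] for h in tasks]  # "accept" / "reject"
bob = [complete[h]["bob"]["answer"] for h in tasks]

raw = sum(a == b for a, b in zip(alice, bob)) / len(tasks)
kappa = cohen_kappa_score(alice, bob)
print(f"raw agreement: {raw:.2%}, Cohen's kappa: {kappa:.2f}")
```

For manual NER annotations, one option is to treat one annotator (or your known-answer data) as the reference and score the other against it on exact span matches – the "spans" with "start", "end" and "label" are what the manual NER interface produces:

```python
def span_prf(gold_spans, pred_spans):
    """Precision/recall/F1 over exact (start, end, label) span matches."""
    gold = {(s["start"], s["end"], s["label"]) for s in gold_spans}
    pred = {(s["start"], s["end"], s["label"]) for s in pred_spans}
    tp, fp, fn = len(gold & pred), len(pred - gold), len(gold - pred)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```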