Hi and yay for Python hacking!
The main thing you have to do yourself is the actual implementation of how you want to calculate the agreement. (Of course, there's not one easy answer and it depends on what you need.) Everything else should already be there.
To collect annotations from multiple annotators, you can run separate instances with different named datasets, or use named multi-user sessions. You can then export the data and compare it to examples you already know the answers to, or use the
review recipe (UI and data examples here) to re-annotate (a subset of) the data. When you stream in the data to annotate, you could also mix in examples you already know the answer to (and maybe add a key to the JSON record that makes it easy to filter them out later after they have been annotated).
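To illustrate the last idea, here's a minimal sketch of mixing known-answer examples into the stream. `mix_in_gold` is a hypothetical helper (not a Prodigy API), and the `known_answer` key is just an example name for the flag you'd filter on later:

```python
import random

def mix_in_gold(stream, gold_examples, portion=0.1):
    """Yield the regular stream, occasionally mixing in examples
    with known answers, tagged with a custom key so they're easy
    to filter out of the exported annotations later."""
    # Copy the gold examples and add the flag (key name is arbitrary)
    gold = [dict(eg, known_answer=True) for eg in gold_examples]
    for eg in stream:
        if gold and random.random() < portion:
            yield random.choice(gold)
        yield eg

stream = [{"text": "Example one"}, {"text": "Example two"}]
gold = [{"text": "Known example"}]
mixed = list(mix_in_gold(stream, gold, portion=0.5))
```

All original examples still come through in order; only the mixed-in ones carry the flag.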
At the end of it, you have multiple annotations of the same data (matched up via the hashes Prodigy assigns), the dataset names and/or session IDs telling you who annotated each example, and the correct answers.
You can then calculate who agreed with the correct answer, how well individual annotators agreed with each other and whether there are any outliers. For binary annotations, that's reasonably simple, because there are only two options: accept or reject. If the answers differ, that's a disagreement. For more fine-grained annotations, it's a little more complex – for instance, if you're collecting manual NER annotations, you may want to count true/false positives/negatives as well (just like you do when you evaluate a model).
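For the binary case, a simple percent-agreement calculation could look like this. It's a sketch, not a finished metric: `_task_hash` and `answer` are keys you'll find in Prodigy's exported records, but the `annotator` field is an assumption – in practice you'd derive it from the dataset name or `_session_id`:

```python
from collections import defaultdict
from itertools import combinations

def pairwise_agreement(annotations):
    """For each pair of annotators, compute the fraction of shared
    tasks on which they gave the same binary answer."""
    # Group answers by task so we can compare annotators per example
    by_task = defaultdict(dict)
    for eg in annotations:
        by_task[eg["_task_hash"]][eg["annotator"]] = eg["answer"]
    pair_counts = defaultdict(lambda: [0, 0])  # pair -> [agreed, total]
    for answers in by_task.values():
        for a, b in combinations(sorted(answers), 2):
            counts = pair_counts[(a, b)]
            counts[1] += 1
            if answers[a] == answers[b]:
                counts[0] += 1
    return {pair: agreed / total for pair, (agreed, total) in pair_counts.items()}

data = [
    {"_task_hash": 1, "annotator": "alice", "answer": "accept"},
    {"_task_hash": 1, "annotator": "bob", "answer": "accept"},
    {"_task_hash": 2, "annotator": "alice", "answer": "accept"},
    {"_task_hash": 2, "annotator": "bob", "answer": "reject"},
]
print(pairwise_agreement(data))  # {('alice', 'bob'): 0.5}
```

Raw percent agreement doesn't correct for chance, so for a more serious analysis you might swap in something like Cohen's kappa – but the grouping-by-hash step stays the same.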