Calculating Inter-Annotator Agreement (IAA) for relationships

Hi,

I have relationships annotated with a custom recipe that used the relations interface to annotate NER + relations at the same time. How can I calculate the IAA for the relationship annotations?

I am not sure how to use Prodigy's iaa commands for relationships, or if that's a possibility.

Any advice you can provide me would be of great help.

Thanks!

Hi @ale,

The built-in IAA recipes currently consider annotations from either span/token or text/audio/image classification recipes. The relations interface is not supported yet.

You would need a custom script to compute IAA for your dataset. The tricky bit about calculating IAA on joint NER+relations annotations is that you need to consider both token-level annotations (for NER) and document-level annotations (for relations), and only compute relation agreement for examples where the child and head spans of the relations are agreed upon.
For this reason, I would recommend computing IAA for spans and relation labels separately.
This will also help you understand where most disagreements come from, i.e.:

  1. compute agreement on NER spans - I recommend using a pairwise F1 score for this. You will need to decide whether you accept only strict agreement or allow for some span boundary errors (see the first sketch below the list).
  2. compute agreement on relation labels for agreed-upon spans - You can use Fleiss' Kappa (Cohen's Kappa if you only work with two annotators) or Krippendorff's Alpha (Python package); see the second sketch below.
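
For step 1, here is a minimal sketch of what a pairwise, strict-match span F1 could look like. It assumes you have already grouped each annotator's spans by example (e.g. by the example's `_input_hash`) into sets of `(start, end, label)` tuples - that data layout is an assumption for illustration, not something Prodigy outputs directly:

```python
# Minimal sketch: pairwise span-level F1 between two annotators.
# Each annotator's data is assumed to be a dict mapping an example id
# to a set of (start, end, label) tuples -- this layout is illustrative.

def pairwise_span_f1(annotator_a, annotator_b):
    """Strict-match F1: a span only counts as agreed if start, end and label all match."""
    tp = fp = fn = 0
    shared_ids = set(annotator_a) & set(annotator_b)
    for example_id in shared_ids:
        spans_a = annotator_a[example_id]
        spans_b = annotator_b[example_id]
        tp += len(spans_a & spans_b)   # spans both annotators produced
        fp += len(spans_a - spans_b)   # only annotator A
        fn += len(spans_b - spans_a)   # only annotator B
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


if __name__ == "__main__":
    ann_a = {"ex1": {(0, 5, "PERSON"), (10, 15, "ORG")}}
    ann_b = {"ex1": {(0, 5, "PERSON"), (12, 15, "ORG")}}
    print(pairwise_span_f1(ann_a, ann_b))  # 0.5 with strict matching
```

If you want to allow for boundary errors, you would relax the strict set intersection into a custom matching function (e.g. accept overlapping spans with the same label).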

You can of course combine the two agreements in some way to get the final score, but it's probably best to report the two numbers separately.
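
For step 2, here is a similarly rough sketch using scikit-learn's `cohen_kappa_score` for the two-annotator case (with more annotators you would swap in a Fleiss' Kappa or Krippendorff's Alpha implementation). Keying relations by example id plus head/child offsets is again an assumption about how you would restructure your exported data, not something Prodigy produces directly:

```python
# Sketch for step 2: Cohen's Kappa on relation labels, restricted to
# relations whose head and child spans both annotators agreed on.

from sklearn.metrics import cohen_kappa_score


def relation_kappa(relations_a, relations_b):
    """relations_a/b: dicts mapping
    (example_id, head_start, head_end, child_start, child_end) -> relation label.
    Only keys present in both dicts (i.e. agreed-upon span pairs) contribute;
    you could also add a NO_RELATION label for span pairs one annotator left
    unconnected, depending on how strict you want to be."""
    shared_keys = sorted(set(relations_a) & set(relations_b))
    labels_a = [relations_a[key] for key in shared_keys]
    labels_b = [relations_b[key] for key in shared_keys]
    return cohen_kappa_score(labels_a, labels_b)


if __name__ == "__main__":
    rels_a = {("ex1", 0, 5, 10, 15): "WORKS_FOR", ("ex1", 0, 5, 20, 25): "LIVES_IN"}
    rels_b = {("ex1", 0, 5, 10, 15): "WORKS_FOR", ("ex1", 0, 5, 20, 25): "BORN_IN"}
    print(relation_kappa(rels_a, rels_b))
```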

Thanks for your feedback, @magdaaniol!

Indeed, I will keep IAA metrics for NER and relations separate. I will work on a custom script for the relations part.

On a related note, I used the metric.iaa.span recipe for the span IAA. Does this recipe take into account all examples that were either accepted, skipped, or rejected, or does it only focus on accepted examples?

Also, quick feedback: the documentation for metric.iaa.span under Built-in Recipes does not list the partial flag; it is only mentioned under Annotation Metrics. It may be good to add it to the Built-in Recipes page too.

Thanks!

Hi @ale,

> Does this recipe take into account all examples that were either accepted, skipped, or rejected, or does it only focus on accepted examples?

The metric.iaa.span recipe takes into account only accepted examples. The reject and ignore cases are interpreted as if the annotator didn't provide an annotation for the given example at all.
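
If you want to reproduce or sanity-check that behaviour in your own script, you can filter the exported examples by their answer field. A small sketch on a hypothetical JSONL export (e.g. from `prodigy db-out your_dataset > annotations.jsonl`; the filename is illustrative):

```python
# Keep only accepted examples, mirroring how metric.iaa.span treats answers.
import json

accepted = []
with open("annotations.jsonl", encoding="utf8") as f:
    for line in f:
        example = json.loads(line)
        # "reject" and "ignore" answers are treated as if no annotation was given
        if example.get("answer") == "accept":
            accepted.append(example)

print(f"{len(accepted)} accepted examples")
```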

Thanks so much for the feedback on the missing partial flag documentation! It's added now.