Proper way to calculate inter-annotator agreement for spans/NER?

How does one go about doing this? I was thinking of using the kappa score, but it is meant for mutually exclusive classification, not really for spans/NER. How would I go about computing an inter-annotator agreement score for an NER task where the annotators may have a different number of annotations per document?

Some ideas I was thinking about: find only the annotations that overlap, then calculate exact span matching (whether they are exactly the same span), percent overlap, and Levenshtein distance. Any help would be greatly appreciated.

hi @klopez!

Great question!

A couple of initial questions.

  • What's your end goal for this metric? Is it to use it as a measure of label quality (e.g., to flag examples with a lot of disagreement for review)? Or is it something different?

  • Can you give some specifics on what type of text you're applying it to? Is it technical, like medical or legal text, or more general, like news articles?

  • How many annotators will you have (a few, like 2-3, or many, like 5+)? And will they all review the same examples, or will each see different ones (e.g., every item is reviewed by three annotators but each annotator sees different examples)?

  • When you say overlap, I assume you mean the union of all annotated spans (e.g., all spans that had at least 1 annotator)?

I ask because these factors could make a difference in the results (e.g., very technical text with expert annotators may have a lot of overlap thanks to shared knowledge of entities, while open-ended tasks given to non-experts may have very little overlap in spans).

This is an open question but I can provide you a few references.

As you realized, the core problem with Kappa is that it requires both positive and negative examples. With NER, it may not be clear what those negative examples are, since any stretch of text that nobody annotated could count as one.

The pairwise F1 measure is a common alternative to Cohen's kappa for NER/spans. What's nice about it is that it doesn't require negative examples. Hripcsak & Rothschild, 2005 was an early paper that provided the formula and rationale. Deleger et al., 2012 also provides a good outline.

Grouin et al., 2011 may also help, as they calculate Kappa, Scott's Pi, or F-measure, but on a restricted set of candidates: terms marked by at least one annotator (e.g., pooling/span-overlap), n-grams only, or noun phrases only. They discuss some of the pros/cons of each approach.
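To make the pairwise F1 idea concrete, here is a minimal sketch for two annotators on one document. It assumes each annotation is a `(start, end, label)` tuple and counts only exact matches on all three fields; the `pairwise_f1` name and the example data are made up for illustration and are not part of Prodigy.

```python
from typing import Set, Tuple

# Hypothetical representation: (start_char, end_char, label)
Span = Tuple[int, int, str]


def pairwise_f1(spans_a: Set[Span], spans_b: Set[Span]) -> float:
    """Exact-match F1 between two annotators.

    One annotator is treated as the reference and the other as the
    prediction. Swapping them swaps precision and recall, so F1 is the same.
    """
    if not spans_a and not spans_b:
        return 1.0  # assumption: two empty annotation sets agree perfectly
    true_positives = len(spans_a & spans_b)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(spans_b)
    recall = true_positives / len(spans_a)
    return 2 * precision * recall / (precision + recall)


ann_a = {(0, 5, "PER"), (10, 18, "ORG"), (25, 31, "LOC")}
ann_b = {(0, 5, "PER"), (10, 18, "LOC"), (25, 31, "LOC"), (40, 44, "ORG")}

score = pairwise_f1(ann_a, ann_b)
print(f"pairwise F1: {score:.3f}")
```

Spans that neither annotator marked never enter the calculation, which is why no negative examples are needed. With more than two annotators, one option is to average this score over all annotator pairs.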

Wang et al., 2021 had a variety of different ways to calculate overlap (quoted from supplemental materials):

  • Exact span matches, where two annotators identified exact the same Named Entity text spans.
  • Relaxed span matches, where Named Entity text spans from two annotators overlap.
  • Exact concept matches, where within agreed text span, annotators assigned exact same concept class.
  • Parent concept matches, where the concept class assigned by one annotator is the parent class of the one by the other annotator.
  • Superclass concept matches, where the two concept classes assigned belong to the same superclass.
  • Ambiguity concept matches, where one annotator assigned a semantic ambiguous class which includes the concept assigned by the other annotator.
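The first three match types in the list above can be sketched in a few lines. This is only an illustration, assuming `(start, end, label)` tuples with an exclusive end offset; the function names are hypothetical. The parent, superclass, and ambiguity matches would need a label hierarchy, which is not modelled here.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: (start_char, end_char, label), end exclusive
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """True if the two character ranges share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def match_counts(spans_a: List[Span], spans_b: List[Span]) -> Dict[str, int]:
    """Count how many of annotator A's spans find a match in annotator B's."""
    return {
        # same start and end offsets, label ignored
        "exact_span": sum(
            1 for a in spans_a if any(a[:2] == b[:2] for b in spans_b)
        ),
        # any character overlap, label ignored
        "relaxed_span": sum(
            1 for a in spans_a if any(overlaps(a, b) for b in spans_b)
        ),
        # same offsets and same label
        "exact_concept": sum(
            1 for a in spans_a if any(a == b for b in spans_b)
        ),
    }


ann_a = [(0, 5, "PER"), (10, 18, "ORG"), (25, 31, "LOC")]
ann_b = [(0, 5, "PER"), (12, 18, "ORG"), (25, 31, "ORG")]

counts = match_counts(ann_a, ann_b)
print(counts)
```

The counts are taken from annotator A's side, so swapping the arguments can give different numbers when one annotator's span overlaps several of the other's.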

So overall I think your solution is on the right track, but considering F-measure along with different ways of defining overlap may help, depending on your context. Typically in these cases I would start as simple as possible and only add complexity when you notice issues with the simple metrics.

I've also asked a few colleagues to get their thoughts and will post them if they have additional ideas.

Thank you for your question!

Thank you @ryanwesslen! I actually just finished implementing this method using F1 score right before I saw this message.
I followed along with Hripcsak & Rothschild, 2005.
I dealt with overlapping annotations by combining all annotators' intervals, sorting them, and then merging the intervals that overlap. If an annotation did not overlap with anything (one annotator annotated something that another didn't), I set the values for that interval to the assigned label and -1. For example, each label is converted to a float (or int), and a -1 (say at interval [57,78]) means AG didn't label that interval but DC did. I then compute F1 on this table (the table represents the labels for one document) and average it over all documents.
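A minimal sketch of one way to read the procedure above, not the exact code used in this post: merge all intervals, build a per-document table with -1 where an annotator has no label, compute F1 per document, then average. The function names, the integer label ids, and the rule that an interval counts as agreement only when both labels are equal are assumptions made for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int, int]   # hypothetical: (start, end, label_id), label_id >= 0
Row = Tuple[int, int, int, int]  # (start, end, label_a, label_b)
MISSING = -1                  # annotator did not label this interval


def merge_intervals(intervals: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Sort the intervals and merge any that overlap."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def label_in(spans: List[Span], lo: int, hi: int) -> int:
    """Label of the first span overlapping [lo, hi), or MISSING."""
    for start, end, label in spans:
        if start < hi and lo < end:
            return label  # assumption: the first overlapping span wins
    return MISSING


def label_table(spans_a: List[Span], spans_b: List[Span]) -> List[Row]:
    """One row per merged interval, with each annotator's label."""
    merged = merge_intervals([(s, e) for s, e, _ in spans_a + spans_b])
    return [
        (lo, hi, label_in(spans_a, lo, hi), label_in(spans_b, lo, hi))
        for lo, hi in merged
    ]


def doc_f1(table: List[Row]) -> float:
    """F1 over one document's table: 2 * agreements / (labels_a + labels_b)."""
    agree = sum(1 for _, _, a, b in table if a == b != MISSING)
    n_a = sum(1 for _, _, a, _ in table if a != MISSING)
    n_b = sum(1 for _, _, _, b in table if b != MISSING)
    if n_a + n_b == 0:
        return 1.0  # assumption: nothing annotated by either counts as agreement
    return 2 * agree / (n_a + n_b)


# Two made-up documents, each a pair of (AG spans, DC spans)
docs = [
    ([(0, 5, 0), (10, 18, 1)], [(0, 5, 0), (12, 20, 1), (57, 78, 2)]),
    ([(3, 9, 1)], [(3, 9, 2)]),
]

tables = [label_table(ag, dc) for ag, dc in docs]
scores = [doc_f1(t) for t in tables]
mean_f1 = sum(scores) / len(scores)
print(tables[0])
print(f"mean F1 over documents: {mean_f1:.2f}")
```

In the first made-up document, the interval [57,78] gets -1 for AG because only DC labeled it, which mirrors the case described above.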

If there is a want/need for something like this, I would be more than happy to contribute code.

This is great. No need to provide anything, but you're welcome to post code snippets to help others.

What was most helpful was asking the question, so hopefully this can help future Prodigy users. Thank you!