Hello, I'm trying to see if there's a way to output the average pairwise F1 score (through the iaa.metric.span recipe) on a per-task basis, rather than on a per-label basis across all tasks (which is what iaa.metric.span currently generates). For example, we have a trained span categorizer producing span categorizations for a set of tasks, where each task needs to be seen/corrected by 5 people. I want to compute the average pairwise F1 score across all labels for one given task at a time.
Is there any way I might be able to do this?
The short answer would be no, it's currently not possible to specify whether the metrics should be averaged over tasks or over labels.
That said, IAA recipes are a new feature and we are on the lookout for possible improvements to the API.
I could imagine another viable "grouping" being by annotator. This of course wouldn't make sense for the doc-level, probability-based metrics, but it is definitely an option for the pairwise F1.
I was wondering, what kind of actionable insights are you looking for? Inferring the complexity of each task based on the agreement? Automatically selecting tasks for adjudication? For that you'd need access to each task's F1; is that what you're after? I'm not sure what could be inferred from an aggregate (i.e. an average over all the per-task F1 scores) that would be different from an aggregate over the label-based F1 scores.
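In the meantime, if you do need each task's score (e.g. to pick tasks for adjudication), one workaround could be to export the dataset with `prodigy db-out` and compute a pairwise F1 per task yourself. Below is a rough sketch, not the recipe's implementation: it assumes each record carries `_input_hash` plus an `_annotator_id` or `_session_id`, and it uses a simple exact-match F1 over `(start, end, label)` tuples, which may differ from how the recipe scores spans, so adjust it to your data.

```python
# Rough per-task pairwise F1 sketch, assuming the dataset was exported with:
#   prodigy db-out <dataset> > annotations.jsonl
# Field names (_input_hash, _annotator_id/_session_id, spans with start/end/label)
# follow Prodigy's usual task format; double-check them against your own data.
import json
from collections import defaultdict
from itertools import combinations


def span_set(example):
    # Represent each span by (start, end, label) for exact-match comparison.
    return {(s["start"], s["end"], s["label"]) for s in example.get("spans", [])}


def pairwise_f1(spans_a, spans_b):
    # Exact-match F1 between two annotators' span sets; both empty counts as agreement.
    if not spans_a and not spans_b:
        return 1.0
    overlap = len(spans_a & spans_b)
    return 2 * overlap / (len(spans_a) + len(spans_b))


def per_task_f1(path):
    # Group examples by task (_input_hash), keeping one span set per annotator.
    by_task = defaultdict(dict)
    with open(path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            annotator = eg.get("_annotator_id") or eg.get("_session_id")
            by_task[eg["_input_hash"]][annotator] = span_set(eg)
    # Average F1 over all annotator pairs for each task, across all labels.
    results = {}
    for task_hash, annotations in by_task.items():
        scores = [
            pairwise_f1(annotations[a], annotations[b])
            for a, b in combinations(annotations, 2)
        ]
        results[task_hash] = sum(scores) / len(scores) if scores else None
    return results


if __name__ == "__main__":
    for task_hash, score in per_task_f1("annotations.jsonl").items():
        print(task_hash, score)
```

With 5 annotators per task that averages over the 10 annotator pairs for each task, which would let you rank tasks by agreement rather than looking at a single label-level aggregate.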