IAA metrics error

I am testing out the latest release (v1.14.4) and the new IAA metrics. I tried a few commands for a spans task, and I got the following errors. This dataset has been annotated by two annotators and contains multiple labels.

prodigy metric.iaa.span dataset:ner_task_v2 multilabel -l ACCOUNT,ACTIVITY,AMOUNT,BANK,CARDINAL,CURRENCY,DATE,FAC,FREQUENCY,GPE,LANGUAGE,ORDINAL,ORG,PERCENT,PERSON,STT_ERROR,VEHICLE
Using 17 label(s): ACCOUNT, ACTIVITY, AMOUNT, BANK, CARDINAL, CURRENCY, DATE,
FAC, FREQUENCY, GPE, LANGUAGE, ORDINAL, ORG, PERCENT, PERSON, STT_ERROR, VEHICLE
unrecognized arguments: multilabel
prodigy metric.iaa.doc dataset:ner_task_v2 multilabel -l ACCOUNT,ACTIVITY,AMOUNT,BANK,CARDINAL,CURRENCY,DATE,FAC,FREQUENCY,GPE,LANGUAGE,ORDINAL,ORG,PERCENT,PERSON,STT_ERROR,VEHICLE

ℹ Using 2 annotator IDs: ner_task_v2-annotator1, ner_task_v2-annotator2

✘ Requested labels: FREQUENCY, GPE, CARDINAL, FAC, BANK, STT_ERROR,
LANGUAGE, ACCOUNT, ORG, VEHICLE, AMOUNT, CURRENCY, ACTIVITY, PERCENT, PERSON,
DATE, ORDINAL were not found in the dataset. Found labels: .
prodigy metric.iaa.span dataset:ner_task_v2 -l ACCOUNT,ACTIVITY,AMOUNT,BANK,CARDINAL,CURRENCY,DATE,FAC,FREQUENCY,GPE,LANGUAGE,ORDINAL,ORG,PERCENT,PERSON,STT_ERROR,VEHICLE
Using 17 label(s): ACCOUNT, ACTIVITY, AMOUNT, BANK, CARDINAL, CURRENCY, DATE,
FAC, FREQUENCY, GPE, LANGUAGE, ORDINAL, ORG, PERCENT, PERSON, STT_ERROR, VEHICLE
ℹ Using 2 annotator IDs: ner_task_v2-annotator1, ner_task_v2-annotator2

✘ Requested labels: CURRENCY were not found in the dataset. Found
labels: ACCOUNT, VEHICLE, LANGUAGE, FAC, TIME, STT_ERROR, FREQUENCY, CARDINAL,
DATE, GPE, PERCENT, BANK, ORG, PERSON, AMOUNT, ACTIVITY, ORDINAL.

Is this saying that just DATE, ORDINAL and CURRENCY are not present in the dataset? Not all of these labels are necessarily represented in the dataset. Is there a way to ignore labels that are not present?

This is what the recipe produces. Each annotator reviews the output and adjusts/removes/adds spans by label. I want to compare their answers based on the spans field.

Hi @cheyanneb!

Thanks for trying out IAA!

For the first of the issues reported:

Could you try running it without the multilabel argument? Like so:

prodigy metric.iaa.span dataset:ner_task_v2 -l ACCOUNT,ACTIVITY,AMOUNT,BANK,CARDINAL,CURRENCY,DATE,FAC,FREQUENCY,GPE,LANGUAGE,ORDINAL,ORG,PERCENT,PERSON,STT_ERROR,VEHICLE
Using 17 label(s): ACCOUNT, ACTIVITY, AMOUNT, BANK, CARDINAL, CURRENCY, DATE,
FAC, FREQUENCY, GPE, LANGUAGE, ORDINAL, ORG, PERCENT, PERSON, STT_ERROR, VEHICLE

The multilabel annotation type argument is, indeed, not recognized by the metric.iaa.span command. Only the metric.iaa.doc command takes the annotation type argument.

For the second issue reported:

prodigy metric.iaa.doc dataset:ner_task_v2 multilabel -l ACCOUNT,ACTIVITY,AMOUNT,BANK,CARDINAL,CURRENCY,DATE,FAC,FREQUENCY,GPE,LANGUAGE,ORDINAL,ORG,PERCENT,PERSON,STT_ERROR,VEHICLE

ℹ Using 2 annotator IDs: ner_task_v2-annotator1, ner_task_v2-annotator2

✘ Requested labels: FREQUENCY, GPE, CARDINAL, FAC, BANK, STT_ERROR,
LANGUAGE, ACCOUNT, ORG, VEHICLE, AMOUNT, CURRENCY, ACTIVITY, PERCENT, PERSON,
DATE, ORDINAL were not found in the dataset. Found labels: .

The reason the recipe reports "labels not found" is that you're applying document-level IAA to span-level data. A big part of the difference between metric.iaa.doc and metric.iaa.span is where each recipe looks for the annotations/labels. Document-level annotations are stored under the accept key, while token/span-level annotations live under the spans key. So what happened there is that the recipe looked for the requested labels under the accept key and didn't find them, because it's a NER dataset.
It's a bit tricky to validate the command requested by the user against the data in this case: you could have both accept and spans keys in your data and want to use only one of them for the calculation. We've implemented some checks, but to some extent it's on the user to make sure the command's assumptions match the data. We've tried to make this explicit in our guide here: Annotation Metrics · Prodigy · An annotation tool for AI, Machine Learning & NLP. Do let us know if it's not entirely clear - thank you :pray:
So to make this work, the command should look like the one I provided above.
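To illustrate the difference with simplified, made-up examples - a doc-level task stores the chosen labels under accept, while a span-level task stores them under spans:

doc-level (metric.iaa.doc):
{"text": "...", "accept": ["ACTIVITY"], "answer": "accept"}

span-level (metric.iaa.span):
{"text": "...", "spans": [{"start": 7, "end": 15, "label": "ACTIVITY"}], "answer": "accept"}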

Finally, to use only a subset of labels in the IAA computation, just exclude the ones you don't need from the list passed to the -l argument (it looks like it only complained about CURRENCY?), so:

prodigy metric.iaa.span dataset:ner_task_v2 -l ACCOUNT,ACTIVITY,AMOUNT,BANK,CARDINAL,DATE,FAC,FREQUENCY,GPE,LANGUAGE,ORDINAL,ORG,PERCENT,PERSON,STT_ERROR,VEHICLE

Let me know how it goes! And thanks again for the feedback :slight_smile:

I got a new error -- does this mean one annotator got a duplicate?

(prodigy) cheyannebaird@Cheyannes-MBP:~/posh/annotation-service$ prodigy metric.iaa.span dataset:ner_task_v2 -l ACCOUNT,ACTIVITY,AMOUNT,BANK,CARDINAL,DATE,FAC,FREQUENCY,GPE,LANGUAGE,ORDINAL,ORG,PERCENT,PERSON,STT_ERROR,VEHICLE
Using 16 label(s): ACCOUNT, ACTIVITY, AMOUNT, BANK, CARDINAL, DATE, FAC,
FREQUENCY, GPE, LANGUAGE, ORDINAL, ORG, PERCENT, PERSON, STT_ERROR, VEHICLE
ℹ Using 2 annotator IDs: ner_task_v2-linnea, ner_task_v2-katrina

✘ Cannot build reliability matrix: multiple annotations of a single
task `1049727108` from annotator `ner_task_v2-linnea`.

I looked at the data, and I do see two entries (I masked the content of the utterance). We haven't seen issues with duplicates in a while, so this is surprising. But it seems that duplicate entries for a single annotator throw this error? What about empty spans? We also have data where there are no labels -- and I do want to see if the annotators agree in this case.

{"text":"redacted content.","labels":[{"start":7,"end":15,"label":"ACTIVITY"},{"start":24,"end":32,"label":"ACCOUNT"},{"start":36,"end":40,"label":"TIME"}],"_input_hash":1049727108,"_task_hash":777148641,"tokens":[{"text":"Make","start":0,"end":4,"id":0,"ws":true},{"text":"a","start":5,"end":6,"id":1,"ws":true},{"text":"transfer","start":7,"end":15,"id":2,"ws":true},{"text":"from","start":16,"end":20,"id":3,"ws":true},{"text":"my","start":21,"end":23,"id":4,"ws":true},{"text":"checking","start":24,"end":32,"id":5,"ws":true},{"text":"to","start":33,"end":35,"id":6,"ws":true},{"text":"line","start":36,"end":40,"id":7,"ws":false},{"text":".","start":40,"end":41,"id":8,"ws":false}],"_view_id":"blocks","spans":[{"start":7,"end":15,"token_start":2,"token_end":2,"label":"ACTIVITY"},{"start":24,"end":32,"token_start":5,"token_end":5,"label":"ACCOUNT"},{"start":36,"end":40,"token_start":7,"token_end":7,"label":"ACCOUNT"}],"answer":"accept","_annotator_id":"ner_task_v2-linnea","_session_id":"ner_task_v2-linnea"}
{"text":"redacted content.","labels":[{"start":7,"end":15,"label":"ACTIVITY"},{"start":24,"end":32,"label":"ACCOUNT"},{"start":36,"end":40,"label":"TIME"}],"_input_hash":1049727108,"_task_hash":777148641,"tokens":[{"text":"Make","start":0,"end":4,"id":0,"ws":true},{"text":"a","start":5,"end":6,"id":1,"ws":true},{"text":"transfer","start":7,"end":15,"id":2,"ws":true},{"text":"from","start":16,"end":20,"id":3,"ws":true},{"text":"my","start":21,"end":23,"id":4,"ws":true},{"text":"checking","start":24,"end":32,"id":5,"ws":true},{"text":"to","start":33,"end":35,"id":6,"ws":true},{"text":"line","start":36,"end":40,"id":7,"ws":false},{"text":".","start":40,"end":41,"id":8,"ws":false}],"_view_id":"blocks","spans":[{"start":7,"end":15,"token_start":2,"token_end":2,"label":"ACTIVITY"},{"start":24,"end":32,"token_start":5,"token_end":5,"label":"ACCOUNT"},{"start":36,"end":40,"token_start":7,"token_end":7,"label":"ACCOUNT"}],"answer":"accept","_annotator_id":"ner_task_v2-linnea","_session_id":"ner_task_v2-linnea"}

One more question! Can I exclude certain annotators? Some of us just do little tests, and we're not really annotating...

(prodigy) cheyannebaird@Cheyannes-MBP:~/posh/annotation-service$ prodigy metric.iaa.span dataset:ner_task_v1_c -l ACCOUNT,ACTIVITY,AMOUNT,BANK,CARDINAL,DATE,FREQUENCY,GPE,ORDINAL,ORG,PERSON,STT_ERROR,VEHICLE
Using 13 label(s): ACCOUNT, ACTIVITY, AMOUNT, BANK, CARDINAL, DATE, FREQUENCY,
GPE, ORDINAL, ORG, PERSON, STT_ERROR, VEHICLE
ℹ Using 4 annotator IDs: ner_task_v1_c-cheyanne, ner_task_v1_c-linnea,
ner_task_v1_c-stefaan, ner_task_v1_c-katrina
ℹ Annotation Statistics

Hey @cheyanneb ,

Yes, you're right that the "Cannot build reliability matrix.." error is due to the same annotator (ner_task_v2-linnea) annotating the task 1049727108 more than once.
Do you have allow_work_stealing set to False in your config? I'm guessing the duplicates can come from work stealing or server restart hiccups. We should probably investigate that in another thread?
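That is, something like this at the top level of your prodigy.json (just a sketch - keep the rest of your existing settings as they are):

{
  "allow_work_stealing": false
}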

Re empty spans: empty spans are considered a valid annotation, of course. If one annotator didn't annotate anything at all and the other one did, these would be counted in the pairwise comparison as false positives or false negatives (depending on which way the calculation goes). The confusion matrix that's printed by the recipe captures these confusions in the NONE (i.e. no label) category.

  1. Is there a way to remove the duplicate from the database so I can get IAA metrics?
  2. Is there a way to exclude certain annotators -- for example, Cheyanne and Stefaan are admins just running tests, so I want to exclude them from metrics.
(prodigy) cheyannebaird@Cheyannes-MBP:~/posh/annotation-service$ prodigy metric.iaa.span dataset:ner_task_v1_c -l ACCOUNT,ACTIVITY,AMOUNT,BANK,CARDINAL,DATE,FREQUENCY,GPE,ORDINAL,ORG,PERSON,STT_ERROR,VEHICLE
Using 13 label(s): ACCOUNT, ACTIVITY, AMOUNT, BANK, CARDINAL, DATE, FREQUENCY,
GPE, ORDINAL, ORG, PERSON, STT_ERROR, VEHICLE
ℹ Using 4 annotator IDs: ner_task_v1_c-cheyanne, ner_task_v1_c-linnea,
ner_task_v1_c-stefaan, ner_task_v1_c-katrina
ℹ Annotation Statistics

I thought I had set allow_work_stealing to False, but my latest prodigy.json does not have it. I will add that.

Hey, yes, you can exclude annotators by specifying the list of annotators to be included via the --annotators parameter.
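For example, to keep only the two annotators who did the actual annotation work:

prodigy metric.iaa.span dataset:ner_task_v1_c -l ACCOUNT,ACTIVITY,AMOUNT,BANK,CARDINAL,DATE,FREQUENCY,GPE,ORDINAL,ORG,PERSON,STT_ERROR,VEHICLE --annotators ner_task_v1_c-linnea,ner_task_v1_c-katrina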


As for removing the duplicates, you'd have to do it "manually" - here's a script to speed things up:

from prodigy.components.stream import get_stream
from prodigy.util import ANNOTATOR_ID_ATTR, INPUT_HASH_ATTR
from prodigy.components.db import connect
from prodigy.types import StreamType

def filter_duplicates_by_the_same_annotator(
    stream: StreamType,
) -> StreamType:
    """Filter duplicates from an incoming stream.

    stream (iterable): The stream of examples.
    YIELDS (dict): The filtered annotation examples.
    """
    seen = set()
    for eg in stream:
        if (eg[INPUT_HASH_ATTR], eg[ANNOTATOR_ID_ATTR]) in seen:
            continue
        yield eg
        seen.add((eg[INPUT_HASH_ATTR], eg[ANNOTATOR_ID_ATTR]))

# read the annotated examples from the existing dataset and filter out the duplicates
stream = get_stream("dataset:with_duplicates")
stream = stream.apply(filter_duplicates_by_the_same_annotator)

# save the filtered examples to a new dataset
db = connect()
db.add_dataset("deduplicated")
db.add_examples(list(stream), datasets=["deduplicated"])

When removing duplicates, I get the following:

✘ Dataset: 'with_duplicates' not found in the currently configured
Prodigy Database: postgresql

When trying to remove annotators from the analysis, I get this:

(prodigy) cheyannebaird@Cheyannes-MBP:~/posh/annotation-service$ prodigy metric.iaa.span dataset:ner_task_v1_c -l ACCOUNT,ACTIVITY,AMOUNT,BANK,CARDINAL,DATE,FREQUENCY,GPE,ORDINAL,ORG,PERSON,STT_ERROR,VEHICLE --annotators ner_task_v1_c-linnea,ner_task_v1_c-katrina
Using 13 label(s): ACCOUNT, ACTIVITY, AMOUNT, BANK, CARDINAL, DATE, FREQUENCY,
GPE, ORDINAL, ORG, PERSON, STT_ERROR, VEHICLE
ℹ Using 2 annotator IDs: ner_task_v1_c-linnea,
ner_task_v1_c-katrina
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/prodigy/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/homebrew/Caskroom/miniforge/base/envs/prodigy/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/prodigy/lib/python3.9/site-packages/prodigy/__main__.py", line 50, in <module>
    main()
  File "/opt/homebrew/Caskroom/miniforge/base/envs/prodigy/lib/python3.9/site-packages/prodigy/__main__.py", line 44, in main
    controller = run_recipe(run_args)
  File "cython_src/prodigy/cli.pyx", line 117, in prodigy.cli.run_recipe
  File "cython_src/prodigy/cli.pyx", line 118, in prodigy.cli.run_recipe
  File "/opt/homebrew/Caskroom/miniforge/base/envs/prodigy/lib/python3.9/site-packages/prodigy/recipes/metric.py", line 203, in metric_iaa_span
    m.measure(stream)
  File "cython_src/prodigy/components/metrics/iaa_span.pyx", line 100, in prodigy.components.metrics.iaa_span.IaaSpan.measure
  File "cython_src/prodigy/components/metrics/_util.pyx", line 240, in prodigy.components.metrics._util._build_reliability_table
KeyError: 'ner_task_v1_c-cheyanne'

Hey!
For removing the duplicates: the names of the datasets in the script were just placeholders - you'd have to substitute them with the names of your own datasets. Sorry, I should have given you a heads-up about that! So with_duplicates should be substituted with the name of the dataset you want to deduplicate, and deduplicated should be substituted with the new name you want to give to the deduped dataset.
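For example, with your datasets, the last few lines would become something like this (with ner_task_v2_deduped as an arbitrary name for the new dataset - pick whatever you prefer):

stream = get_stream("dataset:ner_task_v2")
stream = stream.apply(filter_duplicates_by_the_same_annotator)
db = connect()
db.add_dataset("ner_task_v2_deduped")
db.add_examples(list(stream), datasets=["ner_task_v2_deduped"])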

The second issue looks like a problem on my end - let me take a look!

Thank you. I reran it, and it produced this:

(prodigy) cheyannebaird@Cheyannes-MBP:~/posh/annotation_duplicate_removal$ python duplicate_removal.py

(prodigy) cheyannebaird@Cheyannes-MBP:~/posh/annotation_duplicate_removal$ prodigy metric.iaa.span dataset:ner_task_v2 -l ACCOUNT,ACTIVITY,AMOUNT,BANK,CARDINAL,DATE,FAC,FREQUENCY,GPE,LANGUAGE,ORDINAL,ORG,PERCENT,PERSON,STT_ERROR,VEHICLE
Using 16 label(s): ACCOUNT, ACTIVITY, AMOUNT, BANK, CARDINAL, DATE, FAC,
FREQUENCY, GPE, LANGUAGE, ORDINAL, ORG, PERCENT, PERSON, STT_ERROR, VEHICLE
ℹ Using 2 annotator IDs: ner_task_v2-katrina, ner_task_v2-linnea

✘ Cannot build reliability matrix: multiple annotations of a single
task `1049727108` from annotator `ner_task_v2-linnea`.

Do I have to define INPUT_HASH_ATTR and ANNOTATOR_ID_ATTR? I tried leaving them as is (error above) and defining them myself (see the 2nd error below). The code I ran:

from prodigy.components.stream import get_stream
from prodigy.util import ANNOTATOR_ID_ATTR, INPUT_HASH_ATTR
from prodigy.components.db import connect
from prodigy.types import StreamType

def filter_duplicates_by_the_same_annotator(
    stream: StreamType,
) -> StreamType:
    """Filter duplicates from an incoming stream.

    stream (iterable): The stream of examples.
    YIELDS (dict): The filtered annotation examples.
    """
    seen = set()
    for eg in stream:
        if (eg[INPUT_HASH_ATTR], eg[ANNOTATOR_ID_ATTR]) in seen:
            continue
        yield eg
        seen.add((eg[INPUT_HASH_ATTR], eg[ANNOTATOR_ID_ATTR]))
stream = get_stream("dataset:ner_task_v2-linnea")
stream = stream.apply(filter_duplicates_by_the_same_annotator)
db = connect()
db.add_dataset("ner_task_v2-linnea_deduplicated")
db.add_examples(list(stream), datasets=["ner_task_v2-linnea_deduplicated"])
Traceback (most recent call last):
  File "/Users/cheyannebaird/posh/annotation_duplicate_removal/duplicate_removal.py", line 24, in <module>
    db.add_examples(list(stream), datasets=["ner_task_v2-linnea_deduplicated"])
  File "cython_src/prodigy/components/stream.pyx", line 166, in prodigy.components.stream.Stream.__next__
  File "cython_src/prodigy/components/stream.pyx", line 189, in prodigy.components.stream.Stream.is_empty
  File "cython_src/prodigy/components/stream.pyx", line 204, in prodigy.components.stream.Stream.peek
  File "cython_src/prodigy/components/stream.pyx", line 317, in prodigy.components.stream.Stream._get_from_iterator
  File "/Users/cheyannebaird/posh/annotation_duplicate_removal/duplicate_removal.py", line 16, in filter_duplicates_by_the_same_annotator
    if (eg[1049727108], eg[linnea]) in seen:
KeyError: 1049727108

Hey,
There's no need to define INPUT_HASH_ATTR and ANNOTATOR_ID_ATTR - these are sourced from Prodigy and are just constants that correspond to the _input_hash and _annotator_id keys in the annotation task, so that part should work as is.
The only thing you'd need to modify is the name of the dataset you're computing the metrics over: it should now be the deduplicated dataset generated by the deduplication script. The updated call to the metrics would be:

prodigy metric.iaa.span dataset:ner_task_v2-linnea_deduplicated -l ACCOUNT,ACTIVITY,AMOUNT,BANK,CARDINAL,DATE,FAC,FREQUENCY,GPE,LANGUAGE,ORDINAL,ORG,PERCENT,PERSON,STT_ERROR,VEHICLE
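As a side note, you can check what those constants resolve to - they're just the string keys used in the task dicts:

from prodigy.util import INPUT_HASH_ATTR, ANNOTATOR_ID_ATTR
# plain string constants, e.g. "_input_hash" and "_annotator_id"
print(INPUT_HASH_ATTR, ANNOTATOR_ID_ATTR)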

I also have an update on the filtering of the annotators: that is indeed a bug and we'll release the fix on Monday at the latest.


That worked. Thanks! I'll check back on the annotator filter.

(prodigy) cheyannebaird@Cheyannes-MBP:~/posh/annotation_duplicate_removal$ prodigy metric.iaa.span dataset:ner_task_v2_deduped -l ACCOUNT,ACTIVITY,AMOUNT,BANK,CARDINAL,DATE,FAC,FREQUENCY,GPE,LANGUAGE,ORDINAL,ORG,PERCENT,PERSON,STT_ERROR,VEHICLE
Using 16 label(s): ACCOUNT, ACTIVITY, AMOUNT, BANK, CARDINAL, DATE, FAC,
FREQUENCY, GPE, LANGUAGE, ORDINAL, ORG, PERCENT, PERSON, STT_ERROR, VEHICLE
ℹ Using 2 annotator IDs: ner_task_v2-linnea, ner_task_v2-katrina
/opt/homebrew/Caskroom/miniforge/base/envs/prodigy/lib/python3.9/site-packages/prodigy/recipes/metric.py:203: RuntimeWarning: invalid value encountered in divide
  m.measure(stream)
ℹ Annotation Statistics

Hey @cheyanneb ,

I've just released Prodigy 1.14.5 with the fix for the annotator filter. We'll announce it a bit later (so you won't see the changelog on the website yet), but you can already pip install it.
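If useful, the upgrade is the usual install command with your license key in place of XXXX:

pip install --upgrade prodigy -f https://XXXX@download.prodi.gy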
