Cannot build reliability matrix: multiple annotations from <ANNOTATOR>

Hi folks. Really excited to try out the new IAA metrics recipes, but unfortunately I'm banging my head up against some issues.

I tried to chase the issue down in the source, but the error being thrown is within the pre-compiled source I can't access.

    m = IaaDoc(annotation_type=annotation_type, labels=labels, annotators=annotators)
    stream = get_stream(source, loader=loader, rehash=False)
    except MetricError as e:, exits=True)

I'm using customized recipes to store the annotations for which I'd like the metrics, and unfortunately the dataset is sensitive. I could possibly try to give you a tiny subset with just a couple of examples annotated by just a pair of annotators, if that's necessary?

I'm using a bunch of different LLMs (local and APIs) as annotators to help me rapidly iterate on a labelling scheme. I updated my custom recipes to ensure the "view_id" was being set, and the local workflow is in a happy place for using my slightly-customized "review" recipe.

Is it helpful for me to share the annotation recipe? It's as follows -

from typing import Iterable, Union

from prodigy.cli import serve
from prodigy.components.filters import filter_seen_before
from prodigy.components.preprocess import make_textcat_suggestions, add_tokens, add_annot_name
from prodigy.components.stream import get_stream, Stream
from prodigy.core import recipe, Arg, connect
from prodigy.util import ANNOTATOR_ID_ATTR
from spacy import Language
from spacy.lang.en import English

from lib.complaints import get_complaint_labels
from lib.explicit_langchain_model import AvailableModels, component_name
from lib.utils import chunk_stream, datafile

def add_view_info(stream: Stream):
    for example in stream:
        config = dict.get(example, 'config', dict())
        config['choice_style'] = "multiple"

        example['_view_id'] = "choice"
        example['config'] = config

        yield example

@recipe(
    "textcat.explicit_langchain_model_annotate",
    dataset=Arg(help="Dataset to save annotations to"),
    source=Arg(help="Data to annotate (file path or '-' to read from standard input)"),
    cpp_filename=Arg(help="GGUF format model filename saved to LLM directory"),
    model_alias=Arg(help="Annotator alias on behalf of the model"),
)
def textcat_explicit_langchain_model_annotate(
    dataset: str,
    source: Union[str, Iterable[dict]],
    cpp_filename: str,
    model_alias: str,
) -> None:
    stream = get_stream(source, api=None, loader=None, rehash=True, input_key="text")
    component = "llm"
    nlp: Language = English()
    nlp.add_pipe(component,
                 config={'cpp_filename': cpp_filename})

    labels = get_complaint_labels()

    db = connect()
    if dataset in db.datasets:
        already_annotated = (
            ex
            for ex in db.iter_dataset_examples(dataset)
            if ex[ANNOTATOR_ID_ATTR] == model_alias
        )
        stream.apply(filter_seen_before, stream=stream, cache_stream=already_annotated)

    chunk_size = 50

    for chunk in chunk_stream(stream, chunk_size):
        chunked_stream = get_stream(chunk, api=None, loader=None, rehash=True, input_key="text")

        chunked_stream.apply(add_tokens, nlp=nlp, stream=chunked_stream)
        chunked_stream.apply(add_annot_name, name=model_alias)

        db.add_dataset(model_alias, session=True)
        db.add_examples(chunked_stream, [dataset, model_alias])

source_examples = datafile('complaints.jsonl')
save_to_dataset = 'complaints'
model_file = AvailableModels.speechless_13b()

cli_command = (f'prodigy textcat.explicit_langchain_model_annotate '
               f'{save_to_dataset} {source_examples} {model_file} {model_file}')


It's really only customized to make it friendlier to local models / output parsers, and to ensure it saves its progress more frequently than the bundled equivalent recipe.

The error was raised with a brand new Prodigy database containing a couple of hundred model-annotated examples from 2 different local LLMs, so it's pretty fast for me to test tweaks. It's more that I'm not sure how much I can do on my own without a clearer idea of what the recipe is trying to achieve when it throws this error.

Thanks for the kickass product, by the way. Prodigy is awesome, and I'm not really blocked by this - I just wanted to see if the metrics can help me drive the labelling scheme and model improvements.

Hi @dansowter

Thanks so much for all the kind words and for the extensive debugging info - we really appreciate that :slight_smile:

The error you are seeing is meant to show up whenever a single annotator has annotated the same task more than once (I realize now that the message must be improved).
As a reminder, for the purpose of IAA recipes, tasks are distinguished by the _input_hash attribute, so if your dataset contains multiple examples with the same combination of _input_hash and _annotator_id, you'd have to deduplicate before computing the metrics.

Arguably, we could modify the feature to take the latest annotation in these cases, but repeated annotations are usually unintended unless you are measuring intra-annotator agreement, so we decided to raise an error instead.
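
If keeping only the most recent annotation is what you want, the deduplication could look roughly like the sketch below. It works on plain dictionaries rather than any Prodigy API. The function name keep_latest and the sample data are hypothetical, and it assumes each saved example carries a numeric _timestamp key; examples without one are treated as oldest.

```python
from typing import Dict, Iterable, List, Tuple

# Key names as they appear on saved examples. "_timestamp" is an assumption:
# examples that lack it are treated as having timestamp 0.
INPUT_HASH = "_input_hash"
ANNOTATOR_ID = "_annotator_id"
TIMESTAMP = "_timestamp"


def keep_latest(examples: Iterable[dict]) -> List[dict]:
    """Keep one annotation per (_input_hash, _annotator_id) pair: the newest."""
    latest: Dict[Tuple[int, str], dict] = {}
    for eg in examples:
        key = (eg[INPUT_HASH], eg[ANNOTATOR_ID])
        current = latest.get(key)
        # ">=" means that on a timestamp tie, the later item in the input wins.
        if current is None or eg.get(TIMESTAMP, 0) >= current.get(TIMESTAMP, 0):
            latest[key] = eg
    return list(latest.values())


# Hypothetical data: "model-a" annotated input 1 twice.
sample = [
    {"_input_hash": 1, "_annotator_id": "model-a", "_timestamp": 10, "accept": ["X"]},
    {"_input_hash": 1, "_annotator_id": "model-a", "_timestamp": 20, "accept": ["Y"]},
    {"_input_hash": 1, "_annotator_id": "model-b", "_timestamp": 15, "accept": ["X"]},
]
deduplicated = keep_latest(sample)
```

The deduplicated examples would then need to be saved to a fresh dataset (or file) and the metrics computed on that copy, so the original annotations stay untouched.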

The first thing I would check is whether that's indeed the case for your input dataset (in theory it should at least be the case for the annotator_id provided in the error message).

To speed things up, here's a script that can help with checking it:

from prodigy.components.stream import get_stream
from prodigy.util import ANNOTATOR_ID_ATTR, INPUT_HASH_ATTR
from itertools import groupby


stream = get_stream(source_path)

examples = list(stream)

grouped_by_input = groupby(sorted(examples, key=lambda x: x[INPUT_HASH_ATTR]),lambda x: x[INPUT_HASH_ATTR])
for input_hash, input_annotations in grouped_by_input:
    input_annotations = list(input_annotations)
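    try:
        # Every annotation of this input should come from a distinct annotator
        assert len(set(a[ANNOTATOR_ID_ATTR] for a in input_annotations)) == len(input_annotations)
    except AssertionError:
        grouped_by_annotator_id = groupby(sorted(input_annotations, key=lambda x: x[ANNOTATOR_ID_ATTR]),lambda x: x[ANNOTATOR_ID_ATTR])
        for annotator_id, anns in grouped_by_annotator_id:
            # Only report annotators with more than one annotation for this input
            if len(list(anns)) > 1:
                print(annotator_id, input_hash)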

If the duplicates are there, and are not intended, we should look into where they come from, but let's first confirm that it is actually the case.

Thanks so much. You're absolutely right - I've gone back and removed the duplicate text examples which led to the issue, and it's now resolved.

I can see the IAA metrics for each label now, which is brilliant, thank you.
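
For anyone hitting the same error: in the recipe above the stream is hashed with input_key="text", so two source records with identical text end up with the same _input_hash, and a model that annotates both produces two annotations for one task. A minimal sketch of dropping repeated texts before annotation, using plain dictionaries; the helper name unique_by_text and the sample rows are hypothetical, and matching is exact (no whitespace or case normalization):

```python
from typing import Iterable, Iterator


def unique_by_text(records: Iterable[dict], key: str = "text") -> Iterator[dict]:
    """Yield each record whose `key` value has not been seen before."""
    seen = set()
    for record in records:
        value = record[key]
        if value in seen:
            continue  # exact duplicate of an earlier text: skip it
        seen.add(value)
        yield record


# Hypothetical source rows: the first and third share the same text.
rows = [
    {"text": "Late delivery", "meta": {"id": 1}},
    {"text": "Wrong item", "meta": {"id": 2}},
    {"text": "Late delivery", "meta": {"id": 3}},
]
unique_rows = list(unique_by_text(rows))
```

The first occurrence of each text is kept and the original order is preserved.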
