Confused about the structure of spans in NER examples

Esteemed Prodigy experts,

I have been playing with the get_dataset_examples functionality and I am getting two different formats of examples back. Could someone help clear up why this happens?

My code is very simple: I hand over the name of a dataset and collect the examples in a list.

from prodigy.components.db import connect

db = connect()

# Set up the result list
lst_examples = []

examples = db.get_dataset_examples(dataset_name)
for example in examples:
    lst_examples.append(example)

db.close()

This all works as expected, but when I inspect the spans in the examples, I get two different formats of spans. Two of them contain the actual text of the span, a source, and an input hash; the third one does not.
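For illustration, the two shapes look roughly like this (reconstructed from my data, so the exact values and the label are placeholders):

Model-created span:

{"start": 14, "end": 17, "token_start": 4, "token_end": 4, "label": "PERSON", "text": "Kai", "source": "en_core_web_sm", "input_hash": 123456789}

Manually created span:

{"start": 14, "end": 17, "token_start": 4, "token_end": 4, "label": "PERSON"}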

I am using ner.correct in this case. My hunch is that the "longer" format is created by the prediction model, while the "shorter" format is created by a manual action. Could that be true, and if so, why?

I've tried peeking into the DB itself, but I don't think that will clear this up.

Many thanks for any help!
Kai

Oh, in case this is helpful or needed: I am on the latest Prodigy, Python 3.11, and Windows.

Hi @akimotode and welcome to the forum :wave:

You're definitely on the right track! The annotations with extra keys come from the model. The source field gives you the name of the NLP pipeline that added the annotations. Annotations you add manually via the Prodigy UI won't have these extra keys.

The manual annotations contain just the minimum information required to train the model. The extra information in the case of model-annotated spans comes "for free" from the spaCy pipeline, concretely from the Span attributes of the recognized entities, so the recipe just adds it.
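To give you an idea, a model-backed recipe builds its span dicts from the Doc's entities roughly like this (a sketch, not Prodigy's actual source; the function name and arguments are made up for illustration):

def make_spans(doc, input_hash, source):
    # Each recognized entity carries its text and character offsets for free
    return [
        {
            "start": ent.start_char,
            "end": ent.end_char,
            "label": ent.label_,
            "text": ent.text,        # straight from the spaCy Span
            "source": source,        # e.g. the name of the NLP pipeline
            "input_hash": input_hash,
        }
        for ent in doc.ents
    ]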

Thank you Magda for the very quick answer! At least now I know I haven't managed to shoot myself in the foot somehow...

If I may, a quick feature request: I find it really useful that the model adds the "text" information. I can get this via the tokens anyway, but it is a bit tricky (and it is not easily visible during debugging). If there were a simple way to add "text" to the manual annotations, it would be helpful for my use case. It's definitely only a convenience change; I can already do what I want to do.

My use case: I want to iterate over all examples and output a report which spans have been tagged with which label across a group of datasets. Sort of "automated meta" quality control across datasets and raters...
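Something like this minimal sketch is what I have in mind (the dataset names are placeholders):

from collections import Counter
from prodigy.components.db import connect

db = connect()
label_counts = Counter()
for dataset_name in ["dataset_a", "dataset_b"]:  # placeholder names
    for example in db.get_dataset_examples(dataset_name):
        for span in example.get("spans", []):
            text = example["text"][span["start"]:span["end"]]
            label_counts[(span["label"], text)] += 1

# Print a simple tab-separated report
for (label, text), count in label_counts.most_common():
    print(f"{label}\t{text}\t{count}")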

Thanks,
Kai

Hi @akimotode,

Thanks for the suggestion! The thinking here is that Prodigy should not really modify the annotations coming from the user as they can contain custom fields and values.

The easiest way to add text to the manual annotations would be either to run a simple postprocessing script on the annotated dataset (which is probably what you're doing already) or to create a custom NER recipe and implement a span-processing function via the before_db callback.
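For the first option, a minimal postprocessing sketch could look like this. The dataset name is a placeholder, it uses Prodigy's documented Database methods, and since it overwrites the dataset you should back up your data first:

from prodigy.components.db import connect

db = connect()
dataset = "my_dataset"  # placeholder name
examples = db.get_dataset_examples(dataset)
for example in examples:
    for span in example.get("spans", []):
        # Slice the covered text out of the example text
        span["text"] = example["text"][span["start"]:span["end"]]

# Replace the dataset contents with the enriched examples
db.drop_dataset(dataset)
db.add_dataset(dataset)
db.add_examples(examples, datasets=[dataset])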
For the second option, here's a simplified version of such a recipe:

from pathlib import Path
from typing import Any, Dict

import prodigy
import spacy
from prodigy.components.stream import get_stream
from prodigy.components.preprocess import add_tokens


def before_db(examples):
    # Add the covered text to each span before the examples are saved
    for example in examples:
        for span in example.get("spans", []):
            start = span["start"]
            end = span["end"]
            span["text"] = example["text"][start:end]
    return examples

@prodigy.recipe(
    "ner.with_text",
    dataset=("Dataset to save annotations to", "positional", None, str),
    lang=("language for the tokeniser", "positional", None, str),
    source=("Data to annotate", "positional", None, str),
    labels=("comma separated sequence of labels", "option", "l", str),
)
def ner_with_text(dataset: str, lang: str, source: Path, labels: str) -> Dict[str, Any]:
    nlp = spacy.blank(lang)
    labels = labels.split(",")
    stream = get_stream(source, rehash=True, input_key="text", dedup=True)
    stream = stream.apply(add_tokens, nlp=nlp, stream=stream)

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",
        "before_db": before_db,
        "config": {
            "lang": lang,
            "labels": labels,
            "batch_size": 1
        },
    }
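Assuming you save it as ner_with_text.py (the file name is just an example), you can start the server with:

prodigy ner.with_text my_dataset en ./examples.jsonl -l name -F ner_with_text.py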

Now if you test it with:

{"text": "hi my name is Kai"}

You'll see that the DB record for this example will contain the text field:

"spans": [
    {
      "start": 14,
      "end": 17,
      "token_start": 4,
      "token_end": 4,
      "label": "name",
      "text": "Kai"
    }

Thank you Magda,

This is an excellent answer!
