Confused about the structure of spans in NER examples

Esteemed Prodigy experts,

I have been playing with the get_dataset_examples functionality and I am getting two different formats of examples back. Could someone help clear up why this happens?

My code is very simple: I hand over the name of a dataset and collect the examples in a list.

from prodigy.components.db import connect

db = connect()

# Set up the result list
lst_examples = []

examples = db.get_dataset_examples(dataset_name)
for example in examples:
    lst_examples.append(example)

db.close()

This all works as expected, but when I inspect the spans in the examples, I get two different formats of spans. Two of them contain the actual text of the span, a source, and an input hash; the third one does not.
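For illustration, the two shapes look roughly like this (reconstructed from my data, so the exact values and the label are placeholders):

Model-created span:

{"start": 14, "end": 17, "token_start": 4, "token_end": 4, "label": "PERSON", "text": "Kai", "source": "en_core_web_sm", "input_hash": 123456789}

Manually created span:

{"start": 14, "end": 17, "token_start": 4, "token_end": 4, "label": "PERSON"}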

I am using ner.correct in this case. My hunch is that the "longer" format is created by the prediction model, while the "shorter" format is created by a manual action. Could that be true, and if so, why?

I've tried peeking into the DB itself, but I don't think that will clear this up.

Many thanks for any help!
Kai

Oh, in case this is helpful or needed: I am on the latest Prodigy, Python 3.11, and Windows.

Hi @akimotode and welcome to the forum :wave:

You're definitely on the right track! The annotations with extra keys come from the model. The source field gives you the name of the NLP pipeline that added the annotations. Annotations you add manually via the Prodigy UI won't have these extra keys.

The manual annotations contain just the minimum information required to train the model. The extra information in the case of model-annotated spans comes "for free" from the spaCy pipeline, concretely from the Span attributes of the recognized entities, so the recipe just adds it.
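To give you an idea, a model-backed recipe builds its span dicts from the Doc's entities roughly like this (a sketch, not Prodigy's actual source; the function name and arguments are made up for illustration):

def make_spans(doc, input_hash, source):
    # Each recognized entity carries its text and character offsets for free
    return [
        {
            "start": ent.start_char,
            "end": ent.end_char,
            "label": ent.label_,
            "text": ent.text,        # straight from the spaCy Span
            "source": source,        # e.g. the name of the NLP pipeline
            "input_hash": input_hash,
        }
        for ent in doc.ents
    ]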

Thank you Magda for the very quick answer! At least now I know I haven't managed to shoot myself in the foot somehow...

If I may, a quick feature request: I find it really useful that the model adds the "text" information. I can get this via the tokens anyway, but it is a bit tricky (and it is not easily visible during debugging). If there were a simple way to add "text" to the manual annotations, it would be helpful for my use case. It's definitely only a convenience change; I can already do what I want to do.

My use case: I want to iterate over all examples and output a report which spans have been tagged with which label across a group of datasets. Sort of "automated meta" quality control across datasets and raters...
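Something like this minimal sketch is what I have in mind (the dataset names are placeholders):

from collections import Counter
from prodigy.components.db import connect

db = connect()
label_counts = Counter()
for dataset_name in ["dataset_a", "dataset_b"]:  # placeholder names
    for example in db.get_dataset_examples(dataset_name):
        for span in example.get("spans", []):
            text = example["text"][span["start"]:span["end"]]
            label_counts[(span["label"], text)] += 1

# Print a simple tab-separated report
for (label, text), count in label_counts.most_common():
    print(f"{label}\t{text}\t{count}")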

Thanks,
Kai

Hi @akimotode,

Thanks for the suggestion! The thinking here is that Prodigy should not really modify the annotations coming from the user as they can contain custom fields and values.

The easiest way to add text to the manual annotations would be either to run a simple postprocessing script on the annotated dataset (which is probably what you're doing already) or to create a custom NER recipe and implement a span-processing function via the before_db callback.
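For the first option, a minimal postprocessing sketch could look like this. The dataset name is a placeholder, it uses Prodigy's documented Database methods, and since it overwrites the dataset you should back up your data first:

from prodigy.components.db import connect

db = connect()
dataset = "my_dataset"  # placeholder name
examples = db.get_dataset_examples(dataset)
for example in examples:
    for span in example.get("spans", []):
        # Slice the covered text out of the example text
        span["text"] = example["text"][span["start"]:span["end"]]

# Replace the dataset contents with the enriched examples
db.drop_dataset(dataset)
db.add_dataset(dataset)
db.add_examples(examples, datasets=[dataset])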
For the second option, here's a simplified version of such a recipe:

from pathlib import Path
from typing import Any, Dict

import prodigy
import spacy
from prodigy.components.stream import get_stream
from prodigy.components.preprocess import add_tokens


def before_db(examples):
    # Add the covered text to each span before the examples are saved
    for example in examples:
        for span in example.get("spans", []):
            start = span["start"]
            end = span["end"]
            span["text"] = example["text"][start:end]
    return examples

@prodigy.recipe(
    "ner.with_text",
    dataset=("Dataset to save annotations to", "positional", None, str),
    lang=("language for the tokeniser", "positional", None, str),
    source=("Data to annotate", "positional", None, str),
    labels=("comma separated sequence of labels", "option", "l", str),
)
def ner_with_text(dataset: str, lang: str, source: Path, labels: str) -> Dict[str, Any]:
    nlp = spacy.blank(lang)
    labels = labels.split(",")
    stream = get_stream(source, rehash=True, input_key="text", dedup=True)
    stream = stream.apply(add_tokens, nlp=nlp, stream=stream)

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",
        "before_db": before_db,
        "config": {
            "lang": lang,
            "labels": labels,
            "batch_size": 1
        },
    }
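Assuming you save it as ner_with_text.py (the file name is just an example), you can start the server with:

prodigy ner.with_text my_dataset en ./examples.jsonl -l name -F ner_with_text.py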

Now if you test it with:

{"text": "hi my name is Kai"}

You'll see that the DB record for this example will contain the text field:

"spans": [
    {
      "start": 14,
      "end": 17,
      "token_start": 4,
      "token_end": 4,
      "label": "name",
      "text": "Kai"
    }

Thank you Magda,

This is an excellent answer!
