Store additional information about named entities

Hello.

I have entities like dates, money, and time periods. I'd like to record normalized values for these entities during manual labeling. For example: for "10th day of November, 2019" I want to record "2019-11-10", for "ten (10) days" – "days-10", etc. Can you describe what I should do to implement this scenario?

Hi! You could use a blocks UI with a text input field for that – however, I'd suggest doing the annotation in two steps: first, focus on highlighting the entities; then stream the annotated entities back in with a text_input field and add the normalized form.

A big advantage of this approach is that you can pre-sort the entities and combine identical spans. If a certain span occurs multiple times, you shouldn't have to normalize it every time you come across it (and introduce more potential for human error that way). It also lets you semi-automate the process and take advantage of existing solutions like the dateutil library: you can pre-populate the value of the text_input field, so you only have to type things manually if the result is wrong or if the entity can't be parsed automatically.
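For instance, here's a rough sketch of what that auto-normalization could look like – the normalize helper and the DATE label are just made up for illustration:

from dateutil import parser as date_parser

def normalize(text, label):
    # Hypothetical helper: try to auto-normalize date entities with dateutil
    # and leave everything else blank for the annotator to fill in manually
    if label == "DATE":
        try:
            return date_parser.parse(text, fuzzy=True).date().isoformat()
        except (ValueError, OverflowError):
            pass  # couldn't parse automatically – fall back to manual input
    return ""

You could then set the task's "user_input" value (or your custom field ID) to normalize(text, label) when yielding it, so the field comes pre-filled.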

Here's a minimal example to queue up existing NER annotations again and only ask about each entity once. Your recipe could then use two blocks: text and text_input.

from prodigy.components.db import connect
from collections import defaultdict

db = connect()
examples = db.get_dataset("your_ner_dataset")

def get_stream():
    # Collect and merge spans so you're only doing each text/label combination once
    merged = defaultdict(list)
    for eg in examples:
        for i, span in enumerate(eg.get("spans", [])):
            text = eg["text"][span["start"]:span["end"]]
            # Store index of span and task hash, so you know it's the n-th entity in task X
            merged[(text, span["label"])].append((eg["_task_hash"], i))

    for (text, label), refs in merged.items():
        # optionally pre-populate "user_input" (or custom field ID)
        # with auto-normalized text
        yield {
            "text": f"{text} ({label})",
            "task_refs": refs
        }
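And a rough sketch of how the recipe itself could be put together – the recipe name here is a placeholder, and the field label is up to you:

import prodigy

@prodigy.recipe(
    "normalize-ents",
    dataset=("Dataset to save annotations to", "positional", None, str),
)
def normalize_ents(dataset):
    # "text" shows the merged entity, "text_input" collects the normalized value
    blocks = [
        {"view_id": "text"},
        {"view_id": "text_input", "field_id": "user_input",
         "field_label": "Normalized value"},
    ]
    return {
        "dataset": dataset,
        "stream": get_stream(),
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }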

Thanks for your reply!

One more question: if I have more than one instance of a named entity with the same text and label in one task, only the first entity is fed to the UI. So my orig_hash and orig_span will point only to the first entity in the merged[my_ent_1] list. How can I link the other instances to the same normalized value?

Ah, sorry, just realised an error in my code example: merged maps text/label pairs to a list of references, tuples of (task_hash, span_idx). So you can just store that list with each task – edited the example to reflect that.

(Maybe you also want to solve this differently – for instance, you could assign unique IDs to all spans and then use those to relate the annotations back.)
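For instance, something like this before annotation starts (span_id here is just a made-up key):

import uuid

for eg in examples:
    for span in eg.get("spans", []):
        # Give every span a stable unique ID so annotations can be
        # related back later, even with repeated text/label pairs
        span["span_id"] = str(uuid.uuid4())

Your merged tasks could then store span IDs instead of (task_hash, span_idx) tuples.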

Is there a way to run through all my normalized values and add them as an additional field to the corresponding span (e.g. when I click Accept or save the results)? Right now I have the results of two recipes (ner.manual and normalization) in one JSONL file:

{ner.manual result 1}
...
{ner.manual result n}
{normalization result 1}
...
{normalization result m}

This is inconvenient, because I have to read the whole file and assign the normalized values to the spans before I can feed the data to my model. Maybe I can hook this job into one of Prodigy's events?

It probably makes the most sense to do this in Python, so you could implement this as an update callback of your custom recipe, which is called with the answers whenever a new batch is sent back from the app.

Your recipe has to load the original data with the spans into memory anyway in order to create the stream to normalize, so you might as well create a combined dataset as you annotate. A simple way would be to keep a dict of annotations keyed by _task_hash, so you can easily find the respective example(s). At the end, you can save it all out to a file or a new dataset (datasets are append-only, so you want to use a new dataset – this also makes things easier if something goes wrong). You probably also want to save out backups, so you don't lose any data if your server dies.
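Roughly sketched, assuming the tasks keep the task_refs list from the earlier example and the text_input writes to the default user_input field (the "normalized" key on the span is just a made-up name):

examples_by_hash = {eg["_task_hash"]: eg for eg in examples}

def update(answers):
    # Prodigy calls this with each batch of answers sent back from the app
    for answer in answers:
        if answer["answer"] != "accept":
            continue
        normalized = answer.get("user_input")
        # Write the normalized value back to every span this task refers to
        for task_hash, span_idx in answer["task_refs"]:
            eg = examples_by_hash.get(task_hash)
            if eg is not None:
                eg["spans"][span_idx]["normalized"] = normalized

You can then return "update": update as part of the recipe's components dict, and save examples_by_hash.values() out to a file or a fresh dataset when you're done.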