Custom Recipe Using Different Tokens for Each NER Block


I am writing a custom recipe that uses multiple NER blocks, each relying on a different set of tokens. The task requires all blocks to be annotated at the same time; I can't easily break it into multiple tasks. However, I have only found how to make all the NER blocks rely on the same set of tokens, set in the "tokens" key of the example. Ideally, I would like each NER block to pull its tokens from a different key, specified via something like the "field_id" setting used by "text_input" blocks.

Here's a simplified version of my recipe:

from pathlib import Path
from typing import Union, Dict, Any

import srsly
from prodigy.core import recipe

@recipe(
    "ner_double",
    dataset=("Dataset to save annotations to", "positional", None, str),
    example_file=("JSONL file with examples", "positional", None, str),
)
def ner_double(
        dataset: str,
        example_file: Union[str, Path],
) -> Dict[str, Any]:

    def get_stream(examples):
        for example in examples:
            yield {
                "id": example["id"],
                "tokens_1": make_prodigy_tokens(example["key_1"]),
                "tokens_2": make_prodigy_tokens(example["key_2"]),
            }

    examples = srsly.read_jsonl(example_file)
    stream = get_stream(examples)

    blocks = [
        {
            "view_id": "ner_manual",
            "field_id": "tokens_1",  # Does not work.
            "labels": ["label_1"],
        },
        {
            "view_id": "ner_manual",
            "field_id": "tokens_2",  # Does not work.
            "labels": ["label_2"],
        },
    ]

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {
            "blocks": blocks,
        },
    }

class Tokenizer:
    def tokenize(self, text):
        return text.split()

def make_prodigy_tokens(
        string: str,
        tokenizer: Tokenizer = Tokenizer(),
) -> list:
    def wrap_token(i, token, last_end):
        start = last_end + string[last_end:].find(token)
        end = start + len(token)
        return {
            "id": i,
            "text": token,
            "start": start,
            "end": end,
        }

    token_list = []
    for i, token in enumerate(tokenizer.tokenize(string)):
        last_end = 0 if i == 0 else token_list[-1]["end"]
        token_dict = wrap_token(i, token, last_end)
        token_list.append(token_dict)
    return token_list

Is there any way to do this?

Thank you in advance!

Hi! At the moment, all blocks need to refer to a single underlying example, so you can only have one "text" and set of "tokens" and "spans". While you could overwrite the blocks for each individual example, the annotations you create would still be written to the same "spans", so that's not really an option.

Probably the easiest solution would be to just combine your texts into one single ner_manual interface and separate the texts with two line break tokens, or similar. You can always store the character offsets of the different texts in the underlying data, which makes it easy to extract the individual texts and annotations separately later on, if you need to. (Each span's offset is start - text_offset and end - text_offset.)
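For what it's worth, here's a minimal sketch of that combined-text approach, assuming a simple whitespace tokenizer like the one in the question. The key names (`key_1`, `key_2`, `text_2_offset`) are illustrative, and the `"disabled"` newline separator tokens are one way to get a visual break that can't be selected as part of a span:

```python
def make_tokens(text):
    # Simple whitespace tokenizer producing Prodigy-style token dicts.
    tokens, last_end = [], 0
    for i, tok in enumerate(text.split()):
        start = last_end + text[last_end:].find(tok)
        last_end = start + len(tok)
        tokens.append({"id": i, "text": tok, "start": start, "end": last_end})
    return tokens

def combine_example(example):
    separator = "\n\n"
    text_1, text_2 = example["key_1"], example["key_2"]
    combined = text_1 + separator + text_2
    text_2_offset = len(text_1) + len(separator)

    tokens_1 = make_tokens(text_1)
    tokens_2 = make_tokens(text_2)
    # Newline tokens render as a line break in ner_manual; "disabled"
    # keeps them from being included in an annotated span.
    sep_tokens = [
        {"text": "\n", "start": len(text_1), "end": len(text_1) + 1,
         "ws": False, "disabled": True},
        {"text": "\n", "start": len(text_1) + 1, "end": text_2_offset,
         "ws": False, "disabled": True},
    ]
    # Shift the second text's character offsets into the combined string.
    for tok in tokens_2:
        tok["start"] += text_2_offset
        tok["end"] += text_2_offset
    all_tokens = tokens_1 + sep_tokens + tokens_2
    # Re-number token ids after merging.
    for i, tok in enumerate(all_tokens):
        tok["id"] = i

    return {
        "id": example["id"],
        "text": combined,
        "tokens": all_tokens,
        # Stored so each span can be mapped back later: a span with
        # start >= text_2_offset belongs to text_2, at offsets
        # (start - text_2_offset, end - text_2_offset) within it.
        "text_2_offset": text_2_offset,
    }
```

Since `text_2_offset` travels with the example into the database, the annotated spans can be split back into per-text annotations afterwards using exactly the offset arithmetic described above.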

Thanks @ines for the quick reply! I was hoping not to combine the texts into a single NER interface, as they can get long, and it'd be nice to have a visual cue to separate them, but I will try it out!