Custom Recipe Using Different Tokens for Each NER Block


I am writing a custom recipe that uses multiple NER blocks, each relying on a different set of tokens. The task requires all blocks to be annotated at the same time; I can't easily break it into multiple tasks. However, I have only found how to make all the NER blocks rely on the same set of tokens, set in the "tokens" key of the example. Ideally, I would like each NER block to pull its tokens from a different key, specified via something like the "field_id" setting used by "text_input" blocks.

Here's a simplified version of my recipe:

from pathlib import Path
from typing import Union, Dict, Any

import srsly
from prodigy.core import recipe

@recipe(
    "ner_double",
    dataset=("Dataset to save annotations to", "positional", None, str),
    example_file=("JSONL file with examples", "positional", None, str),
)
def ner_double(
        dataset: str,
        example_file: Union[str, Path],
) -> Dict[str, Any]:

    def get_stream(examples):
        for example in examples:
            yield {
                "id": example["id"],
                "tokens_1": make_prodigy_tokens(example["key_1"]),
                "tokens_2": make_prodigy_tokens(example["key_2"]),
            }

    examples = srsly.read_jsonl(example_file)
    stream = get_stream(examples)

    blocks = [
        {
            "view_id": "ner_manual",
            "field_id": "tokens_1",  # Does not work.
            "labels": ["label_1"],
        },
        {
            "view_id": "ner_manual",
            "field_id": "tokens_2",  # Does not work.
            "labels": ["label_2"],
        },
    ]

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {
            "blocks": blocks,
        },
    }

class Tokenizer:
    def tokenize(self, text):
        return text.split()

def make_prodigy_tokens(
        string: str,
        tokenizer: Tokenizer = Tokenizer(),
) -> list:
    def wrap_token(i, token, last_end):
        start = last_end + string[last_end:].find(token)
        end = start + len(token)
        return {
            "id": i,
            "text": token,
            "start": start,
            "end": end,
        }

    token_list = []
    for i, token in enumerate(tokenizer.tokenize(string)):
        last_end = 0 if i == 0 else token_list[-1]["end"]
        token_dict = wrap_token(i, token, last_end)
        token_list.append(token_dict)
    return token_list

Is there any way to do this?

Thank you in advance!

Hi! At the moment, all blocks need to refer to a single underlying example, so you can only have one "text" and set of "tokens" and "spans". While you could overwrite the blocks for each individual example, the annotations you create would still be written to the same "spans", so that's not really an option.

Probably the easiest solution would be to just combine your texts into one single ner_manual interface and separate the texts with two line break tokens, or similar. You can always store the character offsets of the different texts in the underlying data, which makes it easy to extract the individual texts and annotations separately later on, if you need to. (Each span's offset is start - text_offset and end - text_offset.)
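For what it's worth, here's a minimal sketch of that combined-text approach, assuming a simple whitespace tokenizer like the one in the question. The key names (`key_1`, `key_2`, `text_2_offset`) are illustrative, and the `"disabled"` newline separator tokens are one way to get a visual break that can't be selected as part of a span:

```python
def make_tokens(text):
    # Simple whitespace tokenizer producing Prodigy-style token dicts.
    tokens, last_end = [], 0
    for i, tok in enumerate(text.split()):
        start = last_end + text[last_end:].find(tok)
        last_end = start + len(tok)
        tokens.append({"id": i, "text": tok, "start": start, "end": last_end})
    return tokens

def combine_example(example):
    separator = "\n\n"
    text_1, text_2 = example["key_1"], example["key_2"]
    combined = text_1 + separator + text_2
    text_2_offset = len(text_1) + len(separator)

    tokens_1 = make_tokens(text_1)
    tokens_2 = make_tokens(text_2)
    # Newline tokens render as a line break in ner_manual; "disabled"
    # keeps them from being included in an annotated span.
    sep_tokens = [
        {"text": "\n", "start": len(text_1), "end": len(text_1) + 1,
         "ws": False, "disabled": True},
        {"text": "\n", "start": len(text_1) + 1, "end": text_2_offset,
         "ws": False, "disabled": True},
    ]
    # Shift the second text's character offsets into the combined string.
    for tok in tokens_2:
        tok["start"] += text_2_offset
        tok["end"] += text_2_offset
    all_tokens = tokens_1 + sep_tokens + tokens_2
    # Re-number token ids after merging.
    for i, tok in enumerate(all_tokens):
        tok["id"] = i

    return {
        "id": example["id"],
        "text": combined,
        "tokens": all_tokens,
        # Stored so each span can be mapped back later: a span with
        # start >= text_2_offset belongs to text_2, at offsets
        # (start - text_2_offset, end - text_2_offset) within it.
        "text_2_offset": text_2_offset,
    }
```

Since `text_2_offset` travels with the example into the database, the annotated spans can be split back into per-text annotations afterwards using exactly the offset arithmetic described above.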

Thanks @ines for the quick reply! I was hoping not to combine the texts into a single NER interface, as they can get long, and it'd be nice to have a visual cue to separate them, but I will try it out!