Creating a custom recipe to integrate a bespoke model

I want to use ner.match, but with my custom NER model.
My model takes in a text and outputs the span of the recognized term and the label associated with it.
I want the ner.match recipe to show the highlighted term along with its label from my model.
How do I wrap the ner.match recipe to achieve this? And if this can't be done by simply wrapping ner.match, how do I create a custom recipe that achieves the same result?
Thank you!

Hi! I think you might find it easier to write your own recipe, since it'll make it easier to see what's going on, and the logic itself isn't that complicated.

Here's a simplified version of the ner.match recipe with some comments that explain what's going on:

The recipe above uses spaCy and the pattern matcher to add the pattern matches to your stream. Instead, you can write a function that takes a text and returns the start and end character offsets and the label of each span. For each span, you can then yield a dictionary with the "text" and "spans". Here's an example:

def get_stream(stream):
    for eg in stream:
        spans_from_model = get_spans_from_your_model(eg["text"])
        for start_char, end_char, label in spans_from_model:
            # Let's assume your function returns a tuple of the start and end
            # offset and the label. For each span, we now create a new task
            # and send it out
            spans = [{"start": start_char, "end": end_char, "label": label}]
            yield {"text": eg["text"], "spans": spans}
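To make the expected return format concrete, here's a minimal sketch with a stand-in for get_spans_from_your_model. The term lookup, the labels and the example sentence are all made up for illustration. Your real function would call your model instead, as long as it returns (start_char, end_char, label) tuples.

```python
import re

# Hypothetical stand-in for a real model: a tiny lookup of terms and labels.
TERMS = {"aspirin": "DRUG", "ibuprofen": "DRUG", "headache": "SYMPTOM"}
TERM_RE = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in TERMS) + r")\b",
    re.IGNORECASE,
)

def get_spans_from_your_model(text):
    """Return (start_char, end_char, label) tuples for one text."""
    return [
        (match.start(), match.end(), TERMS[match.group(0).lower()])
        for match in TERM_RE.finditer(text)
    ]

def get_stream(stream):
    # Same logic as above: one annotation task per predicted span.
    for eg in stream:
        for start_char, end_char, label in get_spans_from_your_model(eg["text"]):
            spans = [{"start": start_char, "end": end_char, "label": label}]
            yield {"text": eg["text"], "spans": spans}

examples = [{"text": "Aspirin can relieve a headache."}]
tasks = list(get_stream(examples))
for task in tasks:
    print(task)
```

The one text produces two tasks here, one per span, which is the behaviour you want for accept/reject annotation.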

In your recipe, you can then load your data (however you want) and create your stream:

stream = JSONL(source)
stream = get_stream(stream)
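If you'd rather not use the built-in loader, the stream can be any generator of dictionaries with a "text" key. Here's a rough sketch of a plain-Python reader, assuming one JSON object per line. The read_jsonl name and the temporary demo file are only for illustration.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Write a small demo file so the sketch runs on its own.
rows = [{"text": "First example."}, {"text": "Second example."}]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf8"
) as tmp:
    for row in rows:
        tmp.write(json.dumps(row) + "\n")

examples = list(read_jsonl(tmp.name))
os.remove(tmp.name)
print(examples)
```

In your recipe, you could then write stream = get_stream(read_jsonl(source)).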

So the most basic version of your recipe could look like this (plus the get_stream function of course):

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe('custom.ner.match',
    dataset=("The dataset to use", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
)
def custom_ner_match(dataset, source):
    stream = JSONL(source)
    stream = get_stream(stream)

    return {
        'view_id': 'ner',       # Annotation interface to use
        'dataset': dataset,     # Name of dataset to save annotations
        'stream': stream,       # Incoming stream of examples
    }

Thank you so much!
Worked like a charm.
Follow-up question:
Is it possible to create a custom recipe that combines the functionality of ner.match and ner.manual?

Yay, glad to hear it worked!

Sure 🙂 You should only have to change a few small things:

  • use the ner_manual view ID instead of just ner
  • add a "config": {"labels": [...]} to the components returned by your recipe that defines the full label scheme you can select
  • make sure each incoming example is tokenized and has a "tokens" property (to allow quick highlighting that "snaps" to token boundaries)
  • only send out one example per text (instead of one example per span) because you probably want to see all matches at once, right?

For tokenization, Prodigy has a built-in add_tokens helper. You can also see an example of it in the prodigy-recipes repo. The function takes a spaCy nlp object for tokenization and the stream, and will add a "tokens" property to each example.

import spacy
from prodigy.components.preprocess import add_tokens

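# At the end of your recipe. spacy_model is the name of the spaCy model to
# tokenize with, e.g. passed in as a recipe argument.
nlp = spacy.load(spacy_model)
stream = add_tokens(nlp, stream)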

One thing that's important to note here: the tokenization used here should match the tokenization of your custom model and allow the entities to be valid token spans. So if your model uses a custom tokenizer, you might want to use that instead and create the "tokens" property yourself – you can find the format in the "Annotation task formats" section in your PRODIGY_README.html.
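If you do end up creating the "tokens" yourself, here's a rough sketch of what that could look like. It assumes each token is a dictionary with "text", "start", "end" and "id" keys, so please double-check the README for the exact fields your version expects. The regex tokenizer is just a placeholder for your model's own tokenizer, and make_tokens, add_custom_tokens and is_valid_token_span are made-up helper names.

```python
import re

# Placeholder tokenizer: runs of word characters, or single punctuation marks.
# Swap this for whatever tokenizer your model actually uses.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def make_tokens(text):
    """Build the token list for one text, with character offsets and IDs."""
    return [
        {"text": m.group(0), "start": m.start(), "end": m.end(), "id": i}
        for i, m in enumerate(TOKEN_RE.finditer(text))
    ]

def add_custom_tokens(stream):
    """Add a "tokens" property to each example in the stream."""
    for eg in stream:
        eg["tokens"] = make_tokens(eg["text"])
        yield eg

def is_valid_token_span(tokens, start_char, end_char):
    """Check that a span starts and ends exactly on token boundaries."""
    starts = {t["start"] for t in tokens}
    ends = {t["end"] for t in tokens}
    return start_char in starts and end_char in ends

eg = next(add_custom_tokens([{"text": "Aspirin can relieve a headache."}]))
print(eg["tokens"])
print(is_valid_token_span(eg["tokens"], 22, 30))  # "headache"
print(is_valid_token_span(eg["tokens"], 22, 27))  # "heada", ends mid-token
```

A check like is_valid_token_span can be handy for spotting predicted spans that don't line up with the tokenization before they reach the annotation interface.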

To send out only one example per text instead of one example per span, your get_stream function could be simplified like this:

def get_stream(stream):
    for eg in stream:
        spans_from_model = get_spans_from_your_model(eg["text"])
        eg["spans"] = [{"start": start_char, "end": end_char, "label": label}
                       for start_char, end_char, label in spans_from_model]
        yield eg

In the return statement of your recipe, you can now change the view_id and add the labels:

return {
    "view_id": "ner_manual",
    "dataset": dataset,
    "stream": stream,
    "config": {
        "labels": ["SOME_LABEL", "FOO", "BAR"]
    }
}

You should now see all entities in the text highlighted and editable, with the list of labels as selectable options on top.