I want to use ner.match but with my custom ner model.
My model takes in a text and outputs the span of the recognized term and the label associated with it.
I want the ner.match recipe to show the highlighted term along with its label from my model.
How do I wrap the ner.match recipe to achieve this? And if this cannot be done by simply wrapping the ner.match recipe, how do I create a custom recipe to achieve the same result?
Thank you!
Hi! I think you might find it easier to write your own recipe, since it'll make it easier to see what's going on, and the logic itself isn't that complicated.
Here's a simplified version of the ner.match recipe with some comments that explain what's going on:
The recipe above uses the pattern matcher and spaCy to add the pattern matches to your stream. Instead, you can write a function that takes a text and returns the start and end character offsets and the label. For each span, you can then yield out a dictionary with the "text" and "spans". Here's an example:
    def get_stream(stream):
        for eg in stream:
            spans_from_model = get_spans_from_your_model(eg["text"])
            for start_char, end_char, label in spans_from_model:
                # Let's assume your function returns a tuple of the start and end
                # offset and the label. For each span, we now create a new task
                # and send it out
                spans = [{"start": start_char, "end": end_char, "label": label}]
                yield {"text": eg["text"], "spans": spans}
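If it helps to see the expected contract of get_spans_from_your_model, here is a minimal sketch you can run without Prodigy. The regex-based "model", the TOY_TERMS table and the example sentence are all made up for illustration; your real function only needs to return (start_char, end_char, label) tuples in the same way.

```python
import re

# Hypothetical stand-in for your own model: it "recognizes" a few fixed
# terms with a regex and returns (start_char, end_char, label) tuples.
TOY_TERMS = {"Berlin": "CITY", "Paris": "CITY", "Prodigy": "PRODUCT"}
TOY_PATTERN = re.compile("|".join(re.escape(term) for term in TOY_TERMS))


def get_spans_from_your_model(text):
    return [
        (match.start(), match.end(), TOY_TERMS[match.group()])
        for match in TOY_PATTERN.finditer(text)
    ]


def get_stream(stream):
    # Same logic as above: one task per span found by the model.
    for eg in stream:
        for start_char, end_char, label in get_spans_from_your_model(eg["text"]):
            spans = [{"start": start_char, "end": end_char, "label": label}]
            yield {"text": eg["text"], "spans": spans}


examples = [{"text": "I used Prodigy in Berlin."}]
tasks = list(get_stream(examples))
for task in tasks:
    span = task["spans"][0]
    # Slice the text with the offsets to check they point at the right term.
    print(task["text"][span["start"]:span["end"]], span["label"])
```

Slicing the text with the returned offsets, as in the loop at the end, is a quick way to confirm that your model's offsets are character offsets into the original text and not, say, token indices.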
In your recipe, you can then load your data (however you want) and create your stream:
stream = JSONL(source)
stream = get_stream(stream)
So the most basic version of your recipe could look like this (plus the get_stream function of course):
    import prodigy
    from prodigy.components.loaders import JSONL

    @prodigy.recipe('custom.ner.match',
        dataset=("The dataset to use", "positional", None, str),
        source=("The source data as a JSONL file", "positional", None, str),
    )
    def custom_ner_match(dataset, source):
        stream = JSONL(source)
        stream = get_stream(stream)
        return {
            'view_id': 'ner',    # Annotation interface to use
            'dataset': dataset,  # Name of dataset to save annotations
            'stream': stream,    # Incoming stream of examples
        }
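In case the input format is unclear: since get_stream reads eg["text"], each line of the source file should be a JSON object with at least a "text" key. Here is a small sketch of that shape using only the standard library. The read_jsonl helper is a hypothetical stand-in that yields one dictionary per line, and the file name and sentences are made up.

```python
import json
import os
import tempfile


def read_jsonl(path):
    # Hypothetical stand-in for a JSONL loader: one dict per non-empty line.
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


rows = [
    {"text": "I used Prodigy in Berlin."},
    {"text": "No entities in this one."},
]

# Write one JSON object per line, which is all the JSONL format requires.
path = os.path.join(tempfile.mkdtemp(), "source.jsonl")
with open(path, "w", encoding="utf8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

loaded = list(read_jsonl(path))
print(loaded)
```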
Thank you so much!
Worked like a charm.
Follow up question:
Is it possible to create a custom recipe that combines the functionality of ner.match and ner.manual?
Yay, glad to hear it worked!
Sure! You should only have to change a few small things:
- use the ner_manual view ID instead of just ner
- add a "config": {"labels": [...]} to the components returned by your recipe that defines the full label scheme you can select
- make sure each incoming example is tokenized and has a "tokens" property (to allow quick highlighting that "snaps" to token boundaries)
- only send out one example per text (instead of one example per span) because you probably want to see all matches at once, right?
For tokenization, Prodigy has a built-in add_tokens helper. You can also see an example of it in the prodigy-recipes repo. The function takes a spaCy nlp object for tokenization and the stream, and will add a "tokens" property to each example.
import spacy
from prodigy.components.preprocess import add_tokens
# At the end of your recipe
nlp = spacy.load(spacy_model)
stream = add_tokens(nlp, stream)
One thing that's important to note here: the tokenization used here should match the tokenization of your custom model and allow the entities to be valid token spans. So if your model uses a custom tokenizer, you might want to use that instead and create the "tokens" property yourself. You can find the format in the "Annotation task formats" section in your PRODIGY_README.html.
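If you do go the custom tokenizer route, a rough sketch could look like the following. Everything here is an assumption to check against the "Annotation task formats" section: I'm assuming each token is a dictionary with "text", "start", "end" and "id", and that spans carry "token_start" and "token_end" as inclusive token indices. The simple_tokenize function is a hypothetical regex tokenizer standing in for whatever your model actually uses.

```python
import re


def simple_tokenize(text):
    # Hypothetical tokenizer: runs of word characters or single punctuation
    # marks. Swap in the tokenizer your model really uses.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def add_custom_tokens(stream):
    for eg in stream:
        tokens = [
            {"text": tok, "start": start, "end": end, "id": i}
            for i, (tok, start, end) in enumerate(simple_tokenize(eg["text"]))
        ]
        # Map character offsets to token indices to align the spans.
        starts = {tok["start"]: tok["id"] for tok in tokens}
        ends = {tok["end"]: tok["id"] for tok in tokens}
        for span in eg.get("spans", []):
            if span["start"] not in starts or span["end"] not in ends:
                # The model's span doesn't map to whole tokens.
                raise ValueError(f"Span {span} does not align with token boundaries")
            span["token_start"] = starts[span["start"]]
            span["token_end"] = ends[span["end"]]  # assumed inclusive
        eg["tokens"] = tokens
        yield eg


example = {
    "text": "I used Prodigy in Berlin.",
    "spans": [
        {"start": 7, "end": 14, "label": "PRODUCT"},
        {"start": 18, "end": 24, "label": "CITY"},
    ],
}
result = list(add_custom_tokens([example]))
print(result[0]["spans"])
```

Raising an error on misaligned spans is just one option. You could also skip those spans or log them, depending on how often your model's offsets disagree with the tokenization.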
To only send out one example per text instead of one example per span, your get_stream function could be simplified like this:
    def get_stream(stream):
        for eg in stream:
            spans_from_model = get_spans_from_your_model(eg["text"])
            eg["spans"] = [{"start": start_char, "end": end_char, "label": label}
                           for start_char, end_char, label in spans_from_model]
            yield eg
In the return statement of your recipe, you can now change the view_id and add the labels:
    return {
        "view_id": "ner_manual",
        "dataset": dataset,
        "stream": stream,
        "config": {
            "labels": ["SOME_LABEL", "FOO", "BAR"]
        }
    }
You should now see all entities in the text highlighted and editable, with the list of labels as selectable options on top.