How to use a spaCy pattern in Prodigy

In my spaCy models, I need to avoid splitting on dashes, and I've used the following pattern:

{"IS_ASCII": True}, {"ORTH": "-"}, {"IS_ASCII": True}

As a brand new user to Prodigy however, I can't figure out how to incorporate this into my ner.make-gold training. It looks like I need to make a pattern file, but I'm not sure where to pass that and what the format of it is supposed to be. Thanks for the help!

I've already looked through the forums for this.

Hi! Sorry if this wasn't fully clear – I'll see if we can add the pattern file details more prominently in the docs 🙂

The good news is, spaCy patterns are fully compatible with Prodigy. So in order to use your existing patterns, all you have to do is create a file like patterns.jsonl containing one JSON object per line, each with a "label" and a "pattern" key. For example:

{"label": "YOUR_LABEL", "pattern": [{"IS_ASCII": true}, {"ORTH": "-"}, {"IS_ASCII": true}]}

This is also the same format used by spaCy's new EntityRuler btw – so if you've been working with that, you can reuse the exact same pattern files.
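For instance, with spaCy v2.1 you should be able to load that same file straight into an EntityRuler – a minimal sketch, assuming the patterns live in a local patterns.jsonl:

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
# EntityRuler.from_disk reads exactly this newline-delimited JSON format
ruler = EntityRuler(nlp).from_disk("patterns.jsonl")
nlp.add_pipe(ruler)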

To test your patterns, you can use the ner.match recipe, which will show you all matches in the data and ask you to accept / reject them. For example:

prodigy ner.match your_dataset en_core_web_sm /path/to/your_data.jsonl /path/to/patterns.jsonl --label YOUR_LABEL

The ner.make-gold workflow currently doesn't have a --patterns argument – it really only goes through the doc.ents set by a spaCy model, pre-highlights them in the texts and lets you correct those entities manually. However, thanks to spaCy v2.1 and the new EntityRuler, you can still make this work:

  • Create a new EntityRuler and add your patterns to it (see the EntityRuler usage docs for more info).
  • Load a pre-trained model and add the entity ruler to the pipeline.
  • Save the modified model with the entity ruler to disk using nlp.to_disk – the entity ruler and its patterns will be serialized automatically and loaded back in when you load the model. The doc.ents set by that model now include the pattern matches.
  • Load the saved model into ner.make-gold and annotate entity predictions plus pattern matches:

prodigy ner.make-gold your_dataset /path/to/saved-model /path/to/your_data.jsonl --label YOUR_LABEL

Thanks for the help @ines! I feel like I’m much closer, and while the EntityRuler recognizes my pattern, the text displayed to the user is still segmented. Here’s a complete minimal working example trying to follow the instructions you laid out:

First create a custom model and test that it works:

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)

# match an ASCII token, a dash and another ASCII token
keep_dash = [{"IS_ASCII": True}, {"ORTH": "-"}, {"IS_ASCII": True}]
patterns = [{"label": "keep_dash", "pattern": keep_dash}]

ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
nlp.to_disk("custom_model")  # save the model including the entity ruler

doc = nlp(u"Funding support from fictional grant P50 CA032133-06A7 (Hoppe).")
print([(ent.text, ent.label_) for ent in doc.ents])

This produces the expected output: [('CA032133-06A7', 'keep_dash')]. My pattern has matched and is one token. Now create a simple JSONL file to test with a single line:

{"text":"Funding support from fictional grant P50 CA032133-06A7 (Hoppe)."}

Running the matcher with

prodigy ner.manual segmentation_task custom_model problem_with_dashes.jsonl --label "GRANT_NUM"

This still seems to tokenize the pattern in the annotation UI.

How can I get this to be displayed as a single token to the user, the way the original text has it?

Glad to hear my reply was helpful!

One thing that's very important to note here: the matcher or entity ruler does not change the tokenization! Your pattern matches three tokens, and the matched span CA032133-06A7 still consists of three tokens. They aren't all followed by whitespace, but they're still three separate tokens. (You can check this by printing [token.text for token in doc], or [token.text for token in ent] for the entity span.) The entity ruler only helps find those tokens, creates a Span object containing those tokens and a label, and adds that span to doc.ents.
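To make this concrete, here's a quick check based on your own snippet above (reusing your keep_dash pattern and example sentence):

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "keep_dash",
                     "pattern": [{"IS_ASCII": True}, {"ORTH": "-"}, {"IS_ASCII": True}]}])
nlp.add_pipe(ruler)

doc = nlp(u"Funding support from fictional grant P50 CA032133-06A7 (Hoppe).")
ent = doc.ents[0]
print(ent.text)                       # 'CA032133-06A7' – looks like one token...
print([token.text for token in ent])  # ...but it's really three: ['CA032133', '-', '06A7']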

I think you want to be running ner.make-gold instead of ner.manual?

The ner.manual recipe will just stream in the text and ask you to highlight things manually. The model will be used only for tokenization, i.e. splitting the text into tokens to make highlighting easier (because the selection can snap to the token boundaries). When you highlight one or more tokens, they'll be displayed with a yellow background. But they'll still be separate tokens. Prodigy just displays them slightly spaced out, so it's easier to see that they are separate tokens.

The ner.make-gold recipe helps you label more efficiently because it also uses the entities assigned by the model and pre-highlights them. So if your model already assigns entities (via the statistical entity recognizer or a rule-based process like the entity ruler), this makes annotation faster, because ideally, you need to label less. All entity spans in the doc.ents will be added as "spans" in Prodigy and pre-highlighted.
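So for your example, I think the command you want looks roughly like this (assuming the custom_model directory and dataset name from your snippets, and the keep_dash label set by your pattern):

prodigy ner.make-gold segmentation_task custom_model problem_with_dashes.jsonl --label keep_dash

The span matched by the entity ruler will then come in pre-highlighted with the keep_dash label – but as explained above, it will still consist of three tokens.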