Hi! Sorry if this wasn't fully clear. I'll see if we can add the pattern file details more prominently in the docs.
The good news is that spaCy patterns are fully compatible with Prodigy. So in order to use your existing patterns, all you have to do is create a file like patterns.jsonl containing one object per line, each with the keys "label" and "pattern". For example:
{"label": "YOUR_LABEL", "pattern": [{"IS_ASCII": true}, {"ORTH": "-"}, {"IS_ASCII": true}]}
This is also the same format used by spaCy's new EntityRuler btw, so if you've been working with that, you can reuse the exact same patterns files.
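If your patterns currently live in Python code rather than in a file, a small sketch like this could write them out in the JSONL format described above. The helper names write_patterns and read_patterns are just made up for illustration, and the label is the same placeholder as in the example:

```python
import json
from pathlib import Path

# Same placeholder pattern as in the example above.
patterns = [
    {
        "label": "YOUR_LABEL",
        "pattern": [{"IS_ASCII": True}, {"ORTH": "-"}, {"IS_ASCII": True}],
    },
]


def write_patterns(patterns, path):
    """Write the patterns as JSONL: one JSON object per line."""
    lines = [json.dumps(entry) for entry in patterns]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf8")


def read_patterns(path):
    """Read a JSONL patterns file back into a list of dicts."""
    text = Path(path).read_text(encoding="utf8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Calling write_patterns(patterns, "patterns.jsonl") should then give you a file you can pass to the recipe.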
To test your patterns, you can use the ner.match recipe, which will show you all matches in the data and ask you to accept / reject them. For example:
prodigy ner.match your_dataset en_core_web_sm /path/to/your_data.jsonl /path/to/patterns.jsonl --label YOUR_LABEL
The ner.make-gold workflow currently doesn't have a --patterns argument. It really only goes through the doc.ents set by a spaCy model, pre-highlights them in the texts and lets you correct those entities manually. However, thanks to spaCy v2.1 and the new EntityRuler, you can still make this work:
- Create a new EntityRuler and add your patterns to it (see the spaCy EntityRuler documentation for more info).
- Load a pre-trained model and add the entity ruler to the pipeline.
- Save the modified model with the entity ruler to disk using nlp.to_disk. The entity ruler and its patterns will be serialized automatically and loaded back in when you load the model. The doc.ents set by that model now include the pattern matches.
- Load the saved model into ner.make-gold and annotate entity predictions plus pattern matches.
prodigy ner.make-gold your_dataset /path/to/saved-model /path/to/your_data.jsonl --label YOUR_LABEL
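The entity ruler steps could look roughly like the sketch below. The function name add_ruler_and_save is made up for illustration. It also assumes the ruler is simply appended at the end of the pipeline, which this post doesn't prescribe, and it branches on the spaCy version because the way components are added changed after v2:

```python
import spacy


def add_ruler_and_save(nlp, patterns, output_dir):
    """Add an entity ruler with the given patterns to nlp and save it to disk."""
    if spacy.__version__.startswith("2."):
        # spaCy v2.1+: create the component object, then add it to the pipeline.
        from spacy.pipeline import EntityRuler

        ruler = EntityRuler(nlp)
        ruler.add_patterns(patterns)
        # Assumption: appended last. Use before="ner" if you'd rather have
        # the pattern matches set before the model predicts entities.
        nlp.add_pipe(ruler)
    else:
        # Later spaCy versions: add the component by its registered name.
        ruler = nlp.add_pipe("entity_ruler")
        ruler.add_patterns(patterns)
    # The ruler and its patterns are saved together with the rest of the model.
    nlp.to_disk(output_dir)
    return nlp
```

For example, you could call add_ruler_and_save(spacy.load("en_core_web_sm"), patterns, "/path/to/saved-model"), where patterns is a list of dicts in the same "label" / "pattern" format as your patterns file, and then point ner.make-gold at the saved directory as in the command above.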