Sorry if this was confusing! The terms.to-patterns recipe is designed to convert a dataset of single terms to a patterns file – for example, a dataset created with terms.teach, which would include examples like "text": "Apple". That patterns file can then be used to bootstrap training in ner.teach and make sure the model sees enough positive suggestions.
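For reference, each entry in a patterns file is one JSON object per line with a "label" and a "pattern" – either an exact string or a list of token attributes. For example:

{"label": "ORG", "pattern": "Apple"}
{"label": "ORG", "pattern": [{"lower": "apple"}]}

You can then point ner.teach at the file via the --patterns argument – something like this, where the dataset, model and file names are just placeholders:

prodigy ner.teach my_dataset en_core_web_sm /path/to/data.jsonl --label ORG --patterns /path/to/patterns.jsonl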
Creating patterns from existing annotations is a good idea, though – you could even use ner.manual to label a few texts manually and then convert the highlighted spans to patterns. There's no built-in recipe for this, but writing your own converter is pretty straightforward. Essentially, all you have to do is load the dataset, get the accepted annotations and use the "spans" property (the highlighted text) to extract the entity text and add it to the list of patterns:
from prodigy.components.db import connect
from prodigy.util import write_jsonl

db = connect()                       # connect to the DB with settings from prodigy.json
examples = db.get_dataset('my_set')  # load the dataset

patterns = []
for eg in examples:                  # iterate over the annotations
    if eg['answer'] == 'accept':     # we only want accepted entities
        spans = eg.get('spans', [])  # get the annotated spans
        for span in spans:
            # get the highlighted text and create a pattern
            text = eg['text'][span['start']:span['end']]
            patterns.append({'pattern': text, 'label': span['label']})

write_jsonl('/path/to/patterns.jsonl', patterns)
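If your dataset had an accepted "Apple" span labelled ORG (sticking with the example from above), the file written by this script would contain lines like:

{"pattern": "Apple", "label": "ORG"}

One thing you might want to add, depending on your data – the snippet above doesn't do this – is deduplication, since the same entity text annotated in several examples will produce identical patterns:

import json

seen = set()
unique_patterns = []
for pattern in patterns:
    # serialise the pattern value so both strings and token lists are hashable
    key = (pattern['label'], json.dumps(pattern['pattern']))
    if key not in seen:
        seen.add(key)
        unique_patterns.append(pattern)

write_jsonl('/path/to/patterns.jsonl', unique_patterns)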
The above example only creates patterns for exact string matches, e.g. "pattern": "Apple". If you want case-insensitive token-based matching, you can use spaCy to tokenize the text for you and create a pattern this way:
import spacy

nlp = spacy.blank('en')  # only tokenization is needed, so a blank model is enough

text = eg['text'][span['start']:span['end']]
doc = nlp(text)
tokens = [{'lower': token.lower_} for token in doc]
patterns.append({'pattern': tokens, 'label': span['label']})
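With the token-based version, a highlighted span like "Apple Watch" would come out as a pattern along these lines (the PRODUCT label is just made up for this example):

{"pattern": [{"lower": "apple"}, {"lower": "watch"}], "label": "PRODUCT"}

This matches the tokens regardless of capitalisation – "apple watch", "Apple Watch", "APPLE WATCH" and so on. And since only the tokenization matters here, the blank model is also much faster than loading a full pipeline.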
You can also check out this thread, which discusses a similar approach and solution for creating patterns: