terms.to-patterns looks strange

solved
terms
(Mindaugas) #1

Hi guys,

I am very new to Prodigy, and while exploring it, something suspicious caught my attention. I tried to improve an existing model and followed the steps described here (https://prodi.gy/docs/workflow-first-steps). So basically:
python -m prodigy dataset my_set "Playground" --author Me
python -m prodigy ner.teach my_set en_core_web_sm news_headlines.jsonl --label ORG
python -m prodigy terms.to-patterns my_set out.jsonl --label ORG

Exported patterns look like:
{"label":"ORG","pattern":[{"lower":"The War Between Apple and Google Has Just Begun"}]}
{"label":"ORG","pattern":[{"lower":"Uber\u2019s Lesson: Silicon Valley\u2019s Start-Up Machine Needs Fixing"}]}

My question is: are these patterns correct? Since I accepted only company names as ORG entities, I would expect patterns to look more like this:
{"label":"ORG","pattern":[{"lower":"Apple"}]}
{"label":"ORG","pattern":[{"lower":"Google"}]}
Thanks in advance.


(Ines Montani) #2

Sorry if this was confusing – the terms.to-patterns recipe is designed to convert a dataset of single terms to a patterns file – for example, a dataset created with terms.teach, which would include examples like "text": "Apple". That patterns file can then be used to bootstrap training in ner.teach and make sure the model sees enough positive suggestions.
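To make the expected input concrete, here's a minimal sketch of the kind of conversion terms.to-patterns performs on a dataset of single terms (field names follow the Prodigy docs; this is an illustration, not the recipe's actual source code):

```python
# Sketch: converting a terms.teach-style dataset (one term per example)
# into match patterns. Not the actual recipe implementation.
def terms_to_patterns(examples, label):
    patterns = []
    for eg in examples:
        if eg.get("answer") == "accept":  # keep only accepted terms
            # lowercase the value so it matches spaCy's case-insensitive
            # LOWER token attribute
            patterns.append({"label": label,
                             "pattern": [{"lower": eg["text"].lower()}]})
    return patterns

examples = [
    {"text": "Apple", "answer": "accept"},
    {"text": "Google", "answer": "accept"},
    {"text": "banana", "answer": "reject"},
]
print(terms_to_patterns(examples, "ORG"))
```

This only works as intended when each example's "text" is a single term – which is exactly why feeding it full ner.teach headlines produces the strange patterns above.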

Creating patterns from existing annotations is a good idea, though – you could even use ner.manual to label a few texts manually and then convert the highlighted spans to patterns. There’s no built-in recipe for this, but writing your own converter is pretty straightforward. Essentially, all you have to do is load the dataset, get the accepted annotations and use the "spans" property (highlighted text) to extract the entity text and add it to the list of patterns:

from prodigy.components.db import connect
from prodigy.util import write_jsonl

db = connect()  # connect to DB with settings from prodigy.json
examples = db.get_dataset('my_set')  # load the dataset

patterns = []
for eg in examples:  # iterate over the annotations
    if eg['answer'] == 'accept':  # we only want accepted entities
        spans = eg.get('spans', [])  # get the annotated spans
        for span in spans:
            # get the highlighted text and create a pattern
            text = eg['text'][span['start']:span['end']]
            patterns.append({'pattern': text, 'label': span['label']})

write_jsonl('/path/to/patterns.jsonl', patterns)
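If you'd rather not depend on Prodigy's helpers, the same conversion works with just the standard library. This is a self-contained sketch; the example annotation below is made up to show the span offsets:

```python
import json

# Dependency-free version of the converter: takes a list of annotation
# dicts (shaped like what db.get_dataset returns) and builds patterns
# from the accepted spans.
def annotations_to_patterns(examples):
    patterns = []
    for eg in examples:
        if eg.get("answer") != "accept":
            continue  # only accepted annotations
        for span in eg.get("spans", []):
            text = eg["text"][span["start"]:span["end"]]
            patterns.append({"label": span["label"], "pattern": text})
    return patterns

def write_jsonl(path, rows):
    # one JSON object per line – the format patterns files use
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

examples = [
    {"text": "The War Between Apple and Google Has Just Begun",
     "answer": "accept",
     "spans": [{"start": 16, "end": 21, "label": "ORG"},
               {"start": 26, "end": 32, "label": "ORG"}]},
]
patterns = annotations_to_patterns(examples)
print(patterns)
# write_jsonl('patterns.jsonl', patterns) would then save them to disk
```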

The above example only creates patterns for exact string matches, e.g. "pattern": "Apple". If you want case-insensitive token-based matching, you can use spaCy to tokenize the text for you and create a pattern this way:

import spacy
nlp = spacy.load('en_core_web_sm')  # or spacy.blank('en') for just the tokenizer

text = eg['text'][span['start']:span['end']]
doc = nlp(text)
tokens = [{'lower': token.lower_} for token in doc]
patterns.append({'pattern': tokens, 'label': span['label']})
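With this token-based approach, a two-token entity like "Silicon Valley" would come out as a pattern along these lines:

```json
{"label": "ORG", "pattern": [{"lower": "silicon"}, {"lower": "valley"}]}
```

Each dict in the list matches one token, so the pattern also matches "silicon valley", "SILICON VALLEY" and any other casing.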

You can also check out this thread, which discusses a similar approach and solution for creating patterns:


(Mindaugas) #3

Things are starting to make sense. Thanks for the quick answer!
