terms.to-patterns looks strange

solved
terms
(Mindaugas) #1

Hi guys,

I am very new to Prodigy, and while exploring it, something suspicious caught my attention. I tried to improve an existing model and followed the steps described here (https://prodi.gy/docs/workflow-first-steps). So basically:
python -m prodigy dataset my_set "Playground" --author Me
python -m prodigy ner.teach my_set en_core_web_sm news_headlines.jsonl --label ORG
python -m prodigy terms.to-patterns my_set out.jsonl --label ORG

Exported patterns look like:
{"label":"ORG","pattern":[{"lower":"The War Between Apple and Google Has Just Begun"}]}
{"label":"ORG","pattern":[{"lower":"Uber\u2019s Lesson: Silicon Valley\u2019s Start-Up Machine Needs Fixing"}]}

My question is: are these patterns correct? Since I accepted only company names as ORG entities, I would expect patterns to look more like this:
{"label":"ORG","pattern":[{"lower":"Apple"}]}
{"label":"ORG","pattern":[{"lower":"Google"}]}
Thanks in advance.


(Ines Montani) #2

Sorry if this was confusing – the terms.to-patterns recipe is designed to convert a dataset of single terms to a patterns file – for example, a dataset created with terms.teach, which would include examples like "text": "Apple". That patterns file can then be used to bootstrap training in ner.teach and make sure the model sees enough positive suggestions.
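To make the expected input concrete, here's a minimal sketch of the kind of conversion terms.to-patterns performs on a dataset of single terms (field names follow the Prodigy docs; this is an illustration, not the recipe's actual source code):

```python
# Sketch: converting a terms.teach-style dataset (one term per example)
# into match patterns. Not the actual recipe implementation.
def terms_to_patterns(examples, label):
    patterns = []
    for eg in examples:
        if eg.get("answer") == "accept":  # keep only accepted terms
            # lowercase the value so it matches spaCy's case-insensitive
            # LOWER token attribute
            patterns.append({"label": label,
                             "pattern": [{"lower": eg["text"].lower()}]})
    return patterns

examples = [
    {"text": "Apple", "answer": "accept"},
    {"text": "Google", "answer": "accept"},
    {"text": "banana", "answer": "reject"},
]
print(terms_to_patterns(examples, "ORG"))
```

This only works as intended when each example's "text" is a single term – which is exactly why feeding it full ner.teach headlines produces the strange patterns above.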

Creating patterns from existing annotations is a good idea, though – you could even use ner.manual to label a few texts manually and then convert the highlighted spans to patterns. There’s no built-in recipe for this, but writing your own converter is pretty straightforward. Essentially, all you have to do is load the dataset, get the accepted annotations and use the "spans" property (highlighted text) to extract the entity text and add it to the list of patterns:

from prodigy.components.db import connect
from prodigy.util import write_jsonl

db = connect()  # connect to DB with settings from prodigy.json
examples = db.get_dataset('my_set')  # load the dataset

patterns = []
for eg in examples:  # iterate over the annotations
    if eg['answer'] == 'accept':  # we only want accepted entities
        spans = eg.get('spans', [])  # get the annotated spans
        for span in spans:
            # get the highlighted text and create a pattern
            text = eg['text'][span['start']:span['end']]
            patterns.append({'pattern': text, 'label': span['label']})

write_jsonl('/path/to/patterns.jsonl', patterns)
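If you'd rather not depend on Prodigy's helpers, the same conversion works with just the standard library. This is a self-contained sketch; the example annotation below is made up to show the span offsets:

```python
import json

# Dependency-free version of the converter: takes a list of annotation
# dicts (shaped like what db.get_dataset returns) and builds patterns
# from the accepted spans.
def annotations_to_patterns(examples):
    patterns = []
    for eg in examples:
        if eg.get("answer") != "accept":
            continue  # only accepted annotations
        for span in eg.get("spans", []):
            text = eg["text"][span["start"]:span["end"]]
            patterns.append({"label": span["label"], "pattern": text})
    return patterns

def write_jsonl(path, rows):
    # one JSON object per line – the format patterns files use
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

examples = [
    {"text": "The War Between Apple and Google Has Just Begun",
     "answer": "accept",
     "spans": [{"start": 16, "end": 21, "label": "ORG"},
               {"start": 26, "end": 32, "label": "ORG"}]},
]
patterns = annotations_to_patterns(examples)
print(patterns)
# write_jsonl('patterns.jsonl', patterns) would then save them to disk
```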

The above example only creates patterns for exact string matches, e.g. "pattern": "Apple". If you want case-insensitive token-based matching, you can use spaCy to tokenize the text for you and create a pattern this way:

import spacy
nlp = spacy.load('en_core_web_sm')  # or spacy.blank('en') for just the tokenizer

text = eg['text'][span['start']:span['end']]
doc = nlp(text)
tokens = [{'lower': token.lower_} for token in doc]
patterns.append({'pattern': tokens, 'label': span['label']})
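With this token-based approach, a two-token entity like "Silicon Valley" would come out as a pattern along these lines:

```json
{"label": "ORG", "pattern": [{"lower": "silicon"}, {"lower": "valley"}]}
```

Each dict in the list matches one token, so the pattern also matches "silicon valley", "SILICON VALLEY" and any other casing.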

You can also check out this thread, which discusses a similar approach and solution for creating patterns:


(Mindaugas) #3

Things are starting to make sense. Thanks for the quick answer!
