Create new entities from regex

ines · January 25, 2019, 12:31pm

Hi! We try to do our best and answer questions as soon as possible, and I usually put a lot of effort into my replies. However, we can’t guarantee instant replies and help with your implementation. You posted your question late at night my time, and already bumped the thread at noon my time. This really isn’t productive.

You can also always use the search function (button in the top right corner) to see if a question has already been answered. For example, if you type in “regex”, you’ll find threads related to using regular expressions (https://support.prodi.gy/search?q=“regex”). The first result actually shows a very similar approach and solution.

If you just want to stream in regex matches and annotate whether they are correct / suitable training data or not, the easiest way is to write a function that takes the incoming stream of examples, finds the matches in each text and creates one annotation example per match, with the match stored as a span in the example's "spans" list (see the “Annotation task formats” section in your PRODIGY_README.html for details on the JSON format).

Here’s a simple example:

import re
import copy

expression = re.compile(YOUR_REGEX_HERE)
label = 'ORG'  # or any other label

def regex_matcher(stream):
    for eg in stream:
        for match in expression.finditer(eg['text']):  # find matches in example text
            task = copy.deepcopy(eg)  # match found – copy the example
            start, end = match.span()  # get matched indices
            task['spans'] = [{'start': start, 'end': end, 'label': label}]  # label match
            yield task
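To see what the matcher produces, here is a small self-contained run of the same function on two made-up examples. The regex and the sample texts are purely illustrative assumptions, not part of any real dataset, so swap in your own pattern:

```python
import copy
import re

# Hypothetical pattern for illustration only: a capitalized word followed by "Inc." or "Corp."
expression = re.compile(r"\b[A-Z][a-z]+ (?:Inc|Corp)\.")
label = "ORG"

def regex_matcher(stream):
    for eg in stream:
        for match in expression.finditer(eg["text"]):
            task = copy.deepcopy(eg)   # one task per match, original example untouched
            start, end = match.span()  # character offsets of the match
            task["spans"] = [{"start": start, "end": end, "label": label}]
            yield task

# Made-up input stream: one text with two matches, one text with none
stream = [
    {"text": "Acme Inc. sued Globex Corp. last year."},
    {"text": "Nothing to see here."},
]

tasks = list(regex_matcher(stream))
for task in tasks:
    span = task["spans"][0]
    print(task["text"][span["start"]:span["end"]], span)
```

Note that a text with two matches yields two separate tasks, each highlighting a single span, so every match gets its own accept / reject decision. Texts without a match produce no tasks at all.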

Here’s a custom recipe template to get you started:

https://github.com/explosion/prodigy-recipes/blob/master/other/mark.py

# coding: utf8
from __future__ import unicode_literals

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.util import split_string
from collections import Counter


# Recipe decorator with argument annotations: (description, argument type,
# shortcut, type / converter function called on value before it's passed to
# the function). Descriptions are also shown when typing --help.
@prodigy.recipe('mark',
    dataset=("The dataset to use", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    view_id=("ID of annotation interface", "option", "o", str),
    exclude=("Names of datasets to exclude", "option", "e", split_string)
)
def mark(dataset, source, view_id, exclude=None):
    """

(The excerpt is truncated here; see the original file for the full recipe.)

Using the view_id "ner", you can render the examples as highlighted entities, and then accept or reject them. The annotations will then be saved to the given dataset, and you can then use them to update a model.
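Once you've collected annotations, you'll probably want to separate the accepted matches from the rejected ones. The following is only a rough sketch: it assumes each saved example carries an "answer" key with the value "accept", "reject" or "ignore", and it converts the accepted ones into the simple (text, {"entities": [...]}) tuples used in spaCy v2-style training examples. The function name and the sample annotations are made up for illustration.

```python
def accepted_entities(annotations):
    """Yield (text, {"entities": [(start, end, label), ...]}) for accepted tasks."""
    for eg in annotations:
        if eg.get("answer") != "accept":  # skip rejected and ignored examples
            continue
        entities = [(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])]
        yield eg["text"], {"entities": entities}

# Made-up annotations, shaped like the tasks created by the regex matcher
annotations = [
    {"text": "Acme Inc. is hiring.",
     "spans": [{"start": 0, "end": 9, "label": "ORG"}],
     "answer": "accept"},
    {"text": "Read more in Chapter Inc. of the book.",
     "spans": [{"start": 13, "end": 25, "label": "ORG"}],
     "answer": "reject"},
]

training_data = list(accepted_entities(annotations))
print(training_data)
```

Keep in mind that the rejected examples aren't useless: they tell you where your regex produces false positives, which can help you refine the pattern.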
