patterns using regex or shape

As a first attempt at working with patterns I am trying to implement a simple search for zipcode and am having trouble with the pattern.
I am putting the patterns in a jsonl file and running a command like the following:

prodigy ner.teach zipcode_db en_core_web_sm  [path to .jsonl]  --label ZIPCODE --patterns zipcode_pattern.jsonl

The closest I can get to work is:

 {"IS_DIGIT":true}],"label":"ZIPCODE"}

but that matches any number, not just numbers that are 5 digits long.
I have also tried:

{"SHAPE":"ddddd"}],"label":"ZIPCODE"}
{"shape":"ddddd"}],"label":"ZIPCODE"}
{"pattern":[ {"IS_DIGIT":true} ,{"SHAPE":"XXXXX"}],"label":"ZIPCODE"}

If I wanted to use a regular expression for the same problem, how would I do that?
Can I use shape to get this to work?

I have seen "shape" written both as SHAPE and as shape. Which is correct? I cannot seem to find documentation for it.

thanks for the help!

The keys should be case insensitive, so either "shape" or "SHAPE" is fine. If you’re assembling the dict in Python, you might also find it convenient to import the numeric ID from spacy.symbols.

The easiest way to write the pattern would be to take an example of the text you want to match, and make a doc object, ideally in the interpreter (or a Jupyter notebook). Then you can find the values of the attributes, e.g. the shape_ attribute. You could also specify IS_DIGIT: True and LENGTH: 5 if you want.

We don’t currently support regular expressions in the match rules. The main alternative is to define a new binary flag, which you add with nlp.vocab.add_flag(). The flag function should take a string as its argument, and return a boolean value. This way, you can indicate whether the token’s text matches a regular expression.

This approach isn’t so convenient if you’re passing in a patterns.jsonl file and using the built-in recipes. You have to add the flag at the start of the recipe, and then this will give you the numeric ID of the flag which you can insert into your patterns.
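As a rough sketch of the flag-function idea: the function below takes a token's text and returns a boolean, using Python's standard `re` module. The name `is_us_zip` and the exact regex are illustrative assumptions, not something from spaCy or Prodigy. Registration through `nlp.vocab.add_flag()` appears only in a comment, so the snippet runs without spaCy installed.

```python
import re

# Hypothetical helper (name and regex are assumptions, not from the thread):
# the whole token must be exactly five ASCII digits.
ZIP_RE = re.compile(r"[0-9]{5}")


def is_us_zip(text):
    """Return True if the whole token text is exactly five digits."""
    return ZIP_RE.fullmatch(text) is not None


# With spaCy loaded, the flag would be registered roughly like this
# (following the description above; not executed here):
#   IS_ZIP = nlp.vocab.add_flag(is_us_zip)
# The returned numeric ID is what would then go into the token pattern.

for text in ["90210", "123", "123456", "9021a"]:
    print(text, is_us_zip(text))
```

Using `fullmatch` rather than `search` matters here: a partial match would let a 6-digit token through.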

From your comments I did not understand what was wrong with the {"SHAPE":"ddddd"}],"label":"ZIPCODE"} pattern. I would like to understand so that I can use shape for more complicated cases.

I ended up using {"pattern":[ {"IS_DIGIT":true ,"LENGTH":5} ],"label":"ZIPCODE"}
By contrast, [ {"IS_DIGIT":true}, {"LENGTH":5} ] matched a digit token followed by a second token of length 5.

When doing the following:
nlp = spacy.load('en')
doc = nlp(u'12345')
and using tab completion on doc, I looked through most of the attributes but could not find any of the kind you were describing.

Could you please elaborate? Thanks!

You want {"IS_DIGIT": true, "LENGTH": 5} -- you're applying two predicates to specify one token, so they go in the same dict. If you provide them in two dicts, you're specifying two tokens.
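To make the "one dict per token" rule concrete, here is a toy model of it. This is not spaCy's Matcher: the `PREDICATES` table, `token_matches`, and `find_matches` are made-up names, and the tokens are just whitespace-split strings. It only shows why two predicates in one dict constrain a single token, while two dicts describe two consecutive tokens.

```python
# Toy illustration only: NOT spaCy's Matcher, just a model of the
# "one dict per token" rule, supporting two predicates on whitespace tokens.
PREDICATES = {
    "IS_DIGIT": lambda tok, want: tok.isdigit() == want,
    "LENGTH": lambda tok, want: len(tok) == want,
}


def token_matches(token, spec):
    """A token matches a dict only if every predicate in the dict holds."""
    return all(PREDICATES[key.upper()](token, value) for key, value in spec.items())


def find_matches(tokens, pattern):
    """Return (start, end) spans where consecutive tokens match the pattern."""
    n = len(pattern)
    return [
        (i, i + n)
        for i in range(len(tokens) - n + 1)
        if all(token_matches(tokens[i + k], pattern[k]) for k in range(n))
    ]


tokens = "Call 7 about 90210 today".split()

one_token = [{"IS_DIGIT": True, "LENGTH": 5}]      # a single 5-digit token
two_tokens = [{"IS_DIGIT": True}, {"LENGTH": 5}]   # a digit token, then any 5-char token

print(find_matches(tokens, one_token))   # [(3, 4)]
print(find_matches(tokens, two_tokens))  # [(1, 3), (3, 5)]
```

The second pattern picks up "7 about" and "90210 today", which mirrors the "digit followed by a string of length 5" behaviour reported above.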

The following works:


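>>> import spacy
>>> from spacy.matcher import Matcher
>>> nlp = spacy.load('en')
>>> matcher = Matcher(nlp.vocab)
>>> matcher.add('ZIP', None, [{'shape': 'dddd'}])
>>> doc = nlp(u'Beverley Hills 90210')
>>> matcher(doc)
[(7711403427203968788, 2, 3)]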

So it's possible you had a similar problem before?

I might be answering the wrong question here, but possibly you want to get a token object with token = doc[0] and tab complete through that?

I'm a bit nervous that the tab completion might miss things, however --- because spaCy is written in Cython, sometimes the automatic code inspection fails. I also never use tab completion, so I don't see problems as they occur.

Yes, thank you!

One strange thing I am finding. If:
doc = nlp(u'123')
then:
doc[0].shape_ == 'ddd'

but if the string is any number longer than 4 digits:
doc = nlp(u'123456')
doc[0].shape_ == 'dddd'

Will this serve as a filtering step for the model and/or the annotations?

When predicting if a given string is a certain entity I would like for both the model AND the annotation procedure to prefilter on the pattern.

For the zipcode example, I would like to annotate only strings that are 5 digits long. I would like the model to never predict that a string is a zipcode unless it is a 5 digit string (I have other 5 digit strings that are not zipcodes in my corpus).

How would I set this up?

It might be best to wrap the ner.teach recipe, so that you can modify the stream. You can find more about writing a wrapper here: https://prodi.gy/docs/workflow-custom-recipes#example-wrapping

The basic idea is you’re just going to write a recipe function that calls into the ner.teach recipe, and gets the components it returns. This allows us to intercept the tasks before they’re passed to the REST API, to prevent questions from being asked if we can figure out the answer in some easy way. Additionally, we can auto-reject the dropped tasks, so that they’re stored in the dataset and used to update the model.

I think this approach should give you what you want — it’s basically auto-answering the questions.

import prodigy
from prodigy.recipes.ner import teach

@prodigy.recipe('custom.ner.teach', **teach.__annotations__)
def custom_ner_teach(dataset, spacy_model, source=None, api=None, loader=None,
          label=None, patterns=None, exclude=None):
    """Custom wrapper for ner.teach recipe that replaces the stream."""
    components = teach(**locals())
    
    original_stream = components['stream']
    original_update = components['update']
    bad_spans = []
    def get_modified_stream():
        nonlocal bad_spans
        for eg in original_stream:
            for span in eg['spans']:
                if span['label'] == 'ZIP' and len(span['text']) != 5:
                    eg['answer'] = 'reject'
                    bad_spans.append(eg)
                    break
            else:
                yield eg
        
    def modified_update(batch):
        nonlocal bad_spans
        batch = batch + bad_spans
        bad_spans = []
        return original_update(batch)

    components['stream'] = get_modified_stream()
    components['update'] = modified_update
    return components

This is part of the definition of the SHAPE feature: to make word shapes less sparse, contiguous sequences are clipped at 4. Otherwise the shape would be sensitive to the exact length of the word.

The word shape feature was originally created for NER using linear models. The definition is a bit arbitrary --- it's just what people have found to work well.
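The clipping rule can be sketched in a few lines of plain Python. This is a simplified re-implementation for illustration, not spaCy's own code; the function name `word_shape` and the `max_run` parameter are assumptions made for this example.

```python
def word_shape(text, max_run=4):
    """Approximate word shape: X/x/d for upper/lower/digit, runs clipped."""
    shape = []
    last = None
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        # Count how long the current run of identical shape characters is.
        run = run + 1 if shape_char == last else 1
        last = shape_char
        # Only the first `max_run` characters of a run are emitted.
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)


for text in ["123", "90210", "123456", "Beverley"]:
    print(text, word_shape(text))
```

Under this rule a shape like 'ddddd' can never be produced, so a 5-digit and a 6-digit token end up with the same shape. Pairing the shape with a length check is one way to pin down the digit count.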

This is great and gives me a good idea of how to clean up the model's predicted entities after it has run. But any idea how to restrict the model so that it only predicts an entity when a pattern is matched?

I would think that with your approach, because you are not rejecting numbers of an inappropriate length, the model will be more likely to predict bad zipcode labels.

There will be a big difference in model performance between doing the pattern matching before training and doing it during training. For instance, if you are training on multiple labels, an incorrect zipcode label will affect the predicted labels of the surrounding tokens.

Well, we're actually automatically labelling the examples and passing them into the update() function. So, we're training the model not to make those predictions, just as if you'd clicked "reject" on them. If the model continually predicts ZIP when the length is only 3, it'll keep getting those as negative examples, forcing it to learn not to do that.
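Stripped of the Prodigy specifics, the auto-answering idea looks roughly like the sketch below. Everything here is a stand-in: `make_auto_reject`, `is_valid`, the task dicts, and the list used as a fake model are assumptions for the demo, not Prodigy's API.

```python
def make_auto_reject(stream, update, is_valid):
    """Wrap a stream/update pair so invalid suggestions are rejected automatically.

    `stream` is an iterable of task dicts, `update` a callable taking a list
    of answered tasks, and `is_valid` a hypothetical predicate on a task.
    """
    rejected = []

    def filtered_stream():
        for task in stream:
            if is_valid(task):
                yield task  # goes on to the annotator
            else:
                # Answer the question ourselves and keep it for the next update.
                rejected.append(dict(task, answer="reject"))

    def wrapped_update(batch):
        # Hand the buffered rejects to the model together with the human answers.
        combined = list(batch) + rejected[:]
        rejected.clear()
        return update(combined)

    return filtered_stream(), wrapped_update


# Stand-ins for the recipe components (assumptions for the demo).
tasks = [{"text": "90210"}, {"text": "123"}, {"text": "10001"}]
seen_by_model = []

stream, update = make_auto_reject(
    tasks,
    update=seen_by_model.extend,
    is_valid=lambda task: len(task["text"]) == 5,
)

shown = list(stream)  # what the annotator would be asked about
update([dict(task, answer="accept") for task in shown])
print(seen_by_model)
```

Note that in this sketch the buffered rejects only reach `update` when a batch of human answers comes back, so nothing is passed along until the annotator has answered something.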

As written, it is not updating the db, even when saving from the interface or after several batches have been processed. I do not know why. I ended up updating the db in the custom teach method (and also made it applicable to any patterns file).

import prodigy
from prodigy.recipes.ner import teach
from prodigy.components.db import connect

import spacy
from spacy.matcher import Matcher
import json


@prodigy.recipe('must_match_pattern.ner.teach',
    patterns=prodigy.recipe_args['patterns'],
    dataset=prodigy.recipe_args['dataset'],
    spacy_model=prodigy.recipe_args['spacy_model'],
    database=("Database to connect to", "positional", None, str),
    label=prodigy.recipe_args['label'])
def custom_ner_teach(dataset, spacy_model, database, patterns, label):
    """Custom wrapper for the ner.teach recipe that replaces the stream.

    Automatically rejects a suggested annotation if it does not match
    any pattern from the patterns file.
    """
    components = teach(dataset=dataset, spacy_model=spacy_model,
                       source=database, label=label, patterns=patterns)

    original_stream = components['stream']
    original_update = components['update']

    # add all the patterns to the matcher
    nlp = spacy.load('en')
    matcher = Matcher(nlp.vocab)

    # read in patterns file and for each pattern add it to spacy matcher
    with open(str(patterns), "r") as f:
        for line in f:
            print (line)
            matcher.add(label, None, json.loads(line)['pattern'])

    bad_spans = []

    def get_modified_stream():
        nonlocal bad_spans  # want to update this outside of the function
        j = 0
        for eg in original_stream:
            # import ipdb; ipdb.set_trace()
            is_rejected = False
            for span in eg['spans']:
                doc = nlp(span['text'])
                matches = matcher(doc)

                # has to have appropriate label and not be a match in order to reject
                if span['label'] == label and matches == []:
                    eg['answer'] = 'reject'  # auto-reject
                    is_rejected = True
                    if j % 10 == 0:
                        print('rejected', str(j), ' spans that did not match the pattern so far')
                    j += 1

            if is_rejected:
                bad_spans.append(eg)
                continue
            else:
                yield eg

    def modified_update(batch):
        nonlocal bad_spans
        batch = batch + bad_spans

        # update db with rejects
        update_db(bad_spans)
        # reset rejects
        bad_spans = []
        return original_update(batch)

    def update_db(bad_spans):
        db = connect()
        # data = db.get_dataset(dataset)
        db.add_examples(bad_spans, datasets=[dataset])
        print('added ', len(bad_spans), ' to db')

    components['stream'] = get_modified_stream()
    components['update'] = modified_update
    components['config']['label'] = label  # hack to fix incorrect labeling of label

    return components

Usage looks like:

prodigy must_match_pattern.ner.teach [db-name] [model path]  [data source path] --label [label name] --patterns [path to patternsfile] -F special_filter.py

I am getting a 30x speed up in terms of annotation rate - you guys should add something like this for the next version. Maybe just add a must_match_pattern flag to teach.

Also, sorry I can't get the code to look pretty - maybe you can fix it?

Can't see why that isn't updating the DB either! Will have a look, thanks for updating.

Glad it's working! We're doing a lot of work on spaCy's Matcher for v2.1 -- @ines has a good summary of the new features in spaCy issue #1971, "Better, faster and more customisable matcher" (explosion/spaCy on GitHub). I think having a variety of ways to bootstrap the annotation with rules is a strong strategy. We're thinking of ways to add data augmentation features as well.

Done. You need to wrap the code block in "backticks", like this:

```python

`` `

(Except I don't know how to escape them...So I added a space in the closing three.)

bad_zips is declared as the nonlocal variable, but bad_spans is the one that gets updated. Could that be why?

@SandeepNaidu That’s definitely a bug, yes — but that should mean the thing would crash. This suggests it never executes that line. Hmm.

Anyway I’ll fix that line, thanks.