As a first attempt at working with patterns I am trying to implement a simple search for zipcode and am having trouble with the pattern.
I am putting the patterns in a jsonl file and running a command like the following:
The keys should be case insensitive, so either "shape" or "SHAPE" is fine. If you’re assembling the dict in Python, you might also find it convenient to import the numeric ID from spacy.symbols.
The easiest way to write the pattern would be to take an example of the text you want to match, and make a doc object, ideally in the interpreter (or a Jupyter notebook). Then you can find the values of the attributes, e.g. the shape_ attribute. You could also specify IS_DIGIT: True and LENGTH: 5 if you want.
We don’t currently support regular expressions in the match rules. The main alternative is to define a new binary flag, which you add with nlp.vocab.add_flag(). The flag function should take a string as its argument, and return a boolean value. This way, you can indicate whether the token’s text matches a regular expression.
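To make the flag idea concrete, here is a minimal sketch of such a predicate. The `is_zipcode` name and the regex are my own illustration, not part of spaCy's API; only the `nlp.vocab.add_flag()` registration mentioned above comes from spaCy:

```python
import re

def is_zipcode(text):
    # Flag function: takes the token's string, returns a boolean.
    # You would register it with something like:
    #   IS_ZIPCODE = nlp.vocab.add_flag(is_zipcode)
    # and then use the returned ID as the key in a Matcher pattern.
    return bool(re.fullmatch(r"\d{5}", text))
```

The function must accept a single string and return a bool; spaCy calls it once per lexeme and caches the result.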
This approach isn’t so convenient if you’re passing in a patterns.jsonl file and using the built-in recipes. You have to add the flag at the start of the recipe, and then this will give you the numeric ID of the flag which you can insert into your patterns.
From your comments I did not understand what was wrong with the `{"pattern": [{"SHAPE": "ddddd"}], "label": "ZIPCODE"}` pattern. I would like to understand so I can use shape for more complicated cases.
I ended up using `{"pattern": [{"IS_DIGIT": true, "LENGTH": 5}], "label": "ZIPCODE"}`
`[{"IS_DIGIT": true}, {"LENGTH": 5}]` returned a digit followed by a string of length 5.
When doing the following:
```python
nlp = spacy.load('en')
doc = nlp(u'12345')
```
and tab-completing on `doc` ----- I looked through most of the attributes but could not find any nice ones of the type you were describing.
You want {"IS_DIGIT": true, "LENGTH": 5} -- you're applying two predicates to specify one token, so they go in the same dict. If you provide them in two dicts, you're specifying two tokens.
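To make the one-token vs. two-token distinction concrete, here are the two patterns side by side as plain dicts (before they go into the matcher); each dict in the list describes exactly one token:

```python
# One token that is BOTH a digit string AND of length 5 (e.g. "12345"):
one_token = [{"IS_DIGIT": True, "LENGTH": 5}]

# Two adjacent tokens: any digit token, then any token of length 5:
two_tokens = [{"IS_DIGIT": True}, {"LENGTH": 5}]

# The number of dicts is the number of tokens the pattern spans.
assert len(one_token) == 1
assert len(two_tokens) == 2
```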
So it's possible you had a similar problem before?
I might be answering the wrong question here, but possibly you want to get a token object with token = doc[0] and tab complete through that?
I'm a bit nervous that the tab completion might miss things, however --- because spaCy is written in Cython, sometimes the automatic code inspection fails. I also never use tab completion, so I don't see problems as they occur.
Will this serve as a filtering step for the model and/or the annotations?
When predicting if a given string is a certain entity I would like for both the model AND the annotation procedure to prefilter on the pattern.
For the zipcode example, I would like to annotate only strings that are 5 digits long. I would like the model to never predict that a string is a zipcode unless it is a 5 digit string (I have other 5 digit strings that are not zipcodes in my corpus).
The basic idea is you’re just going to write a recipe function that calls into the ner.teach recipe, and gets the components it returns. This allows us to intercept the tasks before they’re passed to the REST API, to prevent questions from being asked if we can figure out the answer in some easy way. Additionally, we can auto-reject the dropped tasks, so that they’re stored in the dataset and used to update the model.
I think this approach should give you what you want — it’s basically auto-answering the questions.
```python
import prodigy
from prodigy.recipes.ner import teach

@prodigy.recipe('custom.ner.teach', **teach.__annotations__)
def custom_ner_teach(dataset, spacy_model, source=None, api=None, loader=None,
                     label=None, patterns=None, exclude=None):
    """Custom wrapper for ner.teach recipe that replaces the stream."""
    components = teach(**locals())
    original_stream = components['stream']
    original_update = components['update']
    bad_spans = []

    def get_modified_stream():
        nonlocal bad_spans
        for eg in original_stream:
            for span in eg['spans']:
                if span['label'] == 'ZIP' and len(span['text']) != 5:
                    eg['answer'] = 'reject'  # auto-reject the bad suggestion
                    bad_spans.append(eg)
                    break
            else:
                yield eg

    def modified_update(batch):
        nonlocal bad_spans
        # Fold the auto-rejected examples into the batch so the model
        # is updated with them as negative examples, then reset.
        batch = batch + bad_spans
        bad_spans = []
        return original_update(batch)

    components['stream'] = get_modified_stream()
    components['update'] = modified_update
    return components
```
This is part of the definition of the SHAPE feature: to make word shapes less sparse, contiguous sequences are clipped at 4. Otherwise the shape would be sensitive to the exact length of the word.
The word shape feature was originally created for NER using linear models. The definition is a bit arbitrary --- it's just what people have found to work well.
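This clipping is also why the `{"SHAPE": "ddddd"}` pattern above never matched: the shape of "12345" comes out as "dddd". Here is a rough sketch of the clipping logic (the `word_shape` function is my own simplification for illustration, not spaCy's actual implementation):

```python
def word_shape(text, max_run=4):
    # Map each character to a shape class: d = digit, x = lower, X = upper;
    # anything else keeps its own character.
    def char_class(c):
        if c.isdigit():
            return "d"
        if c.isalpha():
            return "X" if c.isupper() else "x"
        return c

    shape = []
    run = 0
    prev = None
    for c in text:
        cls = char_class(c)
        run = run + 1 if cls == prev else 1
        prev = cls
        # Clip contiguous runs of the same class at max_run characters,
        # so the shape is not sensitive to the exact length of the word.
        if run <= max_run:
            shape.append(cls)
    return "".join(shape)
```

So a 5-digit and a 9-digit number both get the shape "dddd", which is why `IS_DIGIT` plus `LENGTH` is the right way to pin down exactly five digits.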
This is great and gives me a good idea of how to clean the model's predicted entities after it has run, but any idea how to restrict the model to predicting only when a pattern is matched?
I would think that with your methodology, because you are not rejecting numbers of an inappropriate length, the model will be more likely to give bad predicted zipcode labels.
There will be a big difference in model performance between doing the pattern matching before training and doing it during training. For instance, if you are training on multiple labels, an incorrect zipcode label will affect the predicted labels of the surrounding tokens.
Well, we're actually automatically labelling the examples and passing them into the update() function. So, we're training the model not to make those predictions, just as if you'd clicked "reject" on them. If the model continually predicts ZIP when the length is only 3, it'll keep getting those as negative examples, forcing it to learn not to do that.
As written, it is not updating the DB, even when saving from the interface or after several batches have been processed. I do not know why. I ended up updating the DB in the custom teach recipe myself (and also made it work with arbitrary pattern files).
```python
import json

import prodigy
from prodigy.recipes.ner import teach
from prodigy.components.db import connect
import spacy
from spacy.matcher import Matcher

@prodigy.recipe('must_match_pattern.ner.teach',
                dataset=prodigy.recipe_args['dataset'],
                spacy_model=prodigy.recipe_args['spacy_model'],
                database=("Database to connect to", "positional", None, str),
                patterns=prodigy.recipe_args['patterns'],
                label=prodigy.recipe_args['label'])
def custom_ner_teach(dataset, spacy_model, database, patterns, label):
    """Custom wrapper for ner.teach recipe that replaces the stream.

    Automatically rejects a suggested annotation if it does not
    match the patterns file.
    """
    components = teach(dataset=dataset, spacy_model=spacy_model,
                       source=database, label=label, patterns=patterns)
    original_stream = components['stream']
    original_update = components['update']

    # Read the patterns file and add each pattern to a spaCy Matcher
    nlp = spacy.load('en')
    matcher = Matcher(nlp.vocab)
    with open(str(patterns), "r") as f:
        for line in f:
            matcher.add(label, None, json.loads(line)['pattern'])

    bad_spans = []

    def get_modified_stream():
        nonlocal bad_spans  # want to update this outside of the function
        j = 0
        for eg in original_stream:
            is_rejected = False
            for span in eg['spans']:
                doc = nlp(span['text'])
                matches = matcher(doc)
                # must have the right label and no pattern match to be rejected
                if span['label'] == label and matches == []:
                    eg['answer'] = 'reject'  # auto-reject
                    is_rejected = True
                    if j % 10 == 0:
                        print('rejected', str(j), 'spans that did not match the pattern so far')
                    j += 1
            if is_rejected:
                bad_spans.append(eg)
            else:
                yield eg

    def modified_update(batch):
        nonlocal bad_spans
        batch = batch + bad_spans
        update_db(bad_spans)  # update the db with the auto-rejects
        bad_spans = []        # reset the rejects
        return original_update(batch)

    def update_db(bad_spans):
        db = connect()
        db.add_examples(bad_spans, datasets=[dataset])
        print('added', len(bad_spans), 'examples to the db')

    components['stream'] = get_modified_stream()
    components['update'] = modified_update
    components['config']['label'] = label  # hack to fix incorrect labeling of label
    return components
```
I am getting a 30x speed up in terms of annotation rate - you guys should add something like this for the next version. Maybe just add a must_match_pattern flag to teach.
Also, sorry I can't get the code to look pretty - maybe you can fix it?
Can't see why that isn't updating the DB either! Will have a look, thanks for updating.
Glad it's working! We're doing a lot of work on spaCy's Matcher for v2.1 -- @ines has a good summary of the new features: 💫 Better, faster and more customisable matcher · Issue #1971 · explosion/spaCy · GitHub . I think having a variety of ways to bootstrap the annotation with rules is a strong strategy. We're thinking of ways to add data augmentation features as well.
Done. You need to wrap the code block in triple backticks, like this:
```python
`` `
(Except I don't know how to escape them... so I added a space in the closing three.)