I have written a script that auto-rejects samples that don’t match a pattern from a pattern file during annotation. This works okay for some entities, such as zip codes (I have other 5-digit numbers that aren’t zip codes), but ultimately it relies on giving the model lots of rejected examples of the entity. For other entities the model will have a much harder time learning to reject a token that doesn’t match the pattern, for instance if the entity has to match one of a large list of values.
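A simplified sketch of the auto-reject idea, not the actual script: it assumes tasks are dicts with a `text` and a list of `spans` (character offsets plus a label), and it uses a plain regular expression as a stand-in for one entry of the pattern file. The names `auto_reject`, `ZIP_RE` and the `ZIP` label are hypothetical.

```python
import re

# Hypothetical stand-in for one entry of the pattern file: exactly five digits.
ZIP_RE = re.compile(r"\d{5}")


def auto_reject(task, label="ZIP", pattern=ZIP_RE):
    """Return the task marked as rejected if a span with `label` fails the pattern."""
    text = task["text"]
    for span in task.get("spans", []):
        if span["label"] != label:
            continue
        span_text = text[span["start"]:span["end"]]
        if not pattern.fullmatch(span_text):
            # Copy the task and attach the answer instead of mutating the input.
            return dict(task, answer="reject")
    return task


tasks = [
    {"text": "Ship to 90210 today",
     "spans": [{"start": 8, "end": 13, "label": "ZIP"}]},
    {"text": "Order 1234 shipped",
     "spans": [{"start": 6, "end": 10, "label": "ZIP"}]},
]
results = [auto_reject(t) for t in tasks]
```

Here the second task is rejected without a human decision, while the first is left for annotation, since a 5-digit number may or may not be a zip code.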
What would be much better is if the model knew whether or not a given span matches a pattern! To that end I have created a custom component that adds an attribute indicating whether the span matches the pattern.
First Question) Will the NER “know” about my custom attributes? i.e. will they be encoded in the tensor it receives as input?
Second Question) When loading the model using ner.batch-train I am getting the following error:
KeyError: "Can't find factory for 'pattern_detector'."
I read your comment in [Load error after adding custom textcat model to the pipeline], but I don’t understand where/how to add the factory.
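For context, the message reads like a lookup by component name that fails at load time. A toy sketch of that kind of lookup, in plain Python rather than spaCy's actual code (the `FACTORIES` dict, `create_pipe` and `load_pipeline` are hypothetical stand-ins):

```python
# Toy registry standing in for a name -> factory mapping.
FACTORIES = {
    "tagger": lambda: "tagger component",
    "ner": lambda: "ner component",
}


def create_pipe(name):
    """Build a component by name, failing if no factory is registered."""
    if name not in FACTORIES:
        raise KeyError("Can't find factory for '{}'.".format(name))
    return FACTORIES[name]()


def load_pipeline(pipe_names):
    """Recreate every component listed in the saved pipeline names."""
    return [create_pipe(name) for name in pipe_names]


# Names as they might be stored alongside a saved model.
saved_names = ["pattern_detector", "tagger", "ner"]

try:
    load_pipeline(saved_names)
    error = None
except KeyError as exc:
    error = exc.args[0]

# Registering the name in the same process before loading resolves the lookup.
FACTORIES["pattern_detector"] = lambda: "pattern_detector component"
components = load_pipeline(saved_names)
```

In this toy version the registration has to happen in the process that does the loading, which is the part I can't work out how to do for ner.batch-train.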
import json

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span, Token


class Pattern_Matcher(object):
    """Pipeline component that flags tokens matching the patterns for one entity."""

    def __init__(self, nlp, label):
        self.vocab = nlp.vocab
        self.entityname = label
        self.label = nlp.vocab.strings[self.entityname]
        self.matcher = Matcher(nlp.vocab)
        self.name = "pattern_detector"
        self.nlp = nlp
        self.fill_matcher_w_patterns()
        # Register the custom token attribute, e.g. token._.is_<label>
        Token.set_extension('is_' + self.entityname, default=False)

    def fill_matcher_w_patterns(self):
        pattern_path = '/data/prodigy/patterns/' + self.entityname + '_pattern.jsonl'
        with open(pattern_path, "r") as f:
            for line in f:
                # Each line is a JSON object with a 'label' and a 'pattern'
                entry = json.loads(line)
                self.matcher.add(entry['label'], None, entry['pattern'])
        print('done adding patterns to matcher')

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for _, start, end in matches:
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            # Flag every token inside the matched span
            for token in entity:
                token._.set('is_' + self.entityname, True)
        # Merge each matched span into a single token
        for span in spans:
            span.merge()
        return doc


def save_model(label):
    # label: entity name, also used to locate the pattern file
    nlp = spacy.load('en_core_web_lg')
    component = Pattern_Matcher(nlp, label)  # initialise component
    nlp.add_pipe(component, first=True)
    nlp.factories["pattern_detector"] = lambda nlp, **cfg: Pattern_Matcher(nlp, label, **cfg)
    print('Pipeline', nlp.pipe_names)
    nlp.to_disk('/data/prodigy/models/zip_me')