I’m having difficulty using a custom pipeline component with Prodigy. I have a simple pass that merges some tokens:
from spacy.matcher import Matcher


class TokenizerCleanup(object):
    def __init__(self, nlp, **cfg):
        self.nlp = nlp
        self.cve_matcher = Matcher(nlp.vocab)
        # Match e.g. "CVE-2010" "-" "1234", tolerating whitespace around the hyphen
        cve_pattern = [
            {"TEXT": {"REGEX": r"^CVE-\d{3,4}$"}},
            {"IS_SPACE": True, "OP": "?"},
            {"ORTH": "-"},
            {"IS_SPACE": True, "OP": "?"},
            {"IS_DIGIT": True},
        ]
        self.cve_matcher.add("cve", None, cve_pattern)

    def __call__(self, doc):
        matches = self.cve_matcher(doc)
        print("tokenizer cleanup")
        print(f"{len(matches)} matches, len={len(list(doc))}")
        # Merge each matched span back into a single token
        with doc.retokenize() as retokenizer:
            for match_id, start, end in matches:
                print(f"{doc[start:end]}")
                retokenizer.merge(doc[start:end])
        print(f"len={len(list(doc))}")
        return doc
The component is loaded from an entry point, following https://spacy.io/usage/saving-loading#entry-points-components.
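For context, the entry point is registered roughly like this in the package’s setup.py (a minimal sketch; the package and module names here are placeholders, not the real ones):

from setuptools import setup

setup(
    name="tokenizer-cleanup",  # placeholder package name
    py_modules=["tokenizer_cleanup_module"],  # placeholder module containing TokenizerCleanup
    entry_points={
        # Exposes the factory under the name "tokenizer_cleanup" for create_pipe()
        "spacy_factories": [
            "tokenizer_cleanup = tokenizer_cleanup_module:TokenizerCleanup"
        ]
    },
)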
I can save it to disk and run it in a Jupyter notebook:
import spacy

ensm = spacy.load('en_core_web_sm')
retokenizer = ensm.create_pipe('tokenizer_cleanup')
ensm.add_pipe(retokenizer, name="tokenizer_cleanup", first=True)
ensm.to_disk('/tmp/en_core_web_sm_tok')
test = spacy.load('/tmp/en_core_web_sm_tok')
[t for t in test('CVE-2010-1234')]
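As a sanity check that the component survives the round trip, the loaded pipeline can also be inspected (a quick sketch; the expected order assumes the standard en_core_web_sm components):

print(test.pipe_names)
# should print something like: ['tokenizer_cleanup', 'tagger', 'parser', 'ner']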
This prints:
tokenizer cleanup
1 matches, len=3
CVE-2010-1234
len=1
[CVE-2010-1234]
When I run

prodigy ner.make-gold testds /tmp/en_core_web_sm_tok/ /tmp/file.jsonl -U

with the input file

{"text": "CVE-2010-1234"}
the same output as above is printed, but the UI doesn’t reflect the new tokenization: I can still select spans over the original three tokens instead of the single merged token returned by the cleanup call. Did I miss a step somewhere?