prodigy pipeline usage

I’m having difficulty using a custom pipeline with prodigy. I have a simple pass to merge some tokens:

import spacy

class TokenizerCleanup(object):
    def __init__(self, nlp, **cfg):
        self.nlp = nlp
        self.cve_matcher = spacy.matcher.Matcher(nlp.vocab)
        cve_pattern = [
            {"TEXT": {"REGEX": r"^CVE-\d{3,4}$"}},
            {"IS_SPACE": True, "OP": "?"},
            {"ORTH": "-"},
            {"IS_SPACE": True, "OP": "?"},
            {"IS_DIGIT": True},
        ]
        self.cve_matcher.add("cve", None, cve_pattern)

    def __call__(self, doc):
        matches = self.cve_matcher(doc)
        print("tokenizer cleanup")
        print(f"{len(matches)} matches, len={len(list(doc))}")

        with doc.retokenize() as retokenizer:
            for match_id, start, end in matches:
                # merge the matched span into a single token
                retokenizer.merge(doc[start:end])
        return doc
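As a side note, the per-token REGEX above can be sanity-checked with plain re outside spaCy (a quick illustrative check, not part of the component — the Matcher applies REGEX to each token's text individually, so the pattern only needs to match the "CVE-YYYY" prefix token):

```python
import re

# The per-token regex from the pattern above. It should match the
# prefix token on its own, but not the full string, which the default
# English tokenizer splits into three tokens (consistent with the
# "1 matches, len=3" output shown below).
cve_prefix = re.compile(r"^CVE-\d{3,4}$")

assert cve_prefix.match("CVE-2010")           # prefix token alone: matches
assert not cve_prefix.match("CVE-2010-1234")  # whole string: no match
```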

This is loaded from a spacy_factories entry point, following the spaCy docs on entry points.

I can save it to disk and run it in a jupyter notebook

ensm = spacy.load('en_core_web_sm')
retokenizer = ensm.create_pipe('tokenizer_cleanup')
ensm.add_pipe(retokenizer, name="tokenizer_cleanup", first=True)
ensm.to_disk('/tmp/en_core_web_sm_tok')

test = spacy.load('/tmp/en_core_web_sm_tok')
[t for t in test('CVE-2010-1234')]


tokenizer cleanup
1 matches, len=3

When I run prodigy ner.make-gold testds /tmp/en_core_web_sm_tok/ /tmp/file.jsonl -U with the input file

{"text": "CVE-2010-1234"}

the same output as above is printed, but the UI doesn’t reflect the new tokenization. I can select the spans for the original three tokens instead of the one token returned from the cleanup call. Did I miss a step somewhere?

Hi! The problem here is that your custom tokenizer isn't just implementing different regular expressions – it's also using custom code. When you serialize the nlp object, spaCy will serialize any custom regular expressions, which are JSON-serializable and can be stored in the msgpack format. But it won't dump (and later eval!) arbitrary code.

If this is what you want, you'd have to wrap your model as a Python package and then add your tokenization code in there – for example, by editing the package's __init__.py. The module's load() method needs to return an initialized nlp object, but before you return it, you can modify it and overwrite things like nlp.tokenizer.

You can then install your model package into the same environment as Prodigy, and you'll be able to use it as expected.

I’m a bit more confused now… Trying to follow the spacy 2.1 docs, where do I go astray with:

  1. The TokenizerCleanup object is installed in a package with a spacy_factories entry point in the conda environment, along with Prodigy and spaCy
  2. When I save a model with a modified pipeline, the list of component names is persisted in the meta.json
  3. When the model is loaded, the components are instantiated based on the names stored in meta.json. For my custom component, the entry point provides the package path used to create an instance of the component.
  4. If there was a problem with using the custom code, shouldn't there be some error at load or run time? The code seems to load and run fine in both spaCy and Prodigy, it's just that Prodigy doesn't seem to use its output.
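The entry-point discovery described in steps 1 and 3 can be inspected directly with the standard library (an illustrative sketch, not spaCy's internal code; the group name spacy_factories comes from spaCy's docs, and whether anything is listed depends on what's installed in your environment):

```python
import sys
from importlib.metadata import entry_points

# List all factories registered under the "spacy_factories" group by
# installed packages. If the custom component's package is installed,
# "tokenizer_cleanup" should show up here.
if sys.version_info >= (3, 10):
    factories = entry_points(group="spacy_factories")
else:
    factories = entry_points().get("spacy_factories", [])

names = sorted(ep.name for ep in factories)
print(names)
```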

I’m trying to add a pass to clean up tokenizer output rather than write a new tokenizer. Does this matter?

Also, how does prodigy.components.preprocess work? Does it use pipeline phases after the initial tokenization?

FWIW, the setup.py looks like:

from setuptools import setup, find_packages

setup(
    packages=find_packages(),
    entry_points={
        "spacy_factories": ["tokenizer_cleanup = myproj.tokenizer_cleanup:TokenizerCleanup"]
    },
)

Ahh, sorry, my bad – it was already late when I wrote that comment :sweat_smile: I think I slightly misread your initial post and for some reason assumed you’re also overwriting nlp.tokenizer with your own custom Tokenizer.

Okay, so it looks like you’ve been doing everything correctly, but as the examples flow through the recipe and get processed with the nlp object, there seems to be at least one point where your custom pipeline component doesn’t run.

I also think I might have found where this happens: the add_tokens preprocessor calls nlp.make_doc to create a Doc object from the raw text and assumes that this does everything required to turn a string into a list of tokens. This is typically true, and by default, nlp.make_doc runs the tokenizer – but in your case, you’re also adding more custom tokenization in a pipeline component in a later step.

Some things you could try:

  1. Write your own custom add_tokens wrapper that runs at least the tokenizer and your custom component. Here’s a little hack you can use:
from prodigy.components.preprocess import _add_tokens

def custom_add_tokens(nlp, stream, skip=False):
    for eg in stream:
        # If this is too slow, you can also disable more components
        doc = nlp(eg["text"])
        _add_tokens(eg, doc, skip)
        yield eg
  2. Put your “cleanup” merging logic in nlp.tokenizer or nlp.make_doc, so it’s okay again to assume that nlp.make_doc produces the final tokens. However, it’s a bit more work, since you do have to add this to your model package.
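For intuition, the kind of stream preprocessing option 1 relies on can be sketched without Prodigy at all (a toy version: str.split stands in for the tokenizer, and the field names follow Prodigy's token format):

```python
# A generator that walks the stream and attaches a "tokens" list to
# each example, in the shape the annotation UI consumes (text,
# character offsets, id). This is why the tokens the UI shows are
# whatever the preprocessor produced, not what a later pipeline
# component would have merged.
def add_tokens_sketch(tokenize, stream):
    for eg in stream:
        tokens, offset = [], 0
        for i, tok in enumerate(tokenize(eg["text"])):
            start = eg["text"].index(tok, offset)
            end = start + len(tok)
            tokens.append({"text": tok, "start": start, "end": end, "id": i})
            offset = end
        eg["tokens"] = tokens
        yield eg

examples = list(add_tokens_sketch(str.split, [{"text": "CVE-2010-1234 is bad"}]))
```

If the tokenizer passed in merges "CVE-2010-1234" into one unit, the UI gets one selectable token; if it doesn't, you get the three-token behaviour described above.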

Thanks for the explanation/advice! It was easy enough to add it to a model package.
