Does terms.to-patterns tokenise patterns?

I tried to use the recipe to generate a patterns.jsonl file, but it looks like the recipe does not generate tokenised patterns – it keeps the strings as they are.

Example usage is:

prodigy terms.to-patterns programming_langs patterns.jsonl --label PROG_LANG

and sample output looks like this:

{"label":"PROG_LANG","pattern":[{"lower":"Python 3"}]}

Unfortunately the above pattern does not match "Python 3" in the text. I'm aware that I can manually add a pattern like this:

{"label": "LANGS", "pattern": [{"LOWER": "python"}, {"LIKE_NUM": true, "OP": "?"}]}

But this is not very useful in this particular case, as my intention is to programmatically add many multi-token strings and generate a patterns file without worrying about the different cases. So is there an easy way to generate something like this:

{"label": "LANGS", "pattern": [{"LOWER": "python"}, {"LOWER": "3"}]}

The reason is to leverage the recipe to automate part of the pattern generation process.

Thanks in advance.

At the moment it doesn't tokenize patterns, no – it also doesn't take any model, just a dataset. The main reason is that the recipe was originally developed to complement the terms.teach recipe, which iterates over the model's vocab and vectors, so we can assume the terms are single tokens.
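To see why the single-token pattern in the question fails, it helps to check how spaCy tokenises the string: "Python 3" comes out as two tokens, so a matching pattern needs one dict per token. A minimal sketch:

```python
import spacy

# Blank English pipeline – only the tokenizer is used here
nlp = spacy.blank("en")
doc = nlp("Python 3")
print([token.text for token in doc])  # → ['Python', '3']
```

A pattern like `{"lower": "Python 3"}` can never match, because no single token has that text.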

If you do want to create your patterns programmatically, you could just write a little script that does this:

import spacy

nlp = spacy.blank("en")  # or whichever language
terms = ["Python 3", "Python 2"]
patterns = []
for doc in nlp.pipe(terms):
    # One dict per token, so multi-token terms become multi-token patterns
    pattern = [{"LOWER": token.lower_} for token in doc]
    patterns.append({"label": "PROG_LANG", "pattern": pattern})
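To get from there to a file Prodigy can load, you could serialise each entry as one JSON object per line. A minimal sketch – the patterns list here is a stand-in for what the loop above produces, and the label name is just an example:

```python
import json

# Stand-in for the patterns built from the tokenised terms above
patterns = [
    {"label": "PROG_LANG", "pattern": [{"LOWER": "python"}, {"LOWER": "3"}]},
    {"label": "PROG_LANG", "pattern": [{"LOWER": "python"}, {"LOWER": "2"}]},
]

# JSONL: one JSON object per line
with open("patterns.jsonl", "w", encoding="utf8") as f:
    for entry in patterns:
        f.write(json.dumps(entry) + "\n")
```

If you have spaCy installed you should also have srsly available, in which case `srsly.write_jsonl("patterns.jsonl", patterns)` does the same thing in one call.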

Thanks @ines for the clarification. I'll go with the script approach. I understand that the intention was to complement terms.teach, and it totally makes sense now. However, I still think it would be great to have a built-in recipe that does the tokenisation, or a parameter that adds this functionality to the current recipe. I quite like your recipe approach and ecosystem – it's super intuitive, nice and tidy. Keep up the great work. Thanks!
