I tried to use the terms.to-patterns recipe to generate a patterns.jsonl file, but it looks like the recipe doesn't generate tokenised patterns – it keeps the strings as they are.
This isn't very useful in my case, because my intention is to programmatically add many multi-token strings and generate a patterns file without worrying about different use cases. So is there an easy way to generate something like this using terms.to-patterns:
At the moment it doesn't tokenize patterns, no – it also doesn't take a model, just a dataset. The main reason is that the recipe was originally developed to go with the terms.teach recipe, which iterates over the model's vocab and vectors, so we can assume the entries are single tokens.
If you do want to create your patterns programmatically, you could just write a little script that does this:
import spacy

nlp = spacy.blank("en")  # or whichever language
terms = ["Python 3", "Python 2"]
patterns = []
for doc in nlp.pipe(terms):
    # one token attribute dict per token, e.g. [{"LOWER": "python"}, {"LOWER": "3"}]
    pattern = [{"LOWER": token.lower_} for token in doc]
    patterns.append(pattern)
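To round this off, the resulting patterns can then be written out in the JSONL format a patterns file uses – one {"label": ..., "pattern": [...]} entry per line. Here's a minimal sketch with the standard library's json module; the "VERSION" label and the hard-coded patterns are just placeholders for whatever your script produces:

```python
import json

# Placeholder token patterns – in practice, these would come from the
# tokenization loop above.
patterns = [
    [{"LOWER": "python"}, {"LOWER": "3"}],
    [{"LOWER": "python"}, {"LOWER": "2"}],
]

with open("patterns.jsonl", "w", encoding="utf8") as f:
    for pattern in patterns:
        # "VERSION" is an example label – use whatever fits your data
        entry = {"label": "VERSION", "pattern": pattern}
        f.write(json.dumps(entry) + "\n")
```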
Thanks @ines for the clarification. I'll go with the script approach. I understand that the intention was to complement terms.teach, and it totally makes sense now. However, I still think it would be great to have a built-in recipe that does the tokenization, or a parameter that adds this functionality to the current recipe. I quite like your recipe approach and ecosystem – it's super intuitive, nice and tidy. Keep up the great work. Thanks!