I tried to use the terms.to-patterns recipe to generate a patterns.jsonl file, but it looks like the recipe doesn't generate tokenised patterns – it keeps the strings as they are.
This isn't very useful in my case, because my intention is to programmatically add many multi-token strings and generate a patterns file without worrying about different use cases. So is there an easy way to generate something like this using terms.to-patterns:
At the moment it doesn't tokenize patterns, no – it also doesn't take a model, just a dataset. The main reason is that the recipe was originally developed to go with the terms.teach recipe, which iterates over the model's vocab and vectors, so we can assume the entries are single tokens.
If you do want to create your patterns programmatically, you could just write a little script that does this:
import spacy

nlp = spacy.blank("en")  # or whichever language
terms = ["Python 3", "Python 2"]
patterns = []
for doc in nlp.pipe(terms):
    # one token attribute dict per token, e.g. [{"LOWER": "python"}, {"LOWER": "3"}]
    pattern = [{"LOWER": token.lower_} for token in doc]
    patterns.append(pattern)
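To round this off, the resulting patterns can then be written out in the JSONL format a patterns file uses – one {"label": ..., "pattern": [...]} entry per line. Here's a minimal sketch with the standard library's json module; the "VERSION" label and the hard-coded patterns are just placeholders for whatever your script produces:

```python
import json

# Placeholder token patterns – in practice, these would come from the
# tokenization loop above.
patterns = [
    [{"LOWER": "python"}, {"LOWER": "3"}],
    [{"LOWER": "python"}, {"LOWER": "2"}],
]

with open("patterns.jsonl", "w", encoding="utf8") as f:
    for pattern in patterns:
        # "VERSION" is an example label – use whatever fits your data
        entry = {"label": "VERSION", "pattern": pattern}
        f.write(json.dumps(entry) + "\n")
```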
Thanks @ines for the clarification. I'll go with the script approach. I understand that the intention was to complement terms.teach, and it totally makes sense now. However, I still think it would be great to have a built-in recipe that does the tokenization, or a parameter that adds this functionality to the current recipe. I quite like your recipe approach and ecosystem – it's super intuitive, nice and tidy. Keep up the great work. Thanks!