How to easily convert some patterns into terms for classification

Hi, I am trying to teach some complex terms for classification. I prepared some phrases that can appear in a given class (not just single words!), for example "I can't download". I then converted them into terms, which gave me this spaCy-style pattern: {"label":"SOME_CLASS","pattern":[{"lower":"I can't download"}]}. I have used pattern matching in spaCy many times, and from that experience I assume that such a pattern will never match anything unless I create a custom tokenizer that turns the whole phrase into a single token.
I need at least something like this: {"label":"SOME_CLASS","pattern":[{"lower":"I"}, {"lower": "can't"}, {"lower":"download"}]}. Is this supported and I just didn't find it?

At the moment, the terms.to-patterns recipe doesn't tokenize (although it will in the next version). But creating those patterns shouldn't be very difficult – all you have to do is tokenize the text:

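import spacy

phrases = ["I can't download"]
nlp = spacy.blank("en")  # blank pipeline: tokenizer only, no trained model needed
patterns = []
for doc in nlp.pipe(phrases):
    # one {"lower": ...} dict per token, so matching is case-insensitive
    pattern = [{"lower": token.lower_} for token in doc]
    patterns.append({"label": "SOME_CLASS", "pattern": pattern})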
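
To sanity-check the result, here is a minimal sketch that rebuilds the pattern the same way as the snippet above, runs it through spaCy's Matcher, and prints it as one JSON object per line. It assumes spaCy v3 (where matcher.add takes a list of patterns); the test sentence is made up for illustration. It should also show why the hand-written pattern from the question would not match as written: the English tokenizer splits "can't" into "ca" and "n't", and a "lower" value has to be lowercase, so {"lower": "I"} never matches.

```python
import json

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only, no trained model required

# Rebuild the token-level pattern the same way as the snippet above.
phrases = ["I can't download"]
patterns = [
    {"label": "SOME_CLASS", "pattern": [{"lower": token.lower_} for token in doc]}
    for doc in nlp.pipe(phrases)
]

# Inspect the tokens: "can't" is split by the English tokenizer.
print([token_spec["lower"] for token_spec in patterns[0]["pattern"]])

# Register each pattern under its label (spaCy v3 signature assumed).
matcher = Matcher(nlp.vocab)
for entry in patterns:
    matcher.add(entry["label"], [entry["pattern"]])

# Hypothetical test sentence with different casing than the phrase.
doc = nlp("Since the update I can't Download anything.")
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)

# Serialize as JSONL: one JSON object per line.
jsonl = "\n".join(json.dumps(entry) for entry in patterns)
print(jsonl)
```

The printed JSONL lines can be saved to a file and used as a patterns file.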

Edit: This has now shipped in v1.9. You can set a --spacy-model argument on terms.to-patterns, which is either the name of a model or a value like blank:en (to use just a blank language tokenizer).
