Match patterns without creating huge files

iero · March 21, 2019, 4:22pm

Hello there

I would like to recognize products such as “iPhone”. To do that, I want the model to catch the following variants during the ner.teach process:

iPhone
Apple iPhone
Apple iPhone X
iPhone X
iPhone X Max
iPhone 11
iPhone Xs
…

I have these patterns, but if I want to cover every variant, I will have to build a huge file.

{"label": "PRODUCT", "pattern": [{"ORTH": "Apple"}, {"ORTH": "iPhone"}]}
{"label": "PRODUCT", "pattern": [{"ORTH": "Apple"}, {"ORTH": "iPhone"}, {"ORTH": "X"}]}

I want to catch “Apple iPhone X” (with the ‘Apple’) or “iPhone X” or “iphone X”…
How can I create a more generic rule using this “pattern notation” ?

The only way I can think of is to define those terms as a regexp and generate a multi-line patterns file from that, but I’m sure it’s not the best way to do it.

Any tip ?

Thanks a lot

ines · March 21, 2019, 5:41pm

Yes, the pattern files support the same syntax as spaCy’s rule-based Matcher, so you can definitely write “smarter” token patterns. For example, here’s a pattern that matches the case-insensitive tokens “apple” and “iphone”, and an optional number token like “11”:

[{"LOWER": "apple"}, {"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]

There’s a lot more you can do with token attributes here. If you’re starting off with a pre-trained model, you could also use part-of-speech tags. For example, only match “love” if it’s used as a verb and not as a noun.
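As a rough sketch of how the iPhone pattern above could end up as one line of a patterns file: the file holds one JSON object per line, so Python’s True has to become true once it is written out. The snippet only uses the standard library. The label PRODUCT comes from the original question; the helper name to_jsonl is made up for illustration.

```python
import json

# The token pattern from the post above, as Python data.
# "OP": "?" marks the number token as optional.
patterns = [
    {
        "label": "PRODUCT",
        "pattern": [
            {"LOWER": "apple"},
            {"LOWER": "iphone"},
            {"IS_DIGIT": True, "OP": "?"},
        ],
    },
]


def to_jsonl(entries):
    """Serialize one JSON object per line (hypothetical helper)."""
    return "\n".join(json.dumps(entry) for entry in entries)


jsonl_text = to_jsonl(patterns)
print(jsonl_text)
```

Serializing with json.dumps avoids hand-editing mistakes such as leaving a Python-style True in the file, which is not valid JSON.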

Important note: The rule-based matching docs also describe some new features like the extended pattern syntax that are only available in spaCy v2.1. Those are also marked with a little “2.1” tag. You’ll be able to use those once the new version of Prodigy for spaCy v2.1 is available – see here for details.

iero · March 21, 2019, 6:18pm

Thanks Ines, I think that the “OP” was what I was looking for :

pattern = [{"ORTH": "Apple", "OP": "?"}, {"LOWER": "iphone"}, {"LOWER": "x", "OP": "?"}, {"IS_DIGIT": True, "OP": "?"}]

With the sentence “With 1 bitcoin, I bought an Apple iPhone X, an Apple iphone 11 and a iPhone 8”, I got:

7 9 Apple iPhone
8 9 iPhone
7 10 Apple iPhone X
8 10 iPhone X
12 15 Apple iphone 11
13 15 iphone 11
17 19 iPhone 8
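To see why a single line with "OP": "?" can stand in for a long list of explicit patterns, the optional tokens can be expanded by hand. This is plain Python, not spaCy, and only a sketch: expand_optional is a hypothetical helper that understands the "?" operator and treats every other token as required.

```python
from itertools import product

# The pattern from the post above: three of the four tokens are optional.
pattern = [
    {"ORTH": "Apple", "OP": "?"},
    {"LOWER": "iphone"},
    {"LOWER": "x", "OP": "?"},
    {"IS_DIGIT": True, "OP": "?"},
]


def expand_optional(pattern):
    """Return every explicit pattern that the "?" operators stand for.

    Hypothetical helper for illustration only. Tokens with any other
    operator are kept as plain required tokens.
    """
    choices = []
    for token in pattern:
        required = {k: v for k, v in token.items() if k != "OP"}
        if token.get("OP") == "?":
            choices.append([None, required])  # token absent or present
        else:
            choices.append([required])  # token must be present
    return [
        [tok for tok in combo if tok is not None]
        for combo in product(*choices)
    ]


explicit = expand_optional(pattern)
print(len(explicit))  # 8 explicit patterns for one compact line
for variant in explicit:
    print(variant)
```

Each optional token doubles the number of explicit variants, so three "?" operators cover eight patterns that would otherwise each need their own line in the file.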

ines · March 21, 2019, 6:19pm

Nice!

Btw, when using the operators like "?" and "*" in more complex cases, it’s possible that you come across some inconsistencies. This will all be fixed in spaCy v2.1.

iero · March 21, 2019, 6:27pm

Great, thanks, I’m good to go !

$ python -m spacy info
spaCy version 2.1.1

ines · March 21, 2019, 6:32pm

Just keep in mind that the current version of Prodigy isn't compatible with spaCy v2.1 yet – we want to make sure the new version is 100% stable and tested by everyone before we ask Prodigy users to retrain their models. See here for details:
