Match patterns without creating huge files

Hello there

I would like to extract products such as “iPhone”. To do that, I want the model to catch the following during the ner.teach process:

  • iPhone
  • Apple iPhone
  • Apple iPhone X
  • iPhone X
  • iPhone X Max
  • iPhone 11
  • iPhone Xs

I already have the patterns below, but if I try to list every variant, I will end up with a huge file.

{"label": "PRODUCT", "pattern": [{"ORTH": "Apple"}, {"ORTH": "iPhone"}]}
{"label": "PRODUCT", "pattern": [{"ORTH": "Apple"}, {"ORTH": "iPhone"}, {"ORTH": "X"}]}

I want to catch “Apple iPhone X” (with the ‘Apple’) or “iPhone X” or “iphone X”…
How can I create a more generic rule using this “pattern notation”?

The only way I can think of is to define those terms as a regexp and generate a multi-line patterns file from that, but I’m sure it’s not the best way to do it.

Any tips?

Thanks a lot

Yes, the pattern files support the same syntax as spaCy’s rule-based Matcher, so you can definitely write “smarter” token patterns. For example, here’s a pattern that matches the case-insensitive tokens “apple” and “iphone”, and an optional number token like “11”:

[{"LOWER": "apple"}, {"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]

There’s a lot more you can do with token attributes here. If you’re starting off with a pre-trained model, you could also use part-of-speech tags. For example, only match “love” if it’s used as a verb and not as a noun.
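One detail worth spelling out: the pattern above is written Python-style (`True`), while a patterns file is JSONL, where booleans are lowercase `true`. Here is a minimal sketch of producing the file line with the standard `json` module. The "PRODUCT" label is taken from the question; the `to_jsonl` helper and the file name in the comment are made up for illustration.

```python
import json

# Token pattern from the reply above, wrapped in the
# {"label": ..., "pattern": ...} shape used in the question.
patterns = [
    {
        "label": "PRODUCT",
        "pattern": [
            {"LOWER": "apple"},
            {"LOWER": "iphone"},
            {"IS_DIGIT": True, "OP": "?"},
        ],
    },
]


def to_jsonl(entries):
    """Serialize each entry as one JSON object per line (JSONL)."""
    return "\n".join(json.dumps(entry) for entry in entries)


jsonl_text = to_jsonl(patterns)
print(jsonl_text)

# Hypothetical file name; use whatever path you pass to the recipe.
# with open("product_patterns.jsonl", "w", encoding="utf8") as f:
#     f.write(jsonl_text + "\n")
```

Generating the file this way keeps the patterns editable as normal Python data and avoids hand-written JSON mistakes such as `True` instead of `true`.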

Important note: The rule-based matching docs also describe some new features like the extended pattern syntax that are only available in spaCy v2.1. Those are also marked with a little “2.1” tag. You’ll be able to use those once the new version of Prodigy for spaCy v2.1 is available – see here for details.


Thanks Ines, I think “OP” was what I was looking for:

pattern = [{"ORTH": "Apple", "OP": "?"}, {"LOWER": "iphone"}, {"LOWER": "x", "OP": "?"}, {"IS_DIGIT": True, "OP": "?"}]

With the text “With 1 bitcoin, I bought an Apple iPhone X, an Apple iphone 11 and a iPhone 8”, I got:

7 9 Apple iPhone
8 9 iPhone
7 10 Apple iPhone X
8 10 iPhone X
12 15 Apple iphone 11
13 15 iphone 11
17 19 iPhone 8
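Note that the output contains overlapping matches: with optional tokens, shorter sub-spans such as “iPhone” are reported alongside “Apple iPhone X”. If only one span per mention is wanted, one option is to filter the results afterwards. The sketch below is plain Python over the (start, end, text) values printed above; `keep_longest` is a hypothetical helper, not part of spaCy or Prodigy.

```python
# (start, end, text) triples copied from the output above.
# start/end are token offsets, with end exclusive.
matches = [
    (7, 9, "Apple iPhone"),
    (8, 9, "iPhone"),
    (7, 10, "Apple iPhone X"),
    (8, 10, "iPhone X"),
    (12, 15, "Apple iphone 11"),
    (13, 15, "iphone 11"),
    (17, 19, "iPhone 8"),
]


def keep_longest(spans):
    """Greedily keep the longest spans, dropping any that overlap a kept one."""
    # Longest first; ties are broken by the earliest start offset.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    taken = set()  # token indices already covered by a kept span
    for start, end, text in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end, text))
            taken.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept)


longest = keep_longest(matches)
for start, end, text in longest:
    print(start, end, text)
```

For the sample sentence this leaves one span per product mention: “Apple iPhone X”, “Apple iphone 11” and “iPhone 8”.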

Nice!

Btw, when using the operators like "?" and "*" in more complex cases, it’s possible that you come across some inconsistencies. This will all be fixed in spaCy v2.1.


Great, thanks, I’m good to go!

$ python -m spacy info
spaCy version 2.1.1

Just keep in mind that the current version of Prodigy isn't compatible with spaCy v2.1 yet – we want to make sure the new version is 100% stable and tested by everyone before we ask Prodigy users to retrain their models. See here for details:
