I want to catch “Apple iPhone X” (with the ‘Apple’) or “iPhone X” or “iphone X”…
How can I create a more generic rule using this “pattern notation”?
The only way I can think of is to define those terms as a regexp and generate a multi-line patterns file from that, but I’m sure that’s not the best way to do it.
Yes, the pattern files support the same syntax as spaCy’s rule-based Matcher, so you can definitely write “smarter” token patterns. For example, here’s a pattern that matches the case-insensitive tokens “apple” and “iphone”, and an optional number token like “11”:
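A minimal sketch of what such a patterns file could look like, built here with Python's standard `json` module. The `PRODUCT` label and the second pattern (optional "apple", then "iphone" and "x") are assumptions for illustration and not taken from the reply; the lowercase attribute keys follow the style used in Prodigy patterns files.

```python
import json

# Each entry becomes one line of the patterns file.
# "PRODUCT" is a placeholder label chosen for this sketch.
patterns = [
    # "apple" + "iphone" in any casing, optionally followed by a number token like "11"
    {"label": "PRODUCT", "pattern": [
        {"lower": "apple"}, {"lower": "iphone"}, {"is_digit": True, "op": "?"}]},
    # Assumed variant for the question above: optional "apple", then "iphone" and "x"
    {"label": "PRODUCT", "pattern": [
        {"lower": "apple", "op": "?"}, {"lower": "iphone"}, {"lower": "x"}]},
]

# Serialize as JSONL: one JSON object per line
patterns_jsonl = "\n".join(json.dumps(p) for p in patterns)
print(patterns_jsonl)
```

Saving the printed lines to a `.jsonl` file should give you something you can pass in as a patterns file. Because `lower` compares against the lowercased token text, one pattern covers "iPhone", "iphone" and "IPHONE".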
There’s a lot more you can do with token attributes here. If you’re starting off with a pre-trained model, you could also use part-of-speech tags. For example, only match “love” if it’s used as a verb and not as a noun.
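As a rough sketch of the part-of-speech idea, a patterns entry could combine `lower` and `pos` on the same token. The `LOVE_VERB` label is a made-up placeholder, and this assumes the model you load actually predicts part-of-speech tags.

```python
import json

# Hypothetical label; the "pos" constraint only works if the loaded
# model assigns part-of-speech tags to tokens.
verb_pattern = {
    "label": "LOVE_VERB",
    # A single token whose lowercase text is "love" AND whose coarse POS tag is VERB
    "pattern": [{"lower": "love", "pos": "VERB"}],
}

# One line of a JSONL patterns file
verb_pattern_line = json.dumps(verb_pattern)
print(verb_pattern_line)
```

Both conditions sit in the same token dict, so they have to hold for the same token. This only covers the exact form "love"; matching on `lemma` instead of `lower` would be one way to also catch "loves" or "loved", depending on how well the model lemmatizes.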
Important note: The rule-based matching docs also describe some new features like the extended pattern syntax that are only available in spaCy v2.1. Those are also marked with a little “2.1” tag. You’ll be able to use those once the new version of Prodigy for spaCy v2.1 is available – see here for details.
Btw, when using operators like "?" and "*" in more complex patterns, you may come across some inconsistencies. This will all be fixed in spaCy v2.1.
Just keep in mind that the current version of Prodigy isn't compatible with spaCy v2.1 yet – we want to make sure the new version is 100% stable and tested by everyone before we ask Prodigy users to retrain their models. See here for details: