This is one of the downsides of relying on statistical predictions in match patterns – you kinda lose the reliability of rule-based matching, because the results now depend on what the model happens to predict for a particular example. So for things like numbers, IS_DIGIT or LIKE_NUM is definitely the more reliable option. In spaCy v2.1+, you can also use the IN operator (for set membership) or REGEX to describe the token using a regular expression.
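Here's a rough sketch of what those four attributes look like in token patterns. It assumes a blank English pipeline (tokenizer only, so no model predictions are involved) and the newer matcher.add(name, [pattern]) call style; on older v2.x releases the call is matcher.add(name, None, pattern) instead. The find helper and the sample sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer and lexical attributes only, no statistical model.
nlp = spacy.blank("en")


def find(pattern, text):
    """Hypothetical helper: return the text of every span matching one pattern."""
    matcher = Matcher(nlp.vocab)
    # Assumes the matcher.add(name, [pattern]) signature of newer spaCy versions.
    matcher.add("EXAMPLE", [pattern])
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


text = "I bought 10 apples and ten pears in 2019"

# IS_DIGIT: the token text consists of digits only.
digits = find([{"IS_DIGIT": True}], text)

# LIKE_NUM: the token resembles a number, including spelled-out ones.
numbers = find([{"LIKE_NUM": True}], text)

# IN: the lowercase form must be a member of the given list.
fruit = find([{"LOWER": {"IN": ["apples", "pears"]}}], text)

# REGEX: the token text must match the regular expression.
years = find([{"TEXT": {"REGEX": r"^(19|20)\d{2}$"}}], text)

print(digits, numbers, fruit, years)
```

None of these depend on a tagger or entity recognizer, so the same text should always give the same matches.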
Do you have a small reproducible example that shows the "skipping" behaviour you mean?
Have you upgraded to v1.8.2? If you were seeing warnings about empty vectors that slowed down the startup, that might be the problem. See here for details: