Find decimals and number ranges for patterns

To speed up my annotations I am using patterns, and I have been looking for ways to find decimals and number ranges. For decimals I found:

{"label": "DECIMAL", "pattern": [{"SHAPE": "dd.dd"}]}

It actually took me a long time to work this out, so Question 1: can you point me to the shape documentation? I want to investigate whether there are other things it could help me with.

Whilst shape works, I was previously trying to find decimals with regex, which I can't get to work.

Question 2: Is it actually possible to use regex to match patterns in Prodigy, and if so, can you help me find decimals, e.g. 34.56, with regex? (Shape works, but this will help me 'see' other potential solutions.) My attempts were:

{"label": "DECIMAL", "pattern": [{"TEXT": {"REGEX": "^[0-9]{2}$"}}, {"TEXT": {"REGEX": "^\\.$"}}, {"TEXT": {"REGEX": "^[0-9]{2}$"}}]}
{"label": "DECIMAL", "pattern": [{"TEXT": {"REGEX": "^[0-9]{2}$\\.\\^[0-9]{2}$"}}]}

Question 3
If regex does work to create patterns, can you help me with the correct syntax to match number ranges, e.g. numbers between 30.01 and 30.99 and between 50.00 and 50.99? My attempt:

{"label": "NUM_RANGE", "pattern": [{"TEXT": {"REGEX": "^30\\.\\d{2}$"}}]}
{"label": "NUM_RANGE", "pattern": [{"TEXT": {"REGEX": "^50\\.\\d{2}$"}}]}

I intend to use patterns for both NER and SpanCat tasks.

Thank you as ever. This is a really great forum, both for asking questions and for seeing others' Q&A, and it is much appreciated.

Hi @alphie,

Glad to hear you find the forum helpful :)

Re Question 1
Prodigy's Matcher uses spaCy's Matcher under the hood, so the spaCy docs on rule-based matching should be the place to look for information.
SHAPE uses the shape attribute of tokens, which is documented here. You can always print the orthographic shape of every token (.shape_) to see which pattern you should use in the Matcher rules.
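
To get a feel for what SHAPE values look like, here is a rough pure-Python approximation of how a token shape is derived. It is only a sketch: the helper name approx_shape is made up for illustration, and the rule that runs of the same shape character are cut off after four is my understanding of spaCy's behaviour, so printing token.shape_ in spaCy itself remains the reliable check.

```python
def approx_shape(text):
    """Approximate a spaCy-style token shape (hypothetical helper, not spaCy's API)."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            c = "X" if ch.isupper() else "x"  # letters keep only their case
        elif ch.isdigit():
            c = "d"                           # every digit becomes "d"
        else:
            c = ch                            # punctuation is kept as-is
        if c == last:
            run += 1
        else:
            run = 0
            last = c
        if run < 4:                           # assumed: long runs are truncated to four
            shape.append(c)
    return "".join(shape)


for token_text in ["34.56", "3.5", "123.45", "1234567.89"]:
    print(token_text, "->", approx_shape(token_text))
# 34.56 -> dd.dd
# 3.5 -> d.d
# 123.45 -> ddd.dd
# 1234567.89 -> dddd.dd
```

Under these assumptions, "dd.dd" only covers two digits on each side of the point, so values like 3.5 or 123.45 would need their own SHAPE patterns.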

Re Question 2
The reason your first DECIMAL pattern doesn't work is that spaCy regex patterns are applied to a single token. Your first pattern defines a sequence of three tokens, which does not appear in the input text because 34.56 is a single token, not a sequence of three tokens (if you use spaCy's default tokenizer).
The second pattern is a better attempt, but it contains the anchors $ and ^ in the middle of the expression (and the ^ is escaped, so it would match a literal caret), which makes it impossible to match the text. Here's a corrected version of the pattern that matches dd.dd-style decimals:

{"label": "DECIMAL", "pattern" :[{"TEXT": {"REGEX": "^[0-9]{2}\\.[0-9]{2}$"}}]}
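
If it helps to see the difference outside Prodigy, the two regexes can be compared with Python's re module alone. This is a sketch that assumes the REGEX operator behaves like re.search applied to one token's text (my understanding of spaCy's Matcher); the token strings are made-up examples.

```python
import re

# Made-up token texts; "34", "." and "56" stand in for a three-token split
tokens = ["34.56", "34", ".", "56", "134.56", "34.567", "3.56"]

# The dd.dd regex, as the regex engine sees it after JSON unescaping
corrected = re.compile(r"^[0-9]{2}\.[0-9]{2}$")
# The second attempt from the question: "$" and an escaped "^" sit mid-pattern
attempt = re.compile(r"^[0-9]{2}$\.\^[0-9]{2}$")

corrected_hits = [t for t in tokens if corrected.search(t)]
attempt_hits = [t for t in tokens if attempt.search(t)]

print(corrected_hits)  # ['34.56']
print(attempt_hits)    # []
```

The anchored version accepts only the two-digit.two-digit form, while the attempt with anchors in the middle matches none of the sample tokens.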

Another approach could be to leverage spaCy's built-in pattern matching to find numbers with the LIKE_NUM token attribute and then apply the decimal-detection regex only to these tokens.
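
As a minimal sketch of that idea, the pattern line could be generated from Python so that json.dumps takes care of the backslash escaping. Putting LIKE_NUM and TEXT in the same token dict is an assumption about how you might combine them (both conditions would then have to hold for the same token); the variable names are illustrative only.

```python
import json

# Raw string: the regex exactly as the regex engine should see it
decimal_regex = r"^[0-9]{2}\.[0-9]{2}$"

# Assumed combination: the token must look like a number AND match the regex
pattern = {
    "label": "DECIMAL",
    "pattern": [{"LIKE_NUM": True, "TEXT": {"REGEX": decimal_regex}}],
}

# One JSON object per line, as expected in a patterns .jsonl file
line = json.dumps(pattern)
print(line)
# {"label": "DECIMAL", "pattern": [{"LIKE_NUM": true, "TEXT": {"REGEX": "^[0-9]{2}\\.[0-9]{2}$"}}]}
```

Writing the regex once as a raw string and letting json.dumps double the backslash avoids hand-escaping mistakes in the patterns file.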

Re Question 3
Your patterns look correct. I just made them a bit more precise, because the first range starts at 01 and the second at 00.

# 30.01 - 30.99
{"label": "NUM_RANGE", "pattern": [{"TEXT": {"REGEX": "^30\\.(0[1-9]|[1-9][0-9])$"}}]}
# 50.00 - 50.99
{"label": "NUM_RANGE", "pattern": [{"TEXT": {"REGEX": "^50\\.[0-9][0-9]$"}}]}
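
A range regex is easy to get subtly wrong, so it can help to enumerate every two-decimal value and see which ones match. The sketch below does that with Python's re module; the helper name matching_values and the two regexes are illustrative choices for the 30.01 to 30.99 and 50.00 to 50.99 ranges, not something taken from Prodigy.

```python
import re


def matching_values(regex, whole):
    """Return every 'whole.xx' string (xx = 00..99) that the regex matches."""
    compiled = re.compile(regex)
    candidates = [f"{whole}.{frac:02d}" for frac in range(100)]
    return [c for c in candidates if compiled.search(c)]


# 30.01 - 30.99: either 0 followed by 1-9, or 1-9 followed by any digit
low = matching_values(r"^30\.(0[1-9]|[1-9][0-9])$", 30)
# 50.00 - 50.99: any two digits after the point
high = matching_values(r"^50\.[0-9]{2}$", 50)

print(len(low), low[0], low[-1])     # 99 30.01 30.99
print(len(high), high[0], high[-1])  # 100 50.00 50.99
```

A character class such as [0-9][1-9] would fail this check for the first range, because it skips 30.10, 30.20 and so on.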

Brilliant, thanks so much. Solved.
