Hi!
I have a question about using custom matchers when training a custom NER model. I created a JSONL file with my matchers, and in most cases it works perfectly fine, but there are 2 edge cases I need some further advice on:
- Usage of regex. I would like to extract numbers from sentences such as "Value of X is less than 5", or "Value of Y is more than 10", where 5 would be labeled as BELOW and 10 would be labeled as ABOVE. To do this, I created the following regular expressions:
a) (?<=less than).?(\d*\.)?\d+
b) (?<=more than).?(\d*\.)?\d+
I tested them on https://regexr.com/ and they seem to work as I'd like them to work. It resulted in the following lines in the file with my custom matchers:
{"label": "BELOW", "pattern": [{"text": {"regex": "(?<=less than).?(\d*\.)?\d+"}}]}
{"label": "ABOVE", "pattern": [{"text": {"regex": "(?<=more than).?(\d*\.)?\d+"}}]}
When I try to run ner.manual with those custom matchers, I get the following error:
ValueError: Invalid JSON on line 68: {"label": "BELOW", "pattern": [{"text": {"regex": "(?<=less than).?(\d*\.)?\d+"}}]}
It seems that the issue is with the regex itself because when I try to use a simpler one (such as (abc)) it works without any issue. Do you have any idea what could help in my case? I'd be grateful for some advice!
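For reference, here is a minimal sketch that reproduces the error outside Prodigy. It assumes the patterns file is parsed with standard JSON rules, as Python's json module does; I have not confirmed that this is how Prodigy reads the file. The variable names are only for illustration.

```python
import json

# The line exactly as it appears in my patterns file (single backslashes).
bad_line = r'{"label": "BELOW", "pattern": [{"text": {"regex": "(?<=less than).?(\d*\.)?\d+"}}]}'

try:
    json.loads(bad_line)
    parsed_ok = True
except json.JSONDecodeError as err:
    # Standard JSON only allows a fixed set of escapes after a backslash,
    # and \d is not one of them.
    parsed_ok = False
    print("JSON error:", err)

# The same entry built as a Python dict and serialised with json.dumps.
pattern = {
    "label": "BELOW",
    "pattern": [{"text": {"regex": r"(?<=less than).?(\d*\.)?\d+"}}],
}
good_line = json.dumps(pattern)
print(good_line)  # every backslash in the regex is written doubled

# Reading the generated line back gives the original regex again.
round_trip = json.loads(good_line)
```

So the backslashes appear to need doubling inside the JSON string. What I can't tell from this is whether the regex itself would then behave as intended, since the lookbehind refers to the words before the number rather than to the number alone.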
- Entities based on a list of potential keywords. I'd like to catch the units in my text and label them as UNIT. I have a list of potential units that may appear in my texts, let's say: ["g", "ml", "g/ml", "cm", ... , "kg"].
My issue is that sometimes, even though the text contains a longer unit (let's say "g/ml"), only "g" is selected as UNIT, because "g" is also a unit in my list. Is there a workaround for that? Does the order in the list matter? Or is there a parameter that would pick the longer entity when two candidates overlap?
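To make the behaviour I'm after concrete, here is a plain-Python sketch of "prefer the longest unit". It uses only the standard re module, not Prodigy's or spaCy's matcher, and the find_units helper, the shortened unit list, and the example sentence are made up for illustration.

```python
import re

# Shortened, hypothetical version of my unit list.
units = ["g", "ml", "g/ml", "cm", "kg"]

# Put longer units first so that "g/ml" is tried before "g" at the same position.
by_length = sorted(units, key=len, reverse=True)
alternation = "|".join(re.escape(u) for u in by_length)

# The lookarounds reject matches glued to a letter or a slash, so the "g"
# inside a longer word or inside "g/ml" is not picked up on its own.
unit_re = re.compile(r"(?<![A-Za-z/])(?:" + alternation + r")(?![A-Za-z/])")

def find_units(text):
    """Return (start, end, unit) for every unit found, longest match preferred."""
    return [(m.start(), m.end(), m.group()) for m in unit_re.finditer(text)]

example = "Density is 5 g/ml and mass is 10 g"
found = find_units(example)
print(found)
```

In this sketch the order of the alternatives is what decides between "g" and "g/ml", which is why I'm wondering whether the order of my patterns plays a similar role in the matcher.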
Thank you for your help in advance!