Issues with custom matchers for NER

Hi!

I'd like to ask a question about using custom matchers to train a custom NER model. I created a jsonl file with my matchers, and in most cases it works perfectly fine, but I have 2 edge cases that I need some advice on:

  1. Usage of regex. I would like to extract numbers from sentences such as "Value of X is less than 5", or "Value of Y is more than 10", where 5 would be labeled as BELOW and 10 would be labeled as ABOVE. To do this, I created the following regular expressions:

a) (?<=less than).?(\d*\.)?\d+
b) (?<=more than).?(\d*\.)?\d+

I tested them on https://regexr.com/ and they seem to work as I'd like them to work. It resulted in the following lines in the file with my custom matchers:

{"label": "BELOW", "pattern": [{"text": {"regex": "(?<=less than).?(\d*\.)?\d+"}}]}
{"label": "ABOVE", "pattern": [{"text": {"regex": "(?<=more than).?(\d*\.)?\d+"}}]}

When I try to run ner.manual with those custom matchers, I get the following error:

ValueError: Invalid JSON on line 68: {"label": "BELOW", "pattern": [{"text": {"regex": "(?<=less than).?(\d*\.)?\d+"}}]}

It seems that the issue is with the regex itself because when I try to use a simpler one (such as (abc)) it works without any issue. Do you have any idea what could help in my case? I'd be grateful for some advice!

  2. Entities based on a list of potential keywords. I'd like to catch the units in my text and label them as UNIT. I have a list of potential units that may appear in my texts, let's say: ["g", "ml", "g/ml", "cm", ... , "kg"].
    My issue is that sometimes, although the text contains a longer unit (let's say g/ml), only "g" is selected as UNIT because "g" is also a unit from my list. Is there any workaround for that? Does the order in the list matter? Or is there a parameter that picks the longer entity when two of them could be selected?

Thank you in advance for your help! :)

Hi @Alicja ! Welcome to the support forum! Let me answer the first question:

  1. Your regex is correct, but you need to escape each backslash \ in JSON. You can use an online JSON validator (like https://jsonlint.com/) to verify. I went ahead and tried one of your patterns:
{
	"label": "BELOW",
	"pattern": [{
		"text": {
			"regex": "(?<=less than).?(\\d*\\.)?\\d+"
		}
	}]
}

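Note the doubled backslashes in the regex value. If you're generating the patterns file programmatically, you can let json.dumps do the escaping for you (re.escape is not the right tool here, since it escapes regex metacharacters and would change what the pattern matches):

import json
pattern = {"label": "BELOW", "pattern": [{"text": {"regex": r"(?<=less than).?(\d*\.)?\d+"}}]}
print(json.dumps(pattern))  # each backslash is written as \\ in the output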
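
If you'd rather check a patterns line locally than paste it into an online validator, here is a minimal sketch using only the standard json module. The helper name is_valid_json is made up for illustration; the two lines are the BELOW pattern from this thread, once as originally written and once with the backslashes doubled.

```python
import json

# The line as first written: \d and \. are not valid JSON escape sequences.
bad_line = r'{"label": "BELOW", "pattern": [{"text": {"regex": "(?<=less than).?(\d*\.)?\d+"}}]}'
# The same line with every backslash doubled.
good_line = r'{"label": "BELOW", "pattern": [{"text": {"regex": "(?<=less than).?(\\d*\\.)?\\d+"}}]}'


def is_valid_json(line):
    """Hypothetical helper: True if the line parses as JSON."""
    try:
        json.loads(line)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(bad_line))
print(is_valid_json(good_line))

# After parsing, the regex contains single backslashes again.
parsed_regex = json.loads(good_line)["pattern"][0]["text"]["regex"]
print(parsed_regex)
```

The doubled backslashes only exist in the file on disk; once the line is parsed, the regex is back to the form you tested.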

Thank you @ljvmiranda921! I tried to apply your suggestion but, unfortunately, it still doesn't work. There are no errors, but the numbers are not detected as entities (see the example below):

image

Hi @akocienia ,

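What happens here is that the text is split into separate tokens, such as ["50", "%"], and each dictionary in a pattern describes exactly one token. A regex inside a single-token pattern is only tested against that one token's text, so it can't match anything that spans several tokens. I'd recommend that you write a pattern that covers all the tokens involved.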
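
To make this concrete, here is a minimal sketch in plain Python. It assumes the token-level regex is tested against each token's text on its own, and it uses a whitespace split as a stand-in for the real tokenizer; the sample sentence is made up.

```python
import re

# The ABOVE regex from this thread, with its lookbehind on "more than".
pattern = re.compile(r"(?<=more than).?(\d*\.)?\d+")

text = "Value of Y is more than 50 %"

# Stand-in for a tokenizer (an assumption): a plain whitespace split.
# Like a real tokenizer, it yields "50" and "%" as separate tokens here.
tokens = text.split()

# Tested one token at a time, the lookbehind never sees "more than".
per_token_hits = [tok for tok in tokens if pattern.search(tok)]

# Run over the whole string, the same regex can see the preceding words.
whole_text_hit = pattern.search(text)

print(per_token_hits)
print(repr(whole_text_hit.group(0)))
```

Tested token by token, nothing matches. Over the whole string the regex does match, although the leading `.?` also pulls the space before the number into the match, so the span would need trimming.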

It might also be useful to implement your own matching logic using regex over the whole text (see the thread "Create new entities from regex", post #3 by ines). Just make sure that you don't end up with any overlapping spans :)

The advantage of the latter approach is that you have total control over the implementation logic and can write your own heuristics, for example taking only the longest span whenever spans overlap.
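
A rough sketch of that idea, using only Python's re module. The function name find_spans, the shape of the span dicts, and the exact regexes are assumptions made for illustration, not part of any library; the unit list is the one from the question. Attaching the resulting spans to annotation tasks is left out.

```python
import re

# Hypothetical label -> regex table; group 1 is the number to annotate.
NUMBER_PATTERNS = {
    "BELOW": re.compile(r"less than\s*((?:\d*\.)?\d+)"),
    "ABOVE": re.compile(r"more than\s*((?:\d*\.)?\d+)"),
}
UNITS = ["g", "ml", "g/ml", "cm", "kg"]


def find_spans(text):
    """Return non-overlapping character spans, preferring longer matches."""
    candidates = []
    for label, rx in NUMBER_PATTERNS.items():
        for m in rx.finditer(text):
            candidates.append({"start": m.start(1), "end": m.end(1), "label": label})
    for unit in UNITS:
        # Each unit is matched separately, so "g" and "ml" also fire inside "g/ml".
        rx = re.compile(r"(?<!\w)" + re.escape(unit) + r"(?!\w)")
        for m in rx.finditer(text):
            candidates.append({"start": m.start(), "end": m.end(), "label": "UNIT"})

    # Longest span first; ties are broken by the earlier start.
    candidates.sort(key=lambda s: (-(s["end"] - s["start"]), s["start"]))
    kept = []
    for span in candidates:
        # Keep a span only if it does not overlap anything already kept.
        if all(span["end"] <= k["start"] or span["start"] >= k["end"] for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s["start"])


text = "Value of X is less than 5 g/ml and value of Y is more than 10.5 cm"
found = [(text[s["start"]:s["end"]], s["label"]) for s in find_spans(text)]
print(found)
```

Because longer candidates are considered first, "g/ml" wins over the "g" and "ml" matches inside it, and the order of the unit list does not matter.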


Hi @ljvmiranda921!

Thanks for your answer. I'm not sure I understand you properly. In the text, there is a space between "50" and "%". I'd like to have them as separate entities, where "50" is ABOVE and "%" is UNIT. Those entities are not overlapping. I have a separate custom matcher for units which is not connected to the actual value before the unit. Is there any way to make it work?