Create new entities from regex

Hi,
I need to create custom entities using my existing regex.

For example, I need to create organizations matching the following:
ORGANIZATION_REGEX_PATTERN = “(( [A-Z]\w+((á|Á)|(é|É|É)|(í|Í)|(ó|Ó)|(ú|Ú)|)\w+( (Y|y)|)){0,2} (de(l|)( l(a|o)(s|)|)|)( |)(([A-Z]\w+((á|Á)|(é|É|É)|(í|Í)|(ó|Ó)|(ú|Ú)|)\w+){0,2}|) (S(PA|pA)|spa|S.A(| ).|s.a(| ).|e.i.r.l(.|)|E.I.R.L(.|)|((L(tda|TDA)|ltda|L(IMITADA|imitada))( y (S.A(| ).|s.a(| ).)|)|limitada|company|sociedad anonima|sociedad por acciones|Asociación Gremial|A.G(| ).))|(C(orp|ORP)|Corporaci(ó|o)n)( de(l|)( l(a|o)(s|)|)|)( [A-Z]\w+((á|Á)|(é|É)|(í|Í)|(ó|Ó)|(ú|Ú)|)\w+( (Y|y)|)){1,3})”

So how can I proceed?

Hi there, could I please get some advice here so I can proceed?

Thanks,
Joaquín

Hi! We try to do our best and answer questions as soon as possible, and I usually put a lot of effort into my replies. However, we can’t guarantee instant replies and help with your implementation. You posted your question late at night my time, and already bumped the thread at noon my time. This really isn’t productive.

You can also always use the search function (button in the top right corner) to see if a question has already been answered before. For example, if you type in “regex”, you’ll find threads related to using regular expressions (https://support.prodi.gy/search?q=“regex”). The first result actually shows a very similar approach and solution.

If you just want to stream in regex matches and annotate whether they are correct / suitable training data or not, the easiest way would be to write a function that takes the incoming stream of examples, finds matches in the texts and creates an annotation example with a "span" for each match (see the “Annotation task formats” in your PRODIGY_README.html for details on the JSON format).

Here’s a simple example:

import re
import copy

expression = re.compile(YOUR_REGEX_HERE)  # replace YOUR_REGEX_HERE with your pattern string
label = 'ORG'  # or any other label

def regex_matcher(stream):
    for eg in stream:
        for match in expression.finditer(eg['text']):  # find matches in the example text
            task = copy.deepcopy(eg)  # match found: copy the example
            start, end = match.span()  # character offsets of the match
            task['spans'] = [{'start': start, 'end': end, 'label': label}]  # label the match
            yield task
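
To check what the function yields before wiring it into a recipe, it can be run on a small in-memory stream. This is only a sketch: the pattern and the sample texts below are made-up stand-ins, not the regex from the question.

```python
import copy
import re

# Hypothetical stand-in pattern: capitalised words followed by a company suffix.
expression = re.compile(r"(?:[A-Z]\w+ )+(?:S\.A\.|Ltda\.|SpA)")
label = "ORG"


def regex_matcher(stream):
    """Yield one task per regex match, with the match stored as a span."""
    for eg in stream:
        for match in expression.finditer(eg["text"]):
            task = copy.deepcopy(eg)
            start, end = match.span()
            task["spans"] = [{"start": start, "end": end, "label": label}]
            yield task


# Made-up sample texts; a real stream would come from your data file.
stream = [
    {"text": "La empresa Comercial Andes S.A. abre en marzo."},
    {"text": "No hay ninguna empresa en esta frase."},
]

tasks = list(regex_matcher(stream))
for task in tasks:
    span = task["spans"][0]
    # Print the matched text next to the span that would be highlighted.
    print(task["text"][span["start"]:span["end"]], span)
```

Note that texts without any match are skipped by this function, so they never reach the annotation interface.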

Here’s a custom recipe template to get you started:
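
A minimal sketch of how such a recipe might look, written so that it runs without Prodigy installed. The pattern, the `read_jsonl` helper and the recipe name `regex-ner` are assumptions made for illustration; in an actual recipe file the function would be registered with Prodigy's `@prodigy.recipe` decorator, as noted in the comments.

```python
import copy
import json
import re

# Hypothetical stand-in pattern; replace it with your own expression.
expression = re.compile(r"(?:[A-Z]\w+ )+(?:S\.A\.|Ltda\.|SpA)")
label = "ORG"


def regex_matcher(stream):
    """Same matcher as above, repeated so this file runs on its own."""
    for eg in stream:
        for match in expression.finditer(eg["text"]):
            task = copy.deepcopy(eg)
            start, end = match.span()
            task["spans"] = [{"start": start, "end": end, "label": label}]
            yield task


def read_jsonl(path):
    """Hypothetical helper: yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# In a real recipe file, this function would be decorated with
# @prodigy.recipe('regex-ner') after `import prodigy`; the name is made up.
def regex_ner(dataset, source):
    stream = read_jsonl(source)     # load the raw examples
    stream = regex_matcher(stream)  # attach a span for each regex match
    return {
        "dataset": dataset,  # dataset the annotations are saved to
        "stream": stream,    # generator of annotation tasks
        "view_id": "ner",    # render spans as highlighted entities
    }
```

The key step is passing the loaded stream through `regex_matcher` before returning it; without that, the texts reach the interface without any spans and nothing is highlighted. Assuming the decorated version is saved as recipe.py, it would typically be run with something like `prodigy regex-ner my_dataset data.jsonl -F recipe.py`.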

Using the view_id "ner", you can render the examples as highlighted entities, and then accept or reject them. The annotations will then be saved to the given dataset, and you can then use them to update a model.

Hi Ines!
Thanks for your reply. I’m able to open the web page and start training, but I don’t see anything selected. Do I need to accept when the entity is in upper case, or only when I see the correct entity?

Thanks again for your help
Joaquin

If there’s no entity in the text, you should accept it – texts with no entities are also very important for your training data. You don’t only want to show the model examples of texts with entities, you also want to be showing it examples of texts without entities.

If the text has an entity and it’s not highlighted, that means your rules didn’t catch it. So in that case, you should reject the example. You should also reject examples that are “almost correct” or partially highlighted, because when you’re training your model later on, the accepted entities you show it should be fully correct.

There is never any highlighted text when I use the “mark” custom recipe. I changed my regex to find just one word and I still don’t get anything highlighted.

Is there something to configure in the web app?

OK, I had to connect the regex_matcher to the stream. Now it’s working. I’ll contact you if I have more questions.

Thanks

Hi Ines,
Now I’m having issues trying to train my model so I can test it.
Here is my log:
prodigy ner.batch-train org_ner es_core_news_md --output date-model --label DATE --eval-split 0.5
Using 1 labels: DATE

Loaded model es_core_news_md
Using 50% of accept/reject examples (1491) for evaluation
Traceback (most recent call last):
  File "/home/joaquinu/.conda/envs/tf36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/joaquinu/.conda/envs/tf36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/joaquinu/.conda/envs/tf36/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 253, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/joaquinu/.conda/envs/tf36/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/joaquinu/.conda/envs/tf36/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/joaquinu/.conda/envs/tf36/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 426, in batch_train
    examples = list(split_sentences(model.orig_nlp, examples))
  File "cython_src/prodigy/components/preprocess.pyx", line 38, in split_sentences
  File "cython_src/prodigy/components/preprocess.pyx", line 150, in prodigy.components.preprocess._add_tokens
KeyError: 21

I also want to ask you about the regex for Prodigy. I was able to use it, but it seems that it’s only useful for annotating examples so we can train the model afterwards. Is this true, or am I missing something here?

Thanks,
Joaquin

Ah, this should be fixed in the next version. For now, try setting --unsegmented when you run ner.teach.

Yes, you can use the regular expressions to help you annotate examples faster, so you can train a model.

If your goal is not to train a model to generalise based on your regular expressions and you just want your matches labelled in your data, you can just write a Python script that matches your regex in a text and returns the matches.
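
As a rough illustration of that last option, a standalone script could look like the following. The pattern, the `find_matches` name and the sample text are placeholders, not part of Prodigy or of the original regex.

```python
import re

# Placeholder pattern; substitute your own organization regex.
ORG_PATTERN = re.compile(r"(?:[A-Z]\w+ )+(?:S\.A\.|Ltda\.|SpA)")


def find_matches(text, pattern=ORG_PATTERN, label="ORG"):
    """Return every match in `text` as a dict with offsets, text and label."""
    return [
        {"start": m.start(), "end": m.end(), "text": m.group(0), "label": label}
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    # Made-up sample sentence for a quick manual check.
    sample = "Ayer firmaron Comercial Andes S.A. y Transportes Sur Ltda. en Santiago."
    for found in find_matches(sample):
        print(found)
```

This labels exactly what the expression matches and nothing more, so it suits cases where the rules are already good enough and no statistical model is needed.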