regex + training categories

Hi,

I'm very new to NLP and spaCy but excited to use these tools.
I'm looking for some help on how to create a pipeline where I can first label a bunch of entities (PHONE_NUMBER, ADDRESS, ORDER_ID and others) using regex, and then, as a second step, categorize the docs with a trained model using textcat.

A few things I am unsure of: if the regex step runs before the NER, do the regex-labeled entities influence the output of the statistical model? Is there a preferred pattern or example of how to do this? Finally, what I really want is the ability to extract all the entities and tabulate them together with their categorical classification. Is there a function that does this already?

Thanks,
Mark

Hi,

Apologies for the delay in replying to this. I missed the thread initially. Sorry!

You can use regex to label entities in spaCy with the EntityRuler component, whose token patterns support regular expressions. As long as the EntityRuler runs before the NER component, the entities it sets will affect the NER model's predictions, because the NER won't overwrite previously set entities. However, the textcat model doesn't pay attention to the NER classifications, so this won't affect the textcat decisions.
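
As a rough sketch of the EntityRuler route: this assumes a spaCy v2-style pipeline like the one used elsewhere in this thread (spaCy v3 adds components by string name instead), and the ORDER_ID format of six or more digits is purely hypothetical. The helper name add_entity_ruler is also made up for illustration. Note that the REGEX operator is applied to a single token's text, not to the whole document.

```python
# Hypothetical format: any token made up of six or more digits is an ORDER_ID.
ORDER_ID_REGEX = r"^\d{6,}$"

# Each pattern is a list of token descriptions; here a single token
# whose text has to match the regular expression.
PATTERNS = [
    {"label": "ORDER_ID", "pattern": [{"TEXT": {"REGEX": ORDER_ID_REGEX}}]},
]


def add_entity_ruler(nlp, patterns=PATTERNS):
    """Add an EntityRuler with the given patterns, ahead of "ner" if present."""
    # Imported lazily so the pattern data can be inspected without spaCy installed.
    from spacy.pipeline import EntityRuler

    ruler = EntityRuler(nlp)
    ruler.add_patterns(patterns)
    if "ner" in nlp.pipe_names:
        # v2-style call: pass the component object and its position.
        nlp.add_pipe(ruler, before="ner")
    else:
        nlp.add_pipe(ruler)
    return nlp
```

Calling add_entity_ruler(nlp) on a loaded pipeline should leave the rule-based entities in place by the time the statistical NER runs.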

There's no built-in function to tabulate the entities, because we've preferred to keep the API surface a bit smaller. You should find it easy to do this with your own loop. If you're reading the data from Prodigy, you can get the annotations out using the prodigy db-out command, which gives you newline-delimited JSON that's very easy to work with. If the annotations are already on spaCy Doc objects, you just need to use doc.ents to get the entities.
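
A loop like that could look roughly as follows. This is only a sketch: the function name tabulate_entities and the row layout are made up, and it assumes each doc has doc.ents set and doc.cats filled by a textcat component (a dict mapping each category label to a score).

```python
def tabulate_entities(docs):
    """Return one row per entity, paired with the doc's best-scoring category."""
    rows = []
    for doc in docs:
        # doc.cats maps each textcat label to a score; pick the highest one,
        # or None if no textcat component has run on this doc.
        category = max(doc.cats, key=doc.cats.get) if doc.cats else None
        for ent in doc.ents:
            rows.append(
                {"entity": ent.text, "label": ent.label_, "category": category}
            )
    return rows
```

With a loaded pipeline, something like rows = tabulate_entities(nlp.pipe(texts)) should give you a list of dicts that you can write out with csv.DictWriter or load into a data frame.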

To add to this: Most of the time, your regular expressions will probably be written over the whole doc.text, not on a per-token basis (which is what spaCy's Matcher supports). In that case, you could also write a custom pipeline component that uses re.finditer on the doc.text to find the matches, calls doc.char_span to create a Span object with a given label and adds the spans to the doc.ents.

For example, something like this:

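import re

def add_regex_entities(doc):
    label = "SOME_ENTITY_LABEL"
    expression = r'...'  # your regex here
    spans = []
    for match in re.finditer(expression, doc.text):
        start, end = match.span()
        # char_span returns None if the match doesn't line up with token boundaries
        span = doc.char_span(start, end, label=label)
        if span is not None:
            spans.append(span)
    # doc.ents must not contain overlapping spans
    doc.ents = list(doc.ents) + spans
    return doc

# nlp is your loaded pipeline; pass before="ner" to run this ahead of the NER component
nlp.add_pipe(add_regex_entities)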
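
One thing to watch out for with this approach: doc.ents can't hold overlapping spans, so a regex match that overlaps an existing entity will raise an error on assignment. A minimal sketch of one way to resolve that, preferring the longest span, is below. The helper name drop_overlapping is hypothetical, and it only assumes the spans expose start and end token indices the way spaCy's Span does. Newer spaCy versions also ship spacy.util.filter_spans for the same purpose.

```python
def drop_overlapping(spans):
    """Keep the longest spans and drop any span that overlaps a kept one."""
    # Longest first; ties are broken by the earlier start position.
    ordered = sorted(spans, key=lambda s: (-(s.end - s.start), s.start))
    kept = []
    seen_tokens = set()
    for span in ordered:
        tokens = set(range(span.start, span.end))
        # Only keep the span if none of its tokens are already claimed.
        if tokens.isdisjoint(seen_tokens):
            kept.append(span)
            seen_tokens.update(tokens)
    # Return the survivors in document order.
    return sorted(kept, key=lambda s: s.start)
```

In the component you would then assign doc.ents = drop_overlapping(list(doc.ents) + spans). Keep in mind that this gives priority to the longest span, which is not necessarily the entity that was set first.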

Hi Ines and @tiangolo,

With Prodigy I am building an NER model deployed on FastAPI. To detect some entities (like dates) I use the add_regex_entities function which Ines wrote above, added to the nlp pipeline before ner. Most of the time it works as expected: I upload a file and it extracts entities from the text. But sometimes it throws an internal server error for a different piece of text.

Reading the log below, I can't tell exactly where the error comes from. Could you have a look and help me? Thank you for your time!

INFO:     127.0.0.1:51979 - "POST /uploadfile/ HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "C:\Users\xia.he\Desktop\HortiSem\hortisem-env\lib\site-packages\uvicorn\protocols\http\h11_impl.py", line 388, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "C:\Users\xia.he\Desktop\HortiSem\hortisem-env\lib\site-packages\uvicorn\middleware\proxy_headers.py", line 45, in __call__
    return await self.app(scope, receive, send)
  File "C:\Users\xia.he\Desktop\HortiSem\hortisem-env\lib\site-packages\uvicorn\middleware\debug.py", line 81, in __call__
    raise exc from None
  File "C:\Users\xia.he\Desktop\HortiSem\hortisem-env\lib\site-packages\uvicorn\middleware\debug.py", line 78, in __call__
    await self.app(scope, receive, inner_send)
  File "C:\Users\xia.he\Desktop\HortiSem\hortisem-env\lib\site-packages\fastapi\applications.py", line 179, in __call__
    await super().__call__(scope, receive, send)
  File "C:\Users\xia.he\Desktop\HortiSem\hortisem-env\lib\site-packages\starlette\applications.py", line 111, in __call__
    await self.middleware_stack(scope, receive, send)
  File "C:\Users\xia.he\Desktop\HortiSem\hortisem-env\lib\site-packages\starlette\middleware\errors.py", line 181, in __call__
    raise exc from None
  File "C:\Users\xia.he\Desktop\HortiSem\hortisem-env\lib\site-packages\starlette\middleware\errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "C:\Users\xia.he\Desktop\HortiSem\hortisem-env\lib\site-packages\starlette\exceptions.py", line 82, in __call__
    raise exc from None
  File "C:\Users\xia.he\Desktop\HortiSem\hortisem-env\lib\site-packages\starlette\exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "C:\Users\xia.he\Desktop\HortiSem\hortisem-env\lib\site-packages\starlette\routing.py", line 566, in __call__
    await route.handle(scope, receive, send)
  File "C:\Users\xia.he\Desktop\HortiSem\hortisem-env\lib\site-packages\starlette\routing.py", line 227, in handle
    await self.app(scope, receive, send)
  File "C:\Users\xia.he\Desktop\HortiSem\hortisem-env\lib\site-packages\starlette\routing.py", line 41, in app
    response = await func(request)
  File "C:\Users\xia.he\Desktop\HortiSem\hortisem-env\lib\site-packages\fastapi\routing.py", line 182, in app
    raw_response = await run_endpoint_function(
  File "C:\Users\xia.he\Desktop\HortiSem\hortisem-env\lib\site-packages\fastapi\routing.py", line 133, in run_endpoint_function
    return await dependant.call(**values)
  File "M:\Projekt\HortiSem\app.py", line 124, in create_upload_file
    ents = predict(input_text,nlp)
  File "M:\Projekt\HortiSem\app.py", line 104, in predict
    doc = nlp_model(text)
  File "C:\Users\xia.he\Desktop\HortiSem\hortisem-env\lib\site-packages\spacy\language.py", line 449, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "M:\Projekt\HortiSem\app.py", line 75, in add_regex_entities
    doc.ents = list(doc.ents) + spans
  File "doc.pyx", line 550, in spacy.tokens.doc.Doc.ents.__set__
  File "doc.pyx", line 1370, in spacy.tokens.doc.get_entity_info
TypeError: object of type 'NoneType' has no len()
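
The last frames of the traceback point at the doc.ents assignment inside add_regex_entities, and "object of type 'NoneType' has no len()" is consistent with one of the items in spans being None. doc.char_span returns None when the character offsets of a match don't line up with token boundaries, so that may be what happens for some inputs. A small sketch for listing the matches that fail to align is below. The helper name find_misaligned_matches is hypothetical, and it only relies on doc.text and doc.char_span.

```python
import re


def find_misaligned_matches(doc, expression):
    """Return (start, end, text) for regex matches that char_span can't convert."""
    misaligned = []
    for match in re.finditer(expression, doc.text):
        start, end = match.span()
        # None means the offsets cut through a token instead of ending on its edge.
        if doc.char_span(start, end) is None:
            misaligned.append((start, end, match.group()))
    return misaligned
```

Running this on a text that triggers the 500 error, with the same regex as in the component, should show whether a match ends in the middle of a token (for example a date glued to the following characters).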