regex + training categories


Very new to NLP and spaCy but excited to use these tools.
Looking for some help on how to create a pipeline where I can first label a bunch of entities (PHONE_NUMBER, ADDRESS, ORDER_ID, and others) using regex, and then as a second step categorize the docs in a trained model using

A few things I am unsure of: if the regex is run before the NER, do the regex-labeled entities influence the output of the statistical model? Is there a preferred pattern or example of how to do this? Finally, what I really want is the ability to extract all the entities and tabulate them with their categorical classification. Is there a function that does this already?



Apologies for the delay in replying to this. I missed the thread initially. Sorry!

You can use regex to label entities in spaCy with the EntityRuler component, whose token patterns support regular expressions. If the EntityRuler runs before the NER component, the entities it sets will affect the NER model's predictions, because the NER won't overwrite previously set entities. However, the textcat model doesn't pay attention to the NER classifications, so this won't affect the textcat decisions.
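As a rough illustration of the EntityRuler approach, here is a minimal sketch. It assumes the spaCy v3 API (`nlp.add_pipe("entity_ruler")`), and the `ORDER_ID` label, the regex, and the sample sentence are made up for the example. Note that a `REGEX` in a token pattern is applied to one token at a time, not to the whole text.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components needed.
nlp = spacy.blank("en")

# Assumption: spaCy v3 string-based add_pipe. In a pipeline that already
# contains a trained "ner" component, you would pass before="ner" here so
# the rule-based entities are set first.
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical pattern: a single token of two capital letters and six digits.
ruler.add_patterns([
    {"label": "ORDER_ID", "pattern": [{"TEXT": {"REGEX": r"^[A-Z]{2}\d{6}$"}}]},
])

doc = nlp("Order AB123456 has shipped.")
rows = [(ent.text, ent.label_) for ent in doc.ents]
print(rows)
```

If your expressions need to span several tokens (phone numbers with separators, addresses), a per-token regex like this is probably not enough on its own.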

There's no function to tabulate the entities, because we've preferred to keep the API surface a bit smaller. You should find it easy to do this with your own loop. If you're reading the data from Prodigy, you can get the annotations out using the prodigy db-out command, which will give you newline-delimited JSON that's very easy to work with. If the annotations are already on spaCy Doc objects, you just need to use doc.ents to get the entities.
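The "own loop" could look roughly like the sketch below for newline-delimited JSON. It assumes each record has a "text" field and a "spans" list with "start", "end" and "label" keys; the `tabulate_entities` helper and the sample record are hypothetical, so check them against your actual export.

```python
import json
from collections import Counter


def tabulate_entities(jsonl_lines):
    """Return (label, entity_text) rows from newline-delimited JSON records."""
    rows = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        text = record["text"]
        # Records without annotated spans simply contribute no rows.
        for span in record.get("spans", []):
            rows.append((span["label"], text[span["start"]:span["end"]]))
    return rows


# Made-up sample record standing in for one line of exported data.
sample = [
    '{"text": "Call 555-123-4567 about order AB123456", "spans": ['
    '{"start": 5, "end": 17, "label": "PHONE_NUMBER"}, '
    '{"start": 30, "end": 38, "label": "ORDER_ID"}]}',
]

rows = tabulate_entities(sample)
label_counts = Counter(label for label, _ in rows)
for label, entity_text in rows:
    print(f"{label}\t{entity_text}")
```

For Doc objects the same loop would iterate over doc.ents and read ent.label_ and ent.text, and doc.cats holds the text classifier's scores if you want to put the category next to each row.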

To add to this: Most of the time, your regular expressions will probably be written over the whole doc.text, not on a per-token basis (which is what spaCy's Matcher supports). In that case, you could also write a custom pipeline component that uses re.finditer on the doc.text to find the matches, calls doc.char_span to create a Span object with a given label and adds the spans to doc.ents. Keep in mind that doc.char_span returns None if the character offsets don't line up with token boundaries.

For example, something like this:

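import re

def add_regex_entities(doc):
    label = "SOME_ENTITY_LABEL"
    expression = r'...'  # your regex here
    spans = []
    for match in re.finditer(expression, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end, label=label)
        # char_span returns None if the match doesn't align with token boundaries
        if span is not None:
            spans.append(span)
    # Note: setting doc.ents raises an error if any of the spans overlap
    doc.ents = list(doc.ents) + spans
    return doc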
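
To show how a component like this might be wired into a pipeline, here is a self-contained sketch. It assumes spaCy v3, where function components are registered with `@Language.component`, and it uses `spacy.util.filter_spans` to drop overlapping spans before assigning `doc.ents`. The component name `regex_entities`, the `PHONE_NUMBER` label and the phone regex are placeholders for illustration only.

```python
import re

import spacy
from spacy.language import Language
from spacy.util import filter_spans

# Hypothetical label and pattern, chosen only for illustration.
PHONE_LABEL = "PHONE_NUMBER"
PHONE_RE = re.compile(r"\d{3}-\d{3}-\d{4}")


@Language.component("regex_entities")
def regex_entities(doc):
    spans = []
    for match in PHONE_RE.finditer(doc.text):
        start, end = match.span()
        span = doc.char_span(start, end, label=PHONE_LABEL)
        # Skip matches that don't align with token boundaries.
        if span is not None:
            spans.append(span)
    # filter_spans removes overlaps, preferring the longest spans, so the
    # assignment to doc.ents doesn't fail on conflicting entities.
    doc.ents = filter_spans(list(doc.ents) + spans)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("regex_entities")

doc = nlp("Call 555-123-4567 now")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

Whether dropping the shorter of two overlapping spans is the right policy depends on your data; you may prefer to give the regex matches or the existing entities priority instead.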