Very new to NLP and spaCy, but excited to use these tools.
Looking for some help on how to create a pipeline where I can initially label a bunch of entities (PHONE_NUMBER, ADDRESS, ORDER_ID, others) using regex, and then as a second step categorize the docs with a trained textcat model.
A few things I'm unsure of: if the regex is performed before the NER, do the regex-labeled entities influence the output of the statistical model? Is there a preferred pattern or example of how to do this? Finally, what I'm really after is the ability to extract all the entities and tabulate them with their categorical classification. Is there a function that does this already?
Apologies for the delay replying to this --- I missed the thread initially. Sorry!
You can use regex to classify entities in spaCy, using the `EntityRuler` component. The entities you set will affect the NER model's predictions, because the NER won't overwrite previously set entities. However, the textcat model doesn't pay attention to the NER classifications, so this won't affect the textcat decisions.
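To make that concrete, here's a minimal sketch of the `EntityRuler` with token-level regex patterns. The labels and patterns are placeholders for your own, and it uses a blank pipeline so it runs without a trained model:

```python
import spacy

nlp = spacy.blank("en")
# in a full pipeline you'd pass before="ner" so the ruler runs first
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # token-level patterns; per-token regex is supported via {"REGEX": ...}
    {"label": "PHONE_NUMBER", "pattern": [{"TEXT": {"REGEX": r"^\d{10}$"}}]},
    {"label": "ORDER_ID", "pattern": [{"LOWER": "order"}, {"TEXT": {"REGEX": r"^\d+$"}}]},
])

doc = nlp("Call 5551234567 about order 98765.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

Note that `REGEX` here matches against individual tokens, not the whole text — which is exactly the limitation the follow-up below addresses.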
There's no function to tabulate the entities, because we've preferred to keep the API surface a bit smaller. You should find it easy to do this with your own loop. If you're reading the data from Prodigy, you can get the annotations out using the `prodigy db-out` command, which will give you newline-delimited JSON that's very easy to work with. If the annotations are already on spaCy `Doc` objects, you just need to use `doc.ents` to get the entities.
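Such a loop might look like the sketch below. It assumes `nlp` is your full pipeline; here a blank pipeline with an `EntityRuler` stands in, so `doc.cats` is empty and the category falls back to `None` (with a trained textcat, `doc.cats` maps category names to scores):

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORDER_ID", "pattern": [{"LOWER": "order"}, {"TEXT": {"REGEX": r"^\d+$"}}]},
])

texts = ["Please check order 12345.", "My order 678 never arrived."]
rows = []
for doc in nlp.pipe(texts):
    # pick the highest-scoring textcat label, if a textcat has run
    top_cat = max(doc.cats, key=doc.cats.get) if doc.cats else None
    for ent in doc.ents:
        rows.append({"entity": ent.text, "label": ent.label_, "category": top_cat})
```

`rows` is then a flat list of dicts you can feed straight into `pandas.DataFrame`, CSV, etc.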
To add to this: most of the time, your regular expressions will probably be written over the whole `doc.text`, not on a per-token basis (which is what spaCy's `Matcher` supports). In that case, you could also write a custom pipeline component that uses `re.finditer` on the `doc.text` to find the matches, calls `doc.char_span` to create a `Span` object with a given label, and adds the spans to the `doc.ents`.
For example, something like this:
```python
import re

label = "SOME_ENTITY_LABEL"
expression = r'...'  # your regex here
spans = []
for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end, label=label)
    # char_span returns None if the match doesn't align to token boundaries
    if span is not None:
        spans.append(span)
doc.ents = list(doc.ents) + spans
```
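Wrapped up as an actual pipeline component, that could look like the sketch below, using spaCy v3's `@Language.component` decorator. The component name `"regex_entities"` and the pattern are placeholders:

```python
import re

import spacy
from spacy.language import Language


@Language.component("regex_entities")
def regex_entities(doc):
    label = "ORDER_ID"          # placeholder label
    expression = r"order \d+"   # placeholder regex
    spans = []
    for match in re.finditer(expression, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end, label=label)
        if span is not None:  # skip matches that cut across token boundaries
            spans.append(span)
    doc.ents = list(doc.ents) + spans
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("regex_entities")
doc = nlp("Re: order 4521")
```

Registering it this way means the component travels with the pipeline config, and you can position it with `before="ner"` in a full pipeline.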