Recognizing digits using NER

I have trained a refined named entity recognition model. The model works very well, but I have a problem with one label, which I called:

PARA

It should capture all numbers of 3 to 10 digits. I created the annotations for these numbers with a regex:

import re
from prodigy.util import write_jsonl

label = "PARA"   # whatever label you want to use
texts = sents  # a list of your texts
regex_patterns = [
    re.compile(r'(?<!\S)\d{1,3}(?:,\d{3})+?(?!\S?\S)')
]
examples = []
for text in texts:
    spans = []  # collect the matches of every pattern for this text
    for expression in regex_patterns:
        for match in expression.finditer(text):
            start, end = match.span()
            span = {"start": start, "end": end, "label": label}
            spans.append(span)
    task = {"text": text, "spans": spans}
    examples.append(task)

write_jsonl("NER_PARA_V02.jsonl", examples)
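One thing worth checking: the description asks for numbers of 3 to 10 digits, while the pattern in the script matches comma-grouped numbers such as 1,234. If unbroken digit runs are what is meant, a minimal sketch could look like the following. The pattern, the helper name `find_digit_runs`, and the sample sentence are assumptions for illustration, not part of the original script.

```python
import re

# Assumption: "numbers of 3 to 10 digits" means an unbroken run of digits.
# The lookarounds stop the pattern from matching a slice of a longer run.
DIGIT_RUN = re.compile(r"(?<!\d)\d{3,10}(?!\d)")


def find_digit_runs(text):
    """Return (start, end, matched_text) for each run of 3 to 10 digits."""
    return [(m.start(), m.end(), m.group()) for m in DIGIT_RUN.finditer(text)]


sample = "In 1609 he logged 42 stars under entry 8675309; 12345678901 is too long."
print(find_digit_runs(sample))
```

The two-digit and eleven-digit numbers in the sample are skipped, so only the runs within the requested length are returned.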

But after training, the model cannot recognize all of them, and the result is not satisfying.

What should I do to improve the result?

Many thanks, and congratulations on Prodigy. It is a perfect tool for NLP.

So your label PARA is all digits in the document? Are you sure you want to train a model with a category for all digits in any context? This is such a classic case for regular expressions or similar rules that I doubt it'll be worth the investment. It seems unlikely that a statistical model will perform better than your actual regex – it'll likely be less accurate, much slower and much less transparent.

Looking at your example, I'd suggest writing a custom pipeline component that does what your script does and adds the extracted spans to doc.ents. Then add it before the statistical named entity recognizer in the pipeline, and it'll pre-set all digits.
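A rough sketch of such a component is shown here, assuming the spaCy v3 API (in spaCy v2 the function itself would be passed to nlp.add_pipe). The component name "para_regex", the blank pipeline, and the sample sentence are made up for illustration; the pattern is copied from the question.

```python
import re

import spacy
from spacy.language import Language
from spacy.util import filter_spans

# Pattern copied from the question; it matches comma-grouped numbers like "1,234".
PARA_PATTERN = re.compile(r"(?<!\S)\d{1,3}(?:,\d{3})+?(?!\S?\S)")


@Language.component("para_regex")  # hypothetical component name
def para_regex(doc):
    spans = []
    for match in PARA_PATTERN.finditer(doc.text):
        start, end = match.span()
        # char_span returns None when the match does not align with token boundaries
        span = doc.char_span(start, end, label="PARA")
        if span is not None:
            spans.append(span)
    # Keep entities that are already set and resolve overlaps (longest span wins)
    doc.ents = filter_spans(list(doc.ents) + spans)
    return doc


nlp = spacy.blank("en")  # stand-in for a trained pipeline
# With a trained model you would use: nlp.add_pipe("para_regex", before="ner")
nlp.add_pipe("para_regex")

doc = nlp("The distance is 384,400 km and the radius is 1,737 km.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

Because the component runs before the entity recognizer, the statistical model sees the pre-set PARA spans and predicts the other labels around them.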


Thank you for the quick response. As always, your comments are very helpful. I will work on that and send you some questions as soon as possible, by this afternoon. By the way, do you mean this part:

import spacy

def custom_sentencizer(doc):
    for i, token in enumerate(doc[:-2]):
        # Define sentence start if pipe + titlecase token
        if token.text == "|" and doc[i+1].is_title:
            doc[i+1].is_sent_start = True
        else:
            # Explicitly set sentence start to False otherwise, to tell
            # the parser to leave those tokens alone
            doc[i+1].is_sent_start = False
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(custom_sentencizer, before="parser")  # Insert before the parser
doc = nlp("This is. A sentence. | This is. Another sentence.")
for sent in doc.sents:
    print(sent.text)

I do not know how to add my regex, whether before or after building the model.

You are right! I also feel that in many cases regex is enough. I have a more general question: when do you think it is better to use regex, and when DL? I mean, how do you decide which tool to use? Regex is very quick and can work on all domains.
For example, for which kinds of labels would you recommend using only regex:

I have
DATE, TIME, PARA (numbers), ASTR (astronomical names), LONG (coordinates), STAR, PLAN (planet names), NAMES, GEOM (geometrical names, square...)

When we are working on a specific domain, such as a series of books by one scientist, what is the advantage of having a DL model?

Many thanks