I have trained a custom NER model with Prodigy, and it gives very good results. For one label, however, it does not work well: sometimes it returns only the first part of the expression as an entity.
For example, for 22" 50' north it returns only 22" 50 as the entity, and
for 50" 23' it returns only 50 " 23, without the last character.
Regex, however, gives me a very precise answer. How can I save the data without that label, label my data with regex only for the one entity that has the problem, and then merge the two together?
Hi! In that case, you probably want to look into writing a simple custom pipeline component that takes the doc.text, uses re.finditer to find the start and end character offset of your match and then uses doc.char_span to create an entity span for the given match.
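To make the offset step concrete, here is a minimal sketch using only the standard library. The pattern, the `COORD` label and the helper name `find_regex_spans` are hypothetical and only illustrate the idea; your own expression will differ.

```python
import re

# Hypothetical pattern: digits + ", optional space, digits + ', optional direction word.
COORD_PATTERN = re.compile(r"""\d+"\s*\d+'(?:\s*(?:north|south|east|west))?""")


def find_regex_spans(text, pattern=COORD_PATTERN, label="COORD"):
    """Return (start_char, end_char, label) for every regex match in text."""
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]


text = 'at 22" 50\' north today'
spans = find_regex_spans(text)
for start, end, label in spans:
    # These character offsets are what you would pass on to doc.char_span.
    print(start, end, label, repr(text[start:end]))
```

The offsets are character positions in the raw text, not token indices, which is why they still have to be mapped onto tokens afterwards.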
If you add the component before the statistical ner component that you trained, it'll take the predefined entities into account when it makes its predictions. So basically, it will only "predict around them". I haven't tested this yet, but something along the lines of this should work:
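import re

def add_regex_entities(doc):
    entities = []
    for match in re.finditer(YOUR_EXPRESSION, doc.text):  # find matches in the text
        start, end = match.span()  # character offsets of the match
        span = doc.char_span(start, end, label="YOUR_LABEL")
        if span is not None:  # None means the offsets don't align with token boundaries
            entities.append(span)
    doc.ents = list(doc.ents) + entities
    return doc

# spaCy v2 style; in v3, register the function with @Language.component and pass its name
nlp.add_pipe(add_regex_entities, before="ner")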
Keep in mind that this won't work if the start/end positions matched by your expression do not map to valid token boundaries. In that case, doc.char_span returns None instead of a span, so no entity can be created for that match (and passing None on to doc.ents would raise an error). For instance, if you want to tag 22" in the string 22"xyz, but spaCy doesn't split it in a way that 22" is its own token, you won't be able to create an entity span for it. You might then need to tweak the tokenization rules a bit, or add a rule to split tokens like this.
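One way to spot such cases before touching the tokenizer is to check which matches line up with token boundaries. The sketch below is only an approximation: it uses whitespace splitting as a stand-in for spaCy's tokenizer (which splits differently), and the helper names are hypothetical. With a real pipeline you would compare against the actual token offsets instead.

```python
import re


def token_boundaries(text):
    """Start and end character offsets of whitespace-delimited tokens (stand-in tokenizer)."""
    tokens = list(re.finditer(r"\S+", text))
    return {t.start() for t in tokens}, {t.end() for t in tokens}


def split_by_alignment(text, pattern):
    """Separate regex matches into those that align with token boundaries and those that do not."""
    starts, ends = token_boundaries(text)
    aligned, misaligned = [], []
    for m in re.finditer(pattern, text):
        target = aligned if (m.start() in starts and m.end() in ends) else misaligned
        target.append((m.start(), m.end(), m.group()))
    return aligned, misaligned


aligned, misaligned = split_by_alignment('22"xyz and 22" 50\'', r'\d+"')
print("aligned:", aligned)
print("misaligned:", misaligned)
```

Matches that end up in the misaligned list are the ones for which an entity span could not be created, so they show where a tokenization rule would be needed.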