Merging data annotated by regex with data annotated by Prodigy

robertto · August 7, 2019, 4:09pm

I have trained a custom NER model with Prodigy, and it gives very good results. For one label, however, it does not work well: it sometimes returns only the first part of the expression as the entity. For example:

for 22" 50' north it gives only 22" 50 as the entity, or

for 50" 23' it gives only 50" 23 as the entity, without the last character.

However, a regex gives me very precise matches. How can I save the data without that label, label my data with a regex only for the entity type that has the problem, and then merge the two together?

Many thanks

ines · August 7, 2019, 4:20pm

Hi! In that case, you probably want to look into writing a simple custom pipeline component that takes the doc.text, uses re.finditer to find the start and end character offset of your match and then uses doc.char_span to create an entity span for the given match.
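To make the re.finditer step concrete, here is a minimal standalone sketch that only collects the character offsets, without spaCy. The pattern and the helper name find_char_offsets are hypothetical and only meant to resemble the coordinate strings from the question, so you would need to adapt them to your data:

```python
import re

# Hypothetical pattern for strings such as 22" 50' north; adjust to your data.
COORD_PATTERN = re.compile(r"""\d+"\s*\d+'(?:\s+(?:north|south|east|west))?""")


def find_char_offsets(text, pattern=COORD_PATTERN):
    """Return (start, end, matched_text) for every regex match in text."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


text = "Position 22\" 50' north was logged."
for start, end, matched in find_char_offsets(text):
    # start/end are character offsets, i.e. what doc.char_span expects
    print(start, end, repr(matched))
```

The start and end values are offsets into the raw string, so text[start:end] always gives back the matched text.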

If you add the component before the statistical ner component that you trained, it'll take the predefined entities into account when it makes its predictions. So basically, it will only "predict around them". I haven't tested this yet, but something along the lines of this should work:

import re

def add_regex_entities(doc):
    entities = []
    for match in re.finditer(YOUR_EXPRESSION, doc.text):  # find matches in the text
        start, end = match.span()  # character offsets of the match
        span = doc.char_span(start, end, label="YOUR_LABEL")
        if span is not None:  # None means the offsets don't align with token boundaries
            entities.append(span)
    doc.ents = list(doc.ents) + entities
    return doc

nlp.add_pipe(add_regex_entities, before="ner")

Keep in mind that this won't work if the start/end positions matched by your expression do not actually map to valid token boundaries. In that case, doc.char_span returns None instead of a span, so no entity can be created for the match. For instance, if you want to tag 22" in the string 22"xyz, but spaCy doesn't split it in a way that 22" is its own token, you won't be able to create an entity span for it. You might then need to tweak the tokenization rules a bit, or add a rule to split tokens like this.
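To illustrate the alignment problem without loading a model, here is a rough sketch that uses a plain whitespace split as a stand-in for the tokenizer. That is an assumption made for the example only: spaCy's real tokenizer splits on more than whitespace, and both helper names are made up. With a real pipeline you would compare against token.idx and token.idx + len(token) instead.

```python
import re


def token_offsets(text):
    """Stand-in tokenizer: whitespace-delimited tokens as (start, end) pairs."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def aligns_with_tokens(start, end, offsets):
    """True if start is the start of some token and end is the end of some token."""
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    return start in starts and end in ends


for text in ['reading 22"xyz today', 'reading 22" xyz today']:
    match = re.search(r'\d+"', text)  # the part we would like to tag
    ok = aligns_with_tokens(*match.span(), token_offsets(text))
    print(repr(text), "->", "aligned" if ok else "not aligned")
```

In the first string the match ends in the middle of the token 22"xyz, so it could not become an entity span. In the second it sits on its own token.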
