Hi! In that case, you probably want to look into writing a simple custom pipeline component that uses re.finditer to find the start and end character offsets of your matches, and then uses doc.char_span to create an entity span for each match.
If you add the component before the statistical
ner component that you trained, it'll take the predefined entities into account when it makes its predictions. So basically, it will only "predict around them". I haven't tested this yet, but something along the lines of this should work:
entities = []
for match in re.finditer(YOUR_EXPRESSION, doc.text):  # find matches in the text
    start, end = match.span()  # get the matched character offsets
    span = doc.char_span(start, end, label="YOUR_LABEL")
    entities.append(span)
doc.ents = list(doc.ents) + entities
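To make this concrete, here's a minimal sketch of that logic wrapped up as a registered pipeline component (assuming spaCy v3; the pattern, label, and component name are made up for illustration, and I'm using a blank English pipeline here, so there's no ner component to order against):

```python
import re
import spacy
from spacy.language import Language

EXPRESSION = r"\d+%"  # hypothetical pattern; use your own expression
LABEL = "PERCENT"     # hypothetical label; use your own label

@Language.component("regex_entities")
def regex_entities(doc):
    entities = []
    for match in re.finditer(EXPRESSION, doc.text):
        start, end = match.span()  # matched character offsets
        span = doc.char_span(start, end, label=LABEL)
        if span is not None:  # skip matches that don't align to token boundaries
            entities.append(span)
    doc.ents = list(doc.ents) + entities
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("regex_entities")
doc = nlp("Growth was 20% last year.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

With a trained pipeline you'd instead call nlp.add_pipe("regex_entities", before="ner") so the component runs first and the statistical ner predicts around the predefined entities.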
Keep in mind that this won't work if the start/end positions matched by your expression do not actually map to valid tokens. In that case, spaCy will raise an error. For instance, if you want to tag 22" in the string 22"xyz, but spaCy doesn't split it in a way that 22" is its own token, you also won't be able to create an entity span for it. In that case, you might need to tweak the tokenization rules a bit, or add a rule to split tokens like this.
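For example, one way to tweak the rules is to add an extra infix pattern so the tokenizer splits after the quote character when it follows a digit. This is just a sketch, assuming spaCy v3 and a blank English pipeline (the QUANTITY label and the example sentence are made up):

```python
import re
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")  # swap in your trained pipeline

# Add an infix rule: split on a quote character that follows a digit,
# so 22"xyz becomes the tokens 22 / " / xyz instead of one token.
infixes = list(nlp.Defaults.infixes) + [r'(?<=\d)"']
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp('The part is 22"xyz wide.')
print([t.text for t in doc])

# 22" now ends on a token boundary, so char_span succeeds
start = doc.text.index('22"')
span = doc.char_span(start, start + 3, label="QUANTITY")  # hypothetical label
print(span)
```

After the split, the span for 22" covers two tokens (22 and "), which is fine for doc.char_span, because the character offsets line up with token boundaries.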