The main idea of ner.match is to give you an interface so that you or your annotators can accept and reject matches. This lets you bootstrap training sets with positive and negative examples, create training data from patterns that produce false positives, and explore patterns interactively.
If you only want to create matches based on patterns, you could just use spaCy's Matcher directly and save the matches as JSONL. If you do want to annotate with Prodigy but with custom match logic (or any other rules), you could also write your own custom recipe that implements your logic and only yields the examples you want. Here's an example of how the stream could be generated:
    # assumes nlp, matcher and texts are already defined
    def get_stream():
        for doc in nlp.pipe(texts):  # pipe your texts through spaCy
            matches = matcher(doc)
            for match_id, start, end in matches:
                span = doc[start:end]
                # your custom logic here to decide if you want the match
                yield {
                    'text': doc.text,
                    'spans': [{
                        'start': span.start_char,
                        'end': span.end_char,
                        # use the pattern name as the match label
                        'label': doc.vocab.strings[match_id]
                    }]
                }
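If you'd rather skip Prodigy and go the plain Matcher route mentioned above, here's a rough, self-contained sketch of what that could look like. The pattern, the "GPE" label, the example text, the helper names and the matches.jsonl file name are all made up for illustration, and it assumes the spaCy v3-style matcher.add(name, [pattern]) call, so adjust it to your version and data:

```python
import json

import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, which is enough for token-text patterns
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Hypothetical pattern and label, purely for illustration
matcher.add("GPE", [[{"LOWER": "san"}, {"LOWER": "francisco"}]])

texts = ["Apple is opening an office in San Francisco."]


def matches_to_tasks(texts):
    """Yield one dict per match, in the same shape as the stream above."""
    for doc in nlp.pipe(texts):
        for match_id, start, end in matcher(doc):
            span = doc[start:end]
            yield {
                "text": doc.text,
                "spans": [{
                    "start": span.start_char,
                    "end": span.end_char,
                    # use the pattern name as the match label
                    "label": doc.vocab.strings[match_id],
                }],
            }


def write_jsonl(path, tasks):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        for task in tasks:
            f.write(json.dumps(task) + "\n")


tasks = list(matches_to_tasks(texts))

if __name__ == "__main__":
    write_jsonl("matches.jsonl", tasks)
```

Each line of the resulting file is one match with its character offsets, so you can filter or post-process it however you like before using it anywhere else.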
Do you have an example of the patterns you use? Unless you have patterns for both spans, or use operators (via the "OP" key), you should only see the actual matches, not partial ones.
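To illustrate what I mean about operators, here's a small sketch with a made-up pattern where the first token is optional. It assumes spaCy v3, where matcher.add also accepts a greedy argument; the "CITY" label, the text and the matched_texts helper are just for demonstration:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("I moved to New York last year.")

# "new" is optional here, so the pattern can match with or without it
pattern = [{"LOWER": "new", "OP": "?"}, {"LOWER": "york"}]


def matched_texts(greedy=None):
    """Return the text of every match for the pattern above."""
    matcher = Matcher(nlp.vocab)
    matcher.add("CITY", [pattern], greedy=greedy)
    return [doc[start:end].text for _, start, end in matcher(doc)]


# Default behaviour: every span that satisfies the pattern is returned
all_matches = matched_texts()
# Keep only the longest of any overlapping matches
longest_only = matched_texts(greedy="LONGEST")

print(all_matches)
print(longest_only)
```

With the optional token, the default matcher should report both "New York" and the shorter "York", which is the kind of partial match you may be seeing. Passing greedy="LONGEST" is one way to keep only the longer span.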