Is there a way to write the matches from ner.match to a JSONL file without going through the interface? (I do not see an “–output” argument.)
I would like to remove the overlapping spans and keep only the longest span. For example on the interface I get “PS-21” once and “PS-21 slips” once and “slips” once from the same tokens. I just want to keep “PS-21 slips”.
Based on spaCy's matching, I wrote a function that removes overlapping spans from a JSONL file and keeps only the longest span. Is there an option that considers only the longest match, or a way to write JSONL directly from ner.match? Either one would be very helpful.
The main idea of ner.match is to give you an interface so you or your annotators can accept and reject matches. This lets you bootstrap training sets with positive and negative examples, create training data from patterns that produce false positives, and explore patterns interactively.
If you only want to create matches based on patterns, you could just use spaCy’s Matcher directly and save the matches as JSONL. If you do want to annotate with Prodigy but with custom match logic (or any other rules), you could also write your own custom recipe that implements your logic and only yields out the examples that you want. Here’s an example of how the stream could be generated:
for doc in nlp.pipe(texts):  # pipe your texts through spaCy
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        # your custom logic here to decide if you want the match
        # use the pattern name as the match label
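As one possible version of the "custom logic" step, here is a minimal sketch that keeps only the longest of any overlapping matches. It works on plain dicts with character offsets, which you could build from span.start_char and span.end_char. The function name keep_longest, the "PART" label and the hard-coded offsets are made up for illustration. spaCy also ships spacy.util.filter_spans, which applies a similar longest-first rule to Span objects.

```python
def keep_longest(spans):
    """Return non-overlapping spans, preferring the longest.

    Each span is a dict with character offsets "start" and "end"
    (end exclusive) plus a "label".
    """
    # Longest first; ties go to the span that starts earlier.
    ordered = sorted(spans, key=lambda s: (-(s["end"] - s["start"]), s["start"]))
    kept = []
    for span in ordered:
        # Keep the span only if it lies fully before or fully after
        # every span that was already kept.
        if all(span["end"] <= k["start"] or span["start"] >= k["end"] for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s["start"])


text = "The PS-21 slips were damaged"
# Hypothetical matches, written by hand for this example
spans = [
    {"start": 4, "end": 9, "label": "PART"},    # "PS-21"
    {"start": 4, "end": 15, "label": "PART"},   # "PS-21 slips"
    {"start": 10, "end": 15, "label": "PART"},  # "slips"
]
task = {"text": text, "spans": keep_longest(spans)}
```

Here task["spans"] ends up with the single span covering "PS-21 slips", because the two shorter matches both overlap it.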
Do you have an example of the patterns you use? Because unless you have patterns for both spans, or use operators (via the "OP" key), you should only see the actual matches, not partial ones.
The PhraseMatcher takes Doc objects as match patterns, which makes it more efficient than the token-based matcher, and makes it easier to create patterns (because you don’t have to worry about tokenization and token attributes). However, it also means that it only matches exact phrases.
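To illustrate what "only matches exact phrases" means in practice, here is a small sketch using spaCy's PhraseMatcher with a Doc as the pattern. It assumes spaCy v3 (where matcher.add takes a list of patterns) and uses a blank English pipeline so no trained model is needed. The "PART" label, the helper find_phrases and the example sentences are made up.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only, no trained model needed
matcher = PhraseMatcher(nlp.vocab)
# "PART" is a made-up label; the pattern is a Doc, not a list of token dicts
matcher.add("PART", [nlp.make_doc("PS-21 slips")])


def find_phrases(text):
    """Return the text of every exact phrase match in `text`."""
    doc = nlp.make_doc(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


exact = find_phrases("The PS-21 slips were damaged.")
inexact = find_phrases("The PS-21 slip was damaged.")
```

The first sentence contains the phrase verbatim and produces a match. The second only contains "slip", so nothing is returned for it.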
Okay, that makes sense! If you already have your own script, I’d definitely suggest using spaCy’s Matcher directly and then yielding out the examples in Prodigy’s format. You should be able to do this pretty quickly using a custom recipe.
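For reference, here is a rough sketch of what yielding examples and saving them as JSONL could look like. It assumes each task is a dict with a "text" and a list of "spans" carrying "start", "end" and "label". The helper names make_tasks and write_jsonl and the sample data are hypothetical, and the hand-written examples list stands in for real matcher output.

```python
import json


def make_tasks(examples):
    """Yield one task dict per (text, spans) pair that has matches."""
    for text, spans in examples:
        if spans:  # skip texts without any match
            yield {"text": text, "spans": spans}


def write_jsonl(path, tasks):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        for task in tasks:
            f.write(json.dumps(task) + "\n")


# Hypothetical input standing in for real matcher output
examples = [
    ("The PS-21 slips were damaged", [{"start": 4, "end": 15, "label": "PART"}]),
    ("Nothing to see here", []),
]
tasks = list(make_tasks(examples))
```

Calling write_jsonl("matches.jsonl", tasks) would then produce a file with one line per matched text. The same generator could also serve as the stream of a custom recipe.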
The spaCy Matcher worked! Thank you @ines
But instead of writing a custom recipe, I am creating the JSONL format from the matcher itself.