ner.match to jsonl without getting into the interface

Is there a way to write the matches from ner.match to a JSONL file without going through the interface? (I do not see an “--output” argument.)

I would like to remove the overlapping spans and keep only the longest span. For example, in the interface I get “PS-21” once, “PS-21 slips” once and “slips” once from the same tokens. I just want to keep “PS-21 slips”.

Working from spaCy's matching, I wrote a function that removes overlapping spans from the JSONL file. Is there something that will consider only the longest match, or a way to write JSONL directly from ner.match? Either one would be very helpful, because I already have a function that removes overlapping spans from JSONL and keeps only the longest span.

The main idea of ner.match is to give you an interface so you or your annotators can accept and reject matches to bootstrap training sets with positive and negative examples, and to allow creating training data from patterns that produce false positives and to explore patterns interactively.

If you only want to create matches based on patterns, you could just use spaCy’s Matcher directly and save the matches as JSONL? If you do want to annotate with Prodigy but with custom match logic (or any other rules), you could also write your own custom recipe that implements your logic and only yields out examples that you want. Here’s an example of how the stream could be generated:

# assumes nlp (a loaded spaCy pipeline), matcher (a spacy.matcher.Matcher)
# and texts (an iterable of strings) are already defined
def get_stream():
    for doc in nlp.pipe(texts):  # pipe your texts through spaCy
        matches = matcher(doc)
        for match_id, start, end in matches:
            span = doc[start:end]
            # your custom logic here to decide if you want the match
            yield {
                'text': doc.text,
                'spans': [{
                    'start': span.start_char,
                    'end': span.end_char,
                    # use the pattern name as the match label
                    'label': doc.vocab.strings[match_id]
                }]
            }

Do you have an example of the patterns you use? Because unless you have patterns for both spans, or use operators (via the "OP" key), you should only see the actual matches, not partial ones.

Hi Ines,

  1. Thank you. I was checking whether creating a JSONL file was possible (and whether I was missing it), because a case-insensitive match was not possible with the spaCy PhraseMatcher. (Probably because the vector for upper case is different from lower case?) I will try to do the same with the Matcher.

  2. I do have “PS-21 slips”, “PS-21” and “slips” in my dictionary, because if one of those terms is missing from a comment, I still do not want to miss the others. So I am doing a longest-possible match against the sorted dictionary. (The dictionary has about 300 entries.) I need this because these are free-written, domain-specific comments, and people might assume some words are understood by anyone who knows the domain.

Yes, the PhraseMatcher uses Doc objects as match patterns, which makes it more efficient than the token-based matcher, and makes it easier to create patterns (because you don’t have to worry about tokenization and token attributes). However, it also means that it only matches exact phrases.

Okay, that makes sense! If you already have your own script, I’d definitely suggest using spaCy’s Matcher directly and then yielding out the examples in Prodigy’s format. You should be able to do this pretty quickly using a custom recipe.
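
For anyone reading along without such a script, here is a minimal sketch of the longest-match filtering described above. It works on plain span dicts with 'start' and 'end' character offsets, like the ones in the stream example. The function name keep_longest_spans, the sample text and the label "PART" are made up for illustration. Recent spaCy versions also ship spacy.util.filter_spans, which applies a similar longest-first rule to Span objects.

```python
def keep_longest_spans(spans):
    """Drop every span that overlaps a longer one.

    Expects dicts with 'start' and 'end' character offsets.
    """
    # Longest spans first; ties are broken by the earliest start
    ordered = sorted(spans, key=lambda s: (-(s["end"] - s["start"]), s["start"]))
    kept = []
    for span in ordered:
        # Two half-open ranges overlap if each starts before the other ends
        overlaps = any(
            span["start"] < k["end"] and k["start"] < span["end"] for k in kept
        )
        if not overlaps:
            kept.append(span)
    # Return the survivors in document order
    return sorted(kept, key=lambda s: s["start"])


text = "Check PS-21 slips today"
spans = [
    {"start": 6, "end": 11, "label": "PART"},   # "PS-21"
    {"start": 6, "end": 17, "label": "PART"},   # "PS-21 slips"
    {"start": 12, "end": 17, "label": "PART"},  # "slips"
]
longest = keep_longest_spans(spans)
```

With the three overlapping matches from the question, only the “PS-21 slips” span is left.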

The spaCy Matcher worked! Thank you @ines
But instead of writing a custom recipe, I am creating the JSONL format from the matcher output itself.
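
In case it helps others, writing the JSONL file can be as simple as dumping one task dict per line. This is only a sketch: write_jsonl, the sample task, the label "PART" and the output file name are placeholders, and the task dicts are assumed to have the same 'text' and 'spans' shape as the ones in the stream example above.

```python
import json
import os
import tempfile


def write_jsonl(path, tasks):
    """Write one JSON object per line, which is all JSONL is."""
    with open(path, "w", encoding="utf-8") as f:
        for task in tasks:
            f.write(json.dumps(task, ensure_ascii=False) + "\n")


# Hypothetical task, shaped like the dicts the stream example yields
tasks = [
    {
        "text": "Check PS-21 slips today",
        "spans": [{"start": 6, "end": 17, "label": "PART"}],
    },
]

# A temporary directory is used here only so the sketch runs anywhere
out_path = os.path.join(tempfile.mkdtemp(), "matches.jsonl")
write_jsonl(out_path, tasks)
```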
