Does Textcat PatternMatcher scan all the data?

curious · March 13, 2020, 1:57pm

I've created a custom textcat recipe based on the default textcat.teach. The only differences are:
the model is removed
the sort order is prefer_high_scores.
The idea is that I want to label all the pattern-matched cases, because the number of positive cases is very small.

I suspect the pattern matcher doesn't scan all data.

When I tested this recipe with 10K cases, it only gave me 1 case with a matching pattern to label. I know that more cases have matching patterns. When I split the 10K file into 5 files and ran the same recipe on each smaller file, I got more matching cases.

Is it a bug? Is there any way I can force the PatternMatcher to scan all the data?

Here is my pattern matching code:

components = teach(dataset=dataset, spacy_model=spacy_model,
                   source=source, patterns=patterns, label=label)
"""
Collect the best possible training data for a text classification model
with the model in the loop. Based on your annotations, Prodigy will decide
which questions to ask next.
"""
if spacy_model.startswith("blank:"):
    nlp = spacy.blank(spacy_model.replace("blank:", ""))
else:
    nlp = spacy.load(spacy_model)
#model = TextClassifier(nlp, label, long_text=long_text, init_tok2vec=init_tok2vec)
stream = JSONL(source)
if patterns is None:
    nlp = spacy.load(spacy_model)
    #predict = model
    #update = model.update
else:
    matcher = PatternMatcher(
        nlp,
        prior_correct=5.0,
        prior_incorrect=5.0,
        label_span=False,
        label_task=True,
        filter_labels=label,
        combine_matches=True,
        task_hash_keys=("label",),
    )
    matcher = matcher.from_disk(patterns)
    # Combine the textcat model with the PatternMatcher to annotate both
    # match results and predictions, and update both models.
    #predict, update = combine_models(model, matcher)
    predict, update = matcher, matcher.update 
#stream = prefer_uncertain(predict(stream))
stream = prefer_high_scores(predict(stream))
return {
    "view_id": "classification",
    "dataset": dataset,
    "stream": stream,
    "exclude": exclude,
    "update": update
}

Thanks.

ines · March 13, 2020, 3:34pm

The recipe processes one batch at a time and mixes suggestions from the model with suggestions from the patterns. So it won't process the whole stream at once (because the stream could potentially be infinite). It also applies the prefer_uncertain / prefer_high_scores sorter, which selects examples with uncertain / high scores and may discard others in order to keep the best-matching examples. Pattern matches are also assigned a score.
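
As a rough illustration of why a score-based sorter can hide matches, here is a toy sorter in plain Python. It is not Prodigy's implementation of prefer_high_scores: the name `toy_prefer_high_scores`, the running-average rule and the score values are all assumptions made for this sketch. It takes a stream of (score, example) tuples and emits an example only if its score beats a running average of the scores seen so far.

```python
def toy_prefer_high_scores(scored_stream, smoothing=0.5):
    """Toy sorter (an assumption, not Prodigy's code): yield an example only
    if its score is above an exponential moving average of past scores."""
    average = 0.0
    for score, eg in scored_stream:
        if score > average:
            yield eg
        # Update the running average after the decision has been made.
        average = smoothing * average + (1 - smoothing) * score


# Two high-scoring examples followed by three lower-scoring ones.
# The scores are arbitrary illustration values.
scored = [
    (0.9, {"id": 0}),
    (0.9, {"id": 1}),
    (0.5, {"id": 2}),
    (0.5, {"id": 3}),
    (0.5, {"id": 4}),
]
kept = [eg["id"] for eg in toy_prefer_high_scores(scored)]
```

In this toy setup, the later examples are skipped even though nothing is wrong with them, simply because the running average has already been pushed above their score.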

If you're only running the matcher, you probably want to make your stream (eg for score, eg in matcher(stream)). If your data is very imbalanced and you want to get a lot of matches in up front, you can add a step where you're only annotating matches, then pretrain your model and use that as the base model for textcat.teach.
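
A minimal sketch of that idea, using a hypothetical stand-in `toy_matcher` instead of Prodigy's PatternMatcher so that it runs without Prodigy installed. The only thing assumed about the real matcher is that it yields (score, example) tuples. The keyword lookup, the fixed score of 0.5 and the `meta["pattern"]` value are illustration choices, not Prodigy behaviour.

```python
def toy_matcher(stream, keywords):
    """Hypothetical stand-in for a pattern matcher: yields (score, example)
    tuples for examples whose text contains one of the keywords."""
    for eg in stream:
        hits = [kw for kw in keywords if kw in eg["text"].lower()]
        if hits:
            # Copy the example so the input stream is left untouched.
            matched = dict(eg, meta={"pattern": ", ".join(hits)})
            yield 0.5, matched


examples = [
    {"text": "The refund never arrived"},
    {"text": "Great service, thanks"},
    {"text": "I want a refund now"},
]

# Drop the score and keep only the matched examples. Nothing is sorted or
# filtered by score here, so every match stays in the stream.
stream = (eg for score, eg in toy_matcher(examples, ["refund"]))
matches = list(stream)
```

Because no sorter wraps the stream, the number of examples that come out equals the number of matches in the input.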

curious · March 13, 2020, 5:45pm

I guess I can copy the code from ner.manual where it's using the matcher. How can I tell whether a match was returned from the matcher? I'm thinking of using the following code:

pattern_matcher = PatternMatcher(nlp, combine_matches=True, all_examples=True)
pattern_matcher = pattern_matcher.from_disk(patterns)
stream = (eg for _, eg in pattern_matcher(stream))

curious · March 13, 2020, 6:28pm

I guess the following code will annotate the matches only. However, if there are no matches in the current batch, how do I jump to the next batch? Right now I get the 'no first example' error after Prodigy starts. Here is my code:

stream = (eg for _, eg in pattern_matcher(stream) if len(eg['meta']['pattern'])>0)

curious · March 13, 2020, 7:53pm

My code above seems to be working.
If I start my code with a fresh dataset, I don't get the 'no first example' error. I think the program scanned all the data and found all the matches.

ines · March 14, 2020, 7:04pm

Glad it worked! We also just released v1.9.8, which includes the general-purpose match recipe that I talked about in my previous post (which is probably very similar to the custom recipe you wrote).

And yes, the matcher will go through the stream and yield matches – and once a full batch is available (or nothing is left), Prodigy will send that out for annotation. If you see the "no first batch" error, this typically means that there's nothing in the stream. Either because there are no matches, or because all candidates are already in the dataset and annotated.
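
To illustrate the batching behaviour, here is a rough sketch in plain Python. `get_batch` and `is_match` are hypothetical helpers, not Prodigy functions, and the batch size of 3 is an arbitrary value. It shows that a lazy filter keeps reading the source until a batch is full or the source runs out, and that a stream with no matches gives an empty first batch.

```python
from itertools import islice


def is_match(eg):
    """Hypothetical match test standing in for the pattern matcher."""
    return "refund" in eg["text"].lower()


def get_batch(stream, batch_size):
    """Pull up to batch_size items from a lazy stream (hypothetical helper)."""
    return list(islice(stream, batch_size))


consumed = []  # records how far into the source the filter had to read


def source():
    for i in range(10_000):
        consumed.append(i)
        # Only every 1000th text matches, mimicking rare positive cases.
        text = "refund please" if i % 1000 == 0 else "nothing here"
        yield {"id": i, "text": text}


matches_only = (eg for eg in source() if is_match(eg))

first_batch = get_batch(matches_only, batch_size=3)
# The filter had to scan past all the non-matching texts to fill the batch.
scanned_for_first_batch = len(consumed)

# A stream with no matches at all produces an empty first batch.
empty_batch = get_batch((eg for eg in [] if is_match(eg)), batch_size=3)
```

Here the first batch is filled only after the filter has read well past the first few thousand texts, which is why a match-only stream can take a while to start on sparse data.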

curious · March 16, 2020, 8:35pm

Yes, I tried Prodigy 1.9.8. The pattern match worked! I need to use the -C option in order to get all the pattern-matched cases. If I don't use the -C option, then only cases with a model score around 0.5 will be available for annotation.

Thanks a lot for this new function!
