Does Textcat PatternMatcher scan all the data?

I've created a custom textcat recipe based on the default textcat.teach. The only differences are:
- the model is removed
- the sort order is prefer_high_scores
The idea is that I want to label all the pattern-matched cases, because the number of positive cases is very small.

I suspect the pattern matcher doesn't scan all data.

When I tested this recipe with 10K cases, it gave me only 1 case with a matching pattern to label. I know that more cases have matching patterns. When I split the 10K file into 5 files and ran the same recipe on each of the smaller files, I got more matching cases.

Is it a bug? Is there any way I can force the PatternMatcher to scan all the data?

Here is my pattern matching code:

components = teach(dataset=dataset, spacy_model=spacy_model,
                   source=source, patterns=patterns, label=label)
"""
Collect the best possible training data for a text classification model
with the model in the loop. Based on your annotations, Prodigy will decide
which questions to ask next.
"""
if spacy_model.startswith("blank:"):
    nlp = spacy.blank(spacy_model.replace("blank:", ""))
else:
    nlp = spacy.load(spacy_model)
#model = TextClassifier(nlp, label, long_text=long_text, init_tok2vec=init_tok2vec)
stream = JSONL(source)
if patterns is None:
    nlp = spacy.load(spacy_model)
    #predict = model
    #update = model.update
else:
    matcher = PatternMatcher(
        nlp,
        prior_correct=5.0,
        prior_incorrect=5.0,
        label_span=False,
        label_task=True,
        filter_labels=label,
        combine_matches=True,
        task_hash_keys=("label",),
    )
    matcher = matcher.from_disk(patterns)
    # Combine the textcat model with the PatternMatcher to annotate both
    # match results and predictions, and update both models.
    #predict, update = combine_models(model, matcher)
    predict, update = matcher, matcher.update 
#stream = prefer_uncertain(predict(stream))
stream = prefer_high_scores(predict(stream))
return {
    "view_id": "classification",
    "dataset": dataset,
    "stream": stream,
    "exclude": exclude,
    "update": update
}

Thanks.

The recipe processes one batch at a time and will mix in suggestions from the model with suggestions from the patterns. So it won't be processing the whole stream at once (because the stream could potentially be infinite). It's also applying the prefer_uncertain / prefer_high_scores sorter, which will pick out examples with uncertain / high scores and may discard others to select the best-matching examples. Pattern matches are also assigned a score, so they go through the same sorter and can be discarded as well.
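To illustrate why a sorter can drop matches, here is a minimal sketch in plain Python. It is not Prodigy's actual implementation: toy_prefer_high_scores is a hypothetical stand-in that keeps an example only if its score is at least a running average of the scores seen so far, which is roughly the idea behind the score-based sorters.

```python
from typing import Dict, Iterable, Iterator, Tuple


def toy_prefer_high_scores(
    scored_stream: Iterable[Tuple[float, Dict]], decay: float = 0.9
) -> Iterator[Dict]:
    """Hypothetical sorter: yield examples scoring at least the running average."""
    average = None
    for score, eg in scored_stream:
        if average is None:
            # Start the running average at the first score we see.
            average = score
        if score >= average:
            yield eg
        # Update the moving average; skipped examples are gone for good.
        average = decay * average + (1 - decay) * score


scored = [
    (0.9, {"text": "a"}),
    (0.5, {"text": "b"}),
    (0.5, {"text": "c"}),
    (0.95, {"text": "d"}),
]
kept = list(toy_prefer_high_scores(scored))
print([eg["text"] for eg in kept])
```

The two examples scoring 0.5 never reach the annotator here, even though they were in the stream. The same kind of thing can happen to pattern matches that pass through a sorter.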

If you're only running the matcher, you probably want to make your stream (eg for score, eg in matcher(stream)). If your data is very imbalanced and you want to get a lot of matches in upfront, you can add a step where you're only annotating matches, then pretrain your model and use that as the base model for textcat.teach.
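As a rough sketch of that generator pattern, assuming the matcher yields (score, example) tuples: toy_matcher below is a made-up keyword matcher standing in for the PatternMatcher so the snippet runs without Prodigy installed. Its fixed score of 0.5 is only a placeholder.

```python
from typing import Dict, Iterable, Iterator, Tuple


def toy_matcher(stream: Iterable[Dict], keyword: str) -> Iterator[Tuple[float, Dict]]:
    """Hypothetical stand-in for a matcher: yield (score, example) for keyword hits."""
    for eg in stream:
        if keyword in eg["text"].lower():
            yield 0.5, eg


examples = [
    {"text": "Refund please"},
    {"text": "Hello"},
    {"text": "No refund yet"},
]

# Drop the score and keep every match; no sorter is involved,
# so nothing is discarded after matching.
stream = (eg for score, eg in toy_matcher(examples, "refund"))
matched = list(stream)
print([eg["text"] for eg in matched])
```

Because the stream is a plain generator with no sorter wrapped around it, every matched example is passed along in its original order.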

I guess I can copy the code from ner.manual where it's using the matcher. How can I tell whether a match was returned by the matcher? I'm thinking of using the following code:

pattern_matcher = PatternMatcher(nlp, combine_matches=True, all_examples=True)
pattern_matcher = pattern_matcher.from_disk(patterns)
stream = (eg for _, eg in pattern_matcher(stream))

I guess the following code annotates the matches only. However, if there are no matches in the current batch, how do I jump to the next batch? Right now I get the 'no first example' error after Prodigy starts. Here is my code:

stream = (eg for _, eg in pattern_matcher(stream) if len(eg['meta']['pattern'])>0)

My code above seems to be working.
If I start my code with a fresh dataset, I don't get the 'no first example' error. I think the program scanned all the data and found all the matches.

Glad it worked :) We also just released v1.9.8, which includes the general-purpose match recipe that I talked about in my previous post (which is probably very similar to the custom recipe you wrote).

And yes, the matcher will go through the stream and yield matches – and once a full batch is available (or nothing is left), Prodigy will send that out for annotation. If you see the "no first batch" error, this typically means that there's nothing in the stream. Either because there are no matches, or because all candidates are already in the dataset and annotated.
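Here is a small plain-Python sketch of that batching behaviour. It is not how Prodigy is implemented internally: toy_batches and exclude_seen are hypothetical helpers, and the example text stands in for the task hash that would normally be used to detect already-annotated examples.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Set


def exclude_seen(stream: Iterable[Dict], seen_texts: Set[str]) -> Iterator[Dict]:
    """Hypothetical filter: skip examples that were already annotated."""
    for eg in stream:
        if eg["text"] not in seen_texts:
            yield eg


def toy_batches(stream: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Hypothetical batcher: emit full batches, then whatever is left over."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


matches = [{"text": "a"}, {"text": "b"}, {"text": "c"}]

# Fresh dataset: nothing is excluded, so we get one full batch and one partial batch.
fresh = list(toy_batches(exclude_seen(matches, set()), 2))

# Everything is already annotated: the stream is empty and no first batch exists.
repeat = list(toy_batches(exclude_seen(matches, {"a", "b", "c"}), 2))
print(len(fresh), len(repeat))
```

The second case is the situation that would produce the error on startup: the matches exist, but all of them are filtered out before a first batch can be built.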

Yes, I tried Prodigy 1.9.8. The pattern matching worked! I need to use the -C option in order to get all the pattern-matched cases. If I don't use the -C option, only cases with a model score around 0.5 will be available for annotation.

Thanks a lot for this new function!