I've created a custom textcat recipe based on the default textcat.teach. The only differences are:
the model is removed
the sort order is prefer_high_scores.
The idea is that I want to label all the pattern-matched cases, because the number of positive cases is very small.
I suspect the pattern matcher doesn't scan all the data.
When I tested this recipe with 10K cases, it only gave me 1 case with a matching pattern to label. I know that more cases have matching patterns. If I split the 10K file into 5 files and run the same recipe again on each smaller file, I get more matching cases.
Is this a bug? Is there any way I can force the PatternMatcher to scan all the data?
Here is my pattern matching code:
components = teach(dataset=dataset, spacy_model=spacy_model,
                   source=source, patterns=patterns, label=label)
"""
Collect the best possible training data for a text classification model
with the model in the loop. Based on your annotations, Prodigy will decide
which questions to ask next.
"""
if spacy_model.startswith("blank:"):
    nlp = spacy.blank(spacy_model.replace("blank:", ""))
else:
    nlp = spacy.load(spacy_model)
#model = TextClassifier(nlp, label, long_text=long_text, init_tok2vec=init_tok2vec)
stream = JSONL(source)
if patterns is None:
    # Note: with the model removed, predict and update are only set in the
    # else branch below, so this recipe needs patterns to run.
    nlp = spacy.load(spacy_model)
    #predict = model
    #update = model.update
else:
    matcher = PatternMatcher(
        nlp,
        prior_correct=5.0,
        prior_incorrect=5.0,
        label_span=False,
        label_task=True,
        filter_labels=label,
        combine_matches=True,
        task_hash_keys=("label",),
    )
    matcher = matcher.from_disk(patterns)
    # Combine the textcat model with the PatternMatcher to annotate both
    # match results and predictions, and update both models.
    #predict, update = combine_models(model, matcher)
    predict, update = matcher, matcher.update
#stream = prefer_uncertain(predict(stream))
stream = prefer_high_scores(predict(stream))
return {
    "view_id": "classification",
    "dataset": dataset,
    "stream": stream,
    "exclude": exclude,
    "update": update
}
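One check that might help separate the two suspects (the matcher vs. the sorter) is to bypass the sorter and pass every scored example straight through. The sketch below is only an illustration: it assumes predict(stream) yields (score, example) tuples, which is the shape prefer_high_scores consumes, and the names drop_scores and fake_scored are hypothetical, not part of Prodigy.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

Example = Dict[str, Any]
Scored = Tuple[float, Example]


def drop_scores(scored_stream: Iterable[Scored]) -> Iterator[Example]:
    """Yield every example from a stream of (score, example) tuples.

    Nothing is filtered and the incoming order is preserved, so whatever
    the upstream component produced is exactly what comes out.
    """
    for _score, example in scored_stream:
        yield example


# Hypothetical stand-in for predict(stream); the data is made up and only
# shows the assumed (score, example) shape.
fake_scored = [
    (0.5, {"text": "first match", "label": "POSITIVE"}),
    (0.5, {"text": "second match", "label": "POSITIVE"}),
    (0.5, {"text": "third match", "label": "POSITIVE"}),
]

all_examples = list(drop_scores(fake_scored))
print(len(all_examples))
```

In the recipe above, this would mean trying stream = drop_scores(predict(stream)) in place of the prefer_high_scores line. If more matches show up on the same 10K file, the sorter is the likelier cause; if not, the matcher or the patterns deserve a closer look.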
Thanks.