Textcat.teach not using the pattern file

Hi,
I’m trying the following pattern file :
{"label": "NEG", "pattern": [{"lower": "depressed"}]}
{"label": "NEG", "pattern": [{"lower": "upset"}]}
{"label": "NEG", "pattern": [{"lower": "sad"}]}

However, the texts that come up don’t show anything related to these seeds. Even when a seed word is present in a text that comes up (which is rare), it is not highlighted. I have tested the pattern matcher separately, and it does seem to match these seed words in the sentence. However, Prodigy is not suggesting texts based on these matches.
I’m using the same label, NEG, both in the pattern file and when starting Prodigy, and I saved the pattern file as a .jsonl file.

I tried verifying my pattern file in your pattern demo and came up with the following file:
{"label": "NEG", "pattern": [{"Lower": "upset", "OP": "?"}]}
{"label": "NEG", "pattern": [{"Lower": "sad", "OP": "?"}]}
{"label": "NEG", "pattern": [{"Lower": "depress", "OP": "?"}]}

Even after using this pattern file, there was no luck!

Can you please help?
Thanks

Hi! I think the behaviour you're seeing might be related to what I describe in this thread:

In short, if you're starting from scratch with very unbalanced classes or a very large corpus with fewer match candidates, it can happen that not enough initial matches are produced and that matches found later on are skipped due to their score.

When you run a teach recipe with patterns, Prodigy will combine the pattern matches and the model's suggestions. If no matches are found in a batch of examples, Prodigy will only yield the model's suggestions, which can be very random if the model hasn't learned anything yet.

The pattern matcher also assigns scores to the matches, based on how reliably they produce a match. This makes sense for lots of patterns and matches, because you still want to focus on the most important examples – but in other cases with low match density, this can cause the active learning algorithm to actually skip the few existing matches.

I've discussed some solutions in the thread linked above – for example, for the next release, we'll be updating the logic used to sort and merge the matcher and model, to prevent matches from being skipped. In the meantime, you could try and use a separate step to bootstrap the model. The main problem that the patterns are trying to solve is the cold start: you need enough initial training examples for the model to make meaningful suggestions. So you could first find the matches and bootstrap the initial training set, pre-train the model with that data and then use textcat.teach to improve it. One idea could be to repurpose the ner.match recipe and add a "label": "NEG" to the selected examples. You could also check out the recipe source and write your own, or implement a different matching logic with regular expressions etc.
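
As a rough illustration of the separate bootstrap step with regular expressions, here is a minimal sketch that does not use Prodigy's matcher at all. The function name find_seed_matches, the seed list and the example texts are made up for illustration, and the output dicts are only assumed to need a "text" and a "label" key as described above.

```python
import re

# Same seed terms as in the pattern file (assumed for this example).
SEEDS = ["depressed", "upset", "sad"]

# Case-insensitive whole-word match, similar in spirit to a "lower" token pattern.
SEED_RE = re.compile(r"\b(" + "|".join(map(re.escape, SEEDS)) + r")\b", re.IGNORECASE)


def find_seed_matches(texts, label="NEG"):
    """Yield one candidate example per text that contains at least one seed term."""
    for text in texts:
        spans = [{"start": m.start(), "end": m.end()} for m in SEED_RE.finditer(text)]
        if spans:
            # One example per text, even if several seeds match.
            yield {"text": text, "label": label, "spans": spans}


texts = ["I feel so SAD today.", "The weather is nice.", "She was upset and sad."]
candidates = list(find_seed_matches(texts))
```

Each candidate still needs to be accepted or rejected by a human before it is used for training; the sketch only narrows the data down to texts worth looking at.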

Thanks for your reply.
I also tried it on a small data set containing only 20-30 examples, with the pattern file from my question above. It still shows very few relevant examples and does not highlight the seeds present in the sentence.
Is there any reason why the seeds are not highlighted?
I also tried the solution from the thread "Is there a way to highlight seeded terms in textcat.teach?", but it doesn’t seem to work!

Which version of Prodigy are you using? And if you check in the bottom right corner of the annotation card, does it show “Pattern” with an ID, or not?

When you run textcat.teach with patterns, you’re not only seeing the pattern matches – you’re usually seeing pattern matches and examples suggested by the model. So it’s possible that the model will also score the examples containing your trigger phrases and suggest them for annotation, especially in the beginning, when you’re starting out with a “blank” model that knows nothing about your categories yet and assumes pretty much the same score for everything.

I’m using Prodigy 1.5.1.
In the bottom right corner, it just shows the score.
Yes, the model also scores the examples containing the pattern, but none of the examples had highlighted seeds.

Okay, so that definitely means that the suggestion came from the model, not from the pattern. Internally, Prodigy will aggregate two streams of examples: one from the model and one from the pattern matcher. The sorter will then yield out the “most relevant” examples, so it likely skipped the pattern matches and only showed you the ones from the model. The result will be similar, because you’re still seeing the same example and are annotating the same decision – but it’s obviously not as nice.

Aside from excluding the pattern matches from the sorter, we’ve also been thinking about an option to make the pattern matches have precedence over the model suggestions, at least in the textcat recipes. That way, Prodigy would always show the pattern match and always skip the model suggestion in favour of the pattern match, if both are available.
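
To make the precedence idea concrete, here is a small toy sketch. It is not Prodigy's implementation: the function merge_with_precedence, the "meta" contents and the choice to deduplicate on the raw text are all assumptions made for illustration.

```python
from itertools import chain


def merge_with_precedence(pattern_examples, model_examples):
    """Merge two lists of examples so that each text is shown once,
    preferring the pattern match when both sources suggest the same text."""
    seen = set()
    merged = []
    # Pattern matches come first, so they win whenever a text appears in both.
    for eg in chain(pattern_examples, model_examples):
        if eg["text"] in seen:
            continue
        seen.add(eg["text"])
        merged.append(eg)
    return merged


# Hypothetical data: one text matched by two patterns and also scored by the model.
pattern_examples = [
    {"text": "I am sad and upset", "meta": {"pattern": 1}},
    {"text": "I am sad and upset", "meta": {"pattern": 2}},
]
model_examples = [
    {"text": "I am sad and upset", "meta": {"score": 0.5}},
    {"text": "Nice day", "meta": {"score": 0.5}},
]
merged = merge_with_precedence(pattern_examples, model_examples)
```

With this kind of merge, a text that matches two patterns is also only queued once, at the cost of keeping just the first match's metadata.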

Thanks for your quick reply. So is the thread "Is there a way to highlight seeded terms in textcat.teach?" the only solution for this problem?

Just want to share one finding.
When I replaced prefer_uncertain with prefer_low_scores as the sorter in the textcat.teach recipe, it highlighted the seeded words. However, if a sentence now contains two patterns, that sentence is shown to the user twice.

Also, Prodigy now shows more relevant examples that contain seeds (after switching to prefer_low_scores).

What do you mean by this?
Should I make this edit in the ner.match recipe?

    return {
        'view_id': 'ner',
        'label': 'NEG',
        'dataset': dataset,
        'stream': (eg for _, eg in model(stream)),
        'exclude': exclude
    }

Thanks for the analysis – this makes sense and is consistent with what I suspected above: since the pattern matches also receive a score, they are filtered out if they’re not considered “relevant” enough. This makes sense if there’s a lot of incoming data – but not so much if you’re starting from scratch. So that’s definitely something we want to optimise and provide more settings for.
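
A toy sketch of why the choice of sorter can matter here. The two priority functions below are simplified stand-ins written for this example, not the actual prefer_uncertain and prefer_low_scores implementations, and the scores (including the 0.1 for the pattern match) are invented.

```python
def uncertainty_priority(score):
    # Highest (1.0) when the score is 0.5, lowest (0.0) at 0.0 or 1.0.
    return 1.0 - abs(score - 0.5) * 2


def low_score_priority(score):
    # The lower the score, the higher the priority.
    return 1.0 - score


def top_n(examples, priority, n=2):
    """Keep only the n examples with the highest priority."""
    return sorted(examples, key=lambda eg: priority(eg["score"]), reverse=True)[:n]


# Hypothetical batch: one pattern match with a low score, two model suggestions.
examples = [
    {"text": "pattern match", "score": 0.1, "source": "pattern"},
    {"text": "model guess A", "score": 0.5, "source": "model"},
    {"text": "model guess B", "score": 0.45, "source": "model"},
]

by_uncertainty = top_n(examples, uncertainty_priority)
by_low_score = top_n(examples, low_score_priority)
```

In this made-up batch, ranking by uncertainty keeps only the two model suggestions, while ranking by low score puts the pattern match first.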

No, this thread describes a solution for an old version of Prodigy that didn’t yet support the full highlighting for textcat recipes and only highlighted the terms for ner. As I mention in my comment here, this update was shipped in v1.4.0.

If you just want to find matches in your data to pre-train the model, I would suggest repurposing the ner.match recipe which does exactly that: it takes patterns, finds the matches and asks you for feedback.

Sorry if my description was unclear. I meant editing the data you collect afterwards to add a "label", so you can use the data in textcat.batch-train. For example, once you’re done with ner.match, you can export the data:

prodigy db-out your_match_dataset > data.jsonl

Then run a quick search and replace and add "label": "NEG" to each entry in the JSONL and add a new dataset for the converted annotations. You can then pre-train your text classification model from that:

prodigy dataset textcat_match_dataset "Converted dataset with added labels"
prodigy db-in textcat_match_dataset data_converted.jsonl
prodigy textcat.batch-train textcat_match_dataset ... # etc
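
If you'd rather not do the search and replace by hand, the conversion step could look roughly like the sketch below. The helper add_label is made up for this example; the file names are the ones used in the commands above, and the script assumes every non-empty line of the export is a JSON object.

```python
import json


def add_label(in_path, out_path, label="NEG"):
    """Copy a JSONL file, setting "label" on every record.

    Returns the number of records written.
    """
    count = 0
    with open(in_path, encoding="utf8") as f_in, \
            open(out_path, "w", encoding="utf8") as f_out:
        for line in f_in:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            record["label"] = label
            f_out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Calling add_label("data.jsonl", "data_converted.jsonl") would then produce the file used in the db-in command. All other keys, such as "text" and "answer", are left as they were in the export.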

Once you have a model that’s learned a bit more about your "NEG" label, you can load it into textcat.teach and start improving the model, without the immediate need to use patterns for bootstrapping.

Did you implement this where the pattern takes precedence?