Bootstrapping using rule-based matching - handling conflicting patterns within a single text

Hello there,

I would like to use rule-based matching for bootstrapping single-label multi-class text classification.
If I understand correctly, Prodigy uses the patterns to suggest texts that match a defined pattern, together with the corresponding label. But how are cases handled where a single text matches patterns from different categories? Would you recommend specifying fewer seed words to avoid these cases at all costs? Thank you!!

Trying to understand the dynamics, I have defined over 1000 patterns for one of my categories. As a result, the very same texts are being suggested again and again (based on different patterns).

From this, I assume that a text is suggested once for each of the conflicting patterns.

So far, I haven't found a neat way to prevent the very same text from being displayed under the identical label, just with a different pattern number...

Hi @janp,

This question has come up before, so we've been thinking about how to add some extra options to the built-in recipe to control this. However, one of the ideas behind Prodigy is that everyone wants slightly different behaviours, and the easiest way to get what you want is to put the pieces together yourself into a custom recipe.

You can find a discussion of how to filter the stream to prevent the duplicate texts in this thread: "textcat.teach presents same annotation task if text snippet contains multiple patterns". I think if you add the stream filter Ines suggests there, you should no longer be asked redundant questions.

One thing to keep in mind: since you're working with several labels, you'll want the filter to still let through different text/label combinations. So make sure you key the filter by both the text and the label.
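
To make that concrete, here is a minimal sketch of such a filter. It assumes each task in the stream is a plain dict with "text" and "label" keys; the function name and the sample tasks are made up for illustration, and this is not the code of the built-in recipe. If your stream yields something other than plain dicts at the point where you apply the filter, the key would need adapting.

```python
def filter_duplicate_tasks(stream):
    """Yield each (text, label) combination only once.

    Assumes every task is a dict with a "text" key and, optionally,
    a "label" key. Tasks that differ only in which pattern matched
    are treated as duplicates.
    """
    seen = set()
    for task in stream:
        # Key by text AND label, so the same text can still be
        # suggested under a different label.
        key = (task["text"], task.get("label"))
        if key in seen:
            continue
        seen.add(key)
        yield task


# Hypothetical sample tasks: the first two differ only by pattern.
tasks = [
    {"text": "Great battery life", "label": "LABEL_A", "meta": {"pattern": 3}},
    {"text": "Great battery life", "label": "LABEL_A", "meta": {"pattern": 17}},
    {"text": "Great battery life", "label": "LABEL_B", "meta": {"pattern": 42}},
]
unique_tasks = list(filter_duplicate_tasks(tasks))
```

In a custom recipe you would wrap the stream with this generator before returning it. Note that the set of seen keys lives in memory for the whole session, which should be fine for typical dataset sizes.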

Hey @honnibal,

thank you for the tip!

Could I annotate the dataset one label after another instead? Is there any difference between running textcat.teach --label LABEL_A,LABEL_B and running it first for the former label and then for the latter?

You can definitely run it one label after another. That'll work fine -- and is actually what we'd recommend in many situations.