Bootstrapping using rule-based matching - handling conflicting patterns within single text

JanP · October 29, 2019, 12:48pm

Hello there,

I would like to use rule-based matching for bootstrapping single-label multi-class text classification.
If I understand correctly, Prodigy uses the patterns to label texts that contain a defined pattern with the corresponding label. But how are cases handled where a single text contains patterns from several different categories? Do you recommend specifying fewer seed words to avoid these cases at all costs? Thank you!!

JanP · November 1, 2019, 3:41pm

Trying to understand the dynamics, I have defined over 1000 patterns for one of my categories. As a result, the very same texts are being suggested again and again (based on different patterns).

From this, I assume that a text will be suggested once for each of the conflicting patterns.

So far, I haven't found a neat way to prevent the very same text from being displayed under the identical label, just with a different pattern number...

honnibal · November 1, 2019, 4:12pm

Hi @janp,

This question has come up before, so we've been thinking about how to add some extra options to the built-in recipe to control this. However, one of the ideas behind Prodigy is that everyone wants slightly different behaviours, and the easiest way to get what you want is to put the pieces together yourself into a custom recipe.

You can find a discussion of how to filter the stream to prevent duplicate texts in this thread: "textcat.teach presents same annotation task if text snippet contains multiple patterns". I think if you add the stream filter Ines is suggesting there, it should ensure that you're not asked redundant questions.

One thing to keep in mind is, since you're doing a multilabel problem, you'll want to make sure it can ask you about different text/label combinations. So you want to make sure you're keying the filter by both the text and the label.
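
As a rough illustration of keying the filter by both text and label, here is a minimal sketch in plain Python. It is not the filter from the linked thread and does not call any Prodigy API. It only assumes that each task in the stream is a dict with a "text" key and a "label" key; the function name `filter_seen_text_label` and the sample data are made up for this example.

```python
def filter_seen_text_label(stream):
    """Yield each (text, label) combination only once.

    Assumes every task is a dict with "text" and "label" keys.
    The same text can still come through under a different label.
    """
    seen = set()
    for task in stream:
        key = (task["text"], task.get("label"))
        if key in seen:
            # Same text and same label, matched by another pattern: skip it.
            continue
        seen.add(key)
        yield task


# Hypothetical sample stream: one text matched by two patterns of LABEL_A.
examples = [
    {"text": "first text", "label": "LABEL_A", "meta": {"pattern": 1}},
    {"text": "first text", "label": "LABEL_A", "meta": {"pattern": 7}},
    {"text": "first text", "label": "LABEL_B", "meta": {"pattern": 3}},
    {"text": "second text", "label": "LABEL_A", "meta": {"pattern": 2}},
]

for task in filter_seen_text_label(examples):
    print(task["text"], task["label"], task["meta"]["pattern"])
```

In a custom recipe you would wrap the stream with a generator like this before returning it. Note that the set of seen keys lives in memory, so it only covers the current session.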

JanP · November 1, 2019, 4:59pm

Hey @honnibal,

thank you for the tip!

Could I annotate the dataset one label after another instead? Is there any difference between running textcat.teach --label LABEL_A,LABEL_B and running it first for the former label and then for the latter?

honnibal · November 1, 2019, 5:05pm

You can definitely run it one label after another. That'll work fine -- and is actually what we'd recommend in many situations.
