What is the reasoning behind duplicate labeling per pattern?

I am doing a text classification task on long texts. Currently, I am shown each text multiple times in a row, once for each pattern that is detected. What is the reasoning behind this? Is there an advantage to it? Can I maybe turn it off?


Also, if I label one text multiple times and each time it highlights a different input pattern, should I label based on the text as a whole, or based on whether that specific pattern match is good?

Actually, there's not really a deeper reason behind this – it's mostly due to the way the pattern matcher works. Every match is treated as its own "entity". This makes sense for NER – but you're right that for text classification, it would be good to have an option to reconcile matches related to the same text and the same label (even if produced by different trigger patterns). The solution would have to be based on the input hash (unique ID of the input data, e.g. the text) and the label of the match.

In the meantime, you could implement a filter for this yourself that takes the stream of the combined models and at least filters out the duplicates that were already covered by other matches. (Actually reconciling them is a little more difficult, because streams are generators and I'm not sure you can always rely on matches coming in in perfect order.) I haven't tested this yet, but something along those lines should work:

def filter_matches(stream):
    """Only let through one pattern match per (text, label) combination."""
    seen = set()  # (input hash, label) pairs of tasks that we've already seen
    for eg in stream:
        if 'pattern' in eg.get('meta', {}):  # example was produced by a pattern
            match = (eg['_input_hash'], eg['label'])
            if match not in seen:
                # we haven't had a combination of that text and label yet
                seen.add(match)
                yield eg
            # otherwise, skip the duplicate match
        else:  # example was produced by the model – let it through unchanged
            yield eg
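
If you want to plug this into textcat.teach, one option is a small custom recipe that wraps the built-in recipe and filters its stream. This is an untested sketch: it assumes filter_matches from above is defined in the same file, that the built-in recipe lives in prodigy.recipes.textcat, returns a components dict with a 'stream' key and accepts label and patterns keyword arguments – check the recipe signatures for your version.

import prodigy
from prodigy.recipes.textcat import teach  # built-in textcat.teach recipe

@prodigy.recipe('textcat.teach-dedupe')
def teach_dedupe(dataset, spacy_model, source, label=None, patterns=None):
    # Call the built-in recipe and wrap its stream in the filter above
    components = teach(dataset, spacy_model, source, label=label, patterns=patterns)
    components['stream'] = filter_matches(components['stream'])
    return components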

If you're doing text classification, the decision is only about the label in relation to the whole text you're seeing. That's also what you'll be training your model on later.


@ines Hi, I started annotating my dataset using the textcat.teach recipe and I'm facing the same issue. My question is about one more thing: as far as I can understand, the system treats these as separate annotations, so if I do an eval_split, there's a chance that the same text with the same annotation might end up in both the train and eval sets. Is there any way to clean up duplicates? Also, ignored examples seem to be counted as well. Looking forward to your reply.


@sarim-zafar You’re right that that can be a problem. The most robust way to get an evaluation set is to partition up the data prior to annotating. If you evaluate with texts from textcat.teach, you’re evaluating over a biased sample of data (since the examples were selected using the patterns and the model.) This can still be useful, especially for quick-and-dirty evaluations – and of course it beats evaluating on the training data. But it’s still not ideal.
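
For example, here's a minimal sketch of partitioning a JSONL source file into a held-out evaluation file and a training file before you start annotating (the file names and the 20% split are just placeholders):

import json
import random

# Read the raw, unannotated source texts
with open('texts.jsonl', encoding='utf8') as f:
    examples = [json.loads(line) for line in f]

# Shuffle with a fixed seed so the split is reproducible
random.seed(0)
random.shuffle(examples)
split = int(len(examples) * 0.2)  # hold out 20% for evaluation

with open('eval_texts.jsonl', 'w', encoding='utf8') as f:
    for eg in examples[:split]:
        f.write(json.dumps(eg) + '\n')

with open('train_texts.jsonl', 'w', encoding='utf8') as f:
    for eg in examples[split:]:
        f.write(json.dumps(eg) + '\n')

You'd then annotate the training file with textcat.teach and keep the evaluation file for a separate, unbiased annotation pass.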

You can filter duplicates with the prodigy.components.filters.filter_duplicates utility function. I would recommend writing a quick custom recipe that reads in the dataset, cleans it up as you require with filters etc., and saves it out to a new dataset. This lets you resolve these problems with custom logic, which is usually pretty easy to express in Python.
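
Here's an untested sketch of what such a cleanup recipe could look like. It assumes the Database methods connect, get_dataset, add_dataset and add_examples, and that filter_duplicates takes the examples plus by_input/by_task keyword arguments – double-check the components API docs for the exact signatures in your version.

import prodigy
from prodigy.components.db import connect
from prodigy.components.filters import filter_duplicates

@prodigy.recipe('dedupe-dataset')
def dedupe_dataset(in_set, out_set):
    """Copy a dataset into a new dataset, dropping duplicate annotations."""
    db = connect()                     # connect to the database defined in your settings
    examples = db.get_dataset(in_set)  # load all annotations in the source dataset
    # Drop exact duplicates of tasks we've already seen (assumed keyword arguments)
    examples = list(filter_duplicates(examples, by_input=False, by_task=True))
    # Optionally drop ignored answers as well
    examples = [eg for eg in examples if eg.get('answer') != 'ignore']
    db.add_dataset(out_set)
    db.add_examples(examples, datasets=[out_set])
    print("Saved {} cleaned examples to '{}'".format(len(examples), out_set))

You could then run it like any other recipe, e.g. prodigy dedupe-dataset old_set new_set -F recipe.py, and use the cleaned dataset for training and evaluation.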

Even though the problem is easy to fix, I agree that the behaviour isn't ideal. We hope to improve this in future versions.
