textcat.teach repeating data with --exclude flag set and trained model in the loop

I'm having trouble implementing a single-category text classification training project over multiple sessions. I'm trying to use a set of seed vocabulary with textcat.teach, but when I restart teaching sessions, I'm seeing a lot of repeated data, very similar to this issue. My workflow is as follows:

# create patterns file from vocab list
$ prodigy terms.to-patterns my_seeds my_seeds_patterns.jsonl --label MY_LABEL
# create dataset
$ prodigy dataset my_dataset "description"
# initialize training process based on patterns
$ prodigy textcat.teach my_dataset en_core_web_sm my_text_data.jsonl --label MY_LABEL --patterns my_seeds_patterns.jsonl
# train model on annotations from first teaching session
$ prodigy textcat.batch-train my_dataset --output /tmp/model --eval-split 0.2
# restart second teaching session with the trained model in the loop
$ prodigy textcat.teach my_dataset /tmp/model my_text_data.jsonl --label MY_LABEL --exclude my_dataset --patterns my_seeds_patterns.jsonl

However, I'm still seeing a lot of repeated values when I annotate in the second session. Am I missing something? Is this the expected behaviour?

Hi! So when you see the duplicate examples, are those examples containing pattern matches? If so, this is kind of expected at the moment, because the same example with a different match will currently receive a different hash (which makes sense, because they are different "questions").
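To illustrate the idea (a simplified, hypothetical sketch using hashlib, not Prodigy's actual hashing code): the input hash only considers the raw text, while the task hash also considers the spans and label, so the same text with two different pattern matches gives one input hash but two distinct task hashes:

```python
import hashlib
import json

def input_hash(task):
    # hypothetical simplification: hash only the raw text
    return hashlib.md5(task["text"].encode("utf-8")).hexdigest()

def task_hash(task):
    # hash the text plus spans and label, so a different match
    # makes a different "question"
    payload = json.dumps(
        {"text": task["text"], "spans": task.get("spans", []),
         "label": task.get("label")},
        sort_keys=True,
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

a = {"text": "markets rallied today", "label": "MY_LABEL",
     "spans": [{"start": 0, "end": 7}]}
b = {"text": "markets rallied today", "label": "MY_LABEL",
     "spans": [{"start": 8, "end": 15}]}

print(input_hash(a) == input_hash(b))  # True: same underlying text
print(task_hash(a) == task_hash(b))    # False: different matches
```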

If that's what's happening, you could choose to filter the incoming examples by their input hashes and only send out a question if its text hasn't been annotated before (regardless of which label or pattern match it's asking you about). Here's the idea:

from prodigy import set_hashes
from prodigy.components.db import connect

# In your recipe function, where `dataset` is the dataset name
db = connect()
input_hashes = db.get_input_hashes(dataset)

def filter_stream(stream):
    for eg in stream:
        eg = set_hashes(eg)
        # only ask about texts that haven't been annotated before
        if eg["_input_hash"] not in input_hashes:
            yield eg
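To see what this filtering achieves, here's a self-contained toy version (plain Python, no Prodigy imports; the hash helper is just a stand-in for set_hashes). Two tasks with the same text but different pattern matches collapse to one question:

```python
import hashlib

def input_hash(task):
    # stand-in for Prodigy's _input_hash: based on the raw text only
    return hashlib.md5(task["text"].encode("utf-8")).hexdigest()

def filter_stream(stream, seen):
    # drop any task whose text has already been seen
    for eg in stream:
        h = input_hash(eg)
        if h not in seen:
            seen.add(h)
            yield eg

stream = [
    {"text": "stocks fell sharply", "spans": [{"start": 0, "end": 6}]},
    # same text, different pattern match: different task, same input
    {"text": "stocks fell sharply", "spans": [{"start": 7, "end": 11}]},
    {"text": "a quiet day on the market"},
]
filtered = list(filter_stream(stream, set()))
print(len(filtered))  # 2: the duplicate text is dropped
```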

Thanks for your reply, but I see duplicates whether or not I use pattern matching, so that can’t be the whole story...

Also, I’m seeing duplicates of the “whole text” questions — with no terms highlighted — when I run textcat.teach with the patterns argument.

Let me know if you need more information. Happy to provide.

Thanks for checking! And that's definitely interesting :thinking: If you've identified one example that appeared twice, could you save it to your dataset, run db-out, find the duplicate examples and share both of them (the full task JSONL)? And which version of Prodigy are you running?

Ok, I will check it out and get back to you! A further question: when you're tagging a highlighted seed term, are you supposed to (i.e. does the model 'assume' you are) 1) provide a tag for the use of the highlighted seed in that particular sentence/span, or 2) tag whether the stretch of text as a whole evinces the label you are building tags for?

Ok, I am embarrassed now. I wasn't pressing save! Doh! Maybe I should have slept on it before posting this thread. Sorry to waste your time, and thanks for the response!

No worries, glad you figured it out! (It sounds like one of those classic situations where you do all the "hard parts" right and then the problem ends up being something basic. I can definitely relate!)

To answer your last question:

You're still training a regular text classifier, so you're updating the model with the whole text and the label. In the background, Prodigy is also updating the pattern matcher so it can keep a record of how relevant a certain pattern was, and later prioritise the more relevant patterns. Patterns are only meant to help with the example selection and what they highlight is not specifically used to update the model (but if the matches are certain "trigger" words or phrases, they may still end up being what the text classifier bases its predictions on).
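As a rough mental model of the "keep a record of how relevant a certain pattern was" part (purely hypothetical, not Prodigy's actual implementation): imagine each pattern keeping a running accept/reject count, with relevance as a smoothed acceptance rate used to prioritise future matches.

```python
from collections import defaultdict

class PatternScorer:
    """Toy relevance tracker: illustrates the idea, not the real code."""
    def __init__(self):
        self.accepts = defaultdict(int)
        self.total = defaultdict(int)

    def update(self, pattern_id, accepted):
        # record one annotation decision for this pattern's match
        self.total[pattern_id] += 1
        if accepted:
            self.accepts[pattern_id] += 1

    def relevance(self, pattern_id):
        # smoothed acceptance rate, so unseen patterns start neutral at 0.5
        return (self.accepts[pattern_id] + 1) / (self.total[pattern_id] + 2)

scorer = PatternScorer()
for accepted in (True, True, False):
    scorer.update("seed_market", accepted)
print(round(scorer.relevance("seed_market"), 2))  # (2+1)/(3+2) = 0.6
```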

Thanks so much :) Much appreciated. I realized as I was tagging that I wasn't entirely sure what I was doing. Cheers.