textcat.teach repeating data with --exclude flag set and trained model in the loop

cbjrobertson · September 24, 2019, 4:05am

I'm having trouble implementing a single-category text categorization training project over multiple sessions. I'm trying to use a set of seed vocabulary and textcat.teach, but when I restart teaching sessions, I'm seeing a lot of repeated data, very similar to this issue. My workflow is as follows:

### create patterns file from vocab list
$ prodigy terms.to-patterns my_seeds my_seeds_patterns.jsonl --label MY_LABEL
### create dataset
$ prodigy dataset my_dataset "description"
### initialize training process based on patterns
$ prodigy textcat.teach my_dataset en_core_web_sm my_text_data.jsonl --label MY_LABEL --patterns my_seeds_patterns.jsonl
### train model on annotations from first teaching session
$ prodigy textcat.batch-train my_dataset --output /tmp/model --eval-split 0.2
### restart second teaching session with the trained model
$ prodigy textcat.teach my_dataset /tmp/model my_text_data.jsonl --label MY_LABEL --exclude my_dataset --patterns my_seeds_patterns.jsonl

However, I'm still seeing a lot of repeated examples when I annotate in the second session. Am I missing something? Is this the expected behaviour?

ines · September 24, 2019, 8:32am

Hi! So when you see the duplicate examples, are those examples containing pattern matches? If so, this is kind of expected at the moment, because the same example with a different match will currently receive a different hash (which makes sense, because they are different "questions").

If that's what's happening, you could choose to filter the incoming examples by their input hashes and only send out a question if its text hasn't been annotated before (regardless of which label or pattern match it's asking you about). Here's the idea:

from prodigy import set_hashes
from prodigy.components.db import connect

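# In your recipe function, where `dataset` is the dataset name
db = connect()
# Input hashes of all examples already saved to the dataset
input_hashes = db.get_input_hashes(dataset)

def filter_stream(stream):
    # Skip any example whose text has been annotated before
    for eg in stream:
        eg = set_hashes(eg)
        if eg["_input_hash"] not in input_hashes:
            yield eg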
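
To see why filtering on the input hash removes these repeats, here is a minimal, self-contained sketch that runs without Prodigy installed. The `input_hash` and `task_hash` helpers are hypothetical stand-ins (Prodigy's `set_hashes` computes its own values differently). The only assumption carried over is that the input hash depends on the text alone, while the task hash also depends on the label and the highlighted match.

```python
import hashlib
import json


def _digest(obj):
    # Stable stand-in hash; NOT Prodigy's real hashing function.
    payload = json.dumps(obj, sort_keys=True).encode("utf8")
    return hashlib.md5(payload).hexdigest()


def input_hash(eg):
    # Assumption: depends only on the raw text.
    return _digest({"text": eg["text"]})


def task_hash(eg):
    # Assumption: also depends on the label and the highlighted spans.
    return _digest(
        {"text": eg["text"], "label": eg.get("label"), "spans": eg.get("spans", [])}
    )


def filter_stream(stream, seen_input_hashes):
    # Only yield examples whose text hasn't been annotated before.
    for eg in stream:
        if input_hash(eg) not in seen_input_hashes:
            yield eg


annotated = {"text": "I love pizza", "label": "MY_LABEL",
             "spans": [{"start": 2, "end": 6}]}
same_text_other_match = {"text": "I love pizza", "label": "MY_LABEL",
                         "spans": [{"start": 7, "end": 12}]}
new_text = {"text": "The weather is nice", "label": "MY_LABEL"}

seen = {input_hash(annotated)}
remaining = list(filter_stream([same_text_other_match, new_text], seen))
print([eg["text"] for eg in remaining])  # ['The weather is nice']
```

The two "I love pizza" tasks get different task hashes in this sketch (different highlighted match) but share one input hash, so the second one is dropped once the first has been annotated.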

cbjrobertson · September 24, 2019, 4:17pm

Thanks for your reply, but I see duplicates whether or not I use pattern matching, so that can’t be the whole story...

cbjrobertson · September 24, 2019, 5:33pm

Also, I’m seeing duplicates of the “whole text” questions — with no terms highlighted — when I run textcat.teach with the patterns argument.

cbjrobertson · September 24, 2019, 5:33pm

Let me know if you need more information. Happy to provide.

ines · September 24, 2019, 6:33pm

Thanks for checking! And that's definitely interesting. If you've identified one example that appeared twice, could you save it to your dataset, run db-out, find the duplicate examples and share both of them (the full task JSONL)? And which version of Prodigy are you running?
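
One rough way to find the duplicates in the export is to group the tasks by `_input_hash` and print any group with more than one entry. This sketch assumes the export is JSONL (for example, written with something like `prodigy db-out my_dataset > annotations.jsonl`) and that each task carries `_input_hash` and `_task_hash` keys. The function name, file name and sample values are made up for illustration.

```python
import json
from collections import defaultdict


def find_duplicates(lines):
    """Group exported tasks by _input_hash; return groups with 2+ tasks."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task = json.loads(line)
        groups[task.get("_input_hash")].append(task)
    return {h: tasks for h, tasks in groups.items() if len(tasks) > 1}


# Tiny inline sample standing in for the exported file.
sample = [
    '{"text": "a", "_input_hash": 1, "_task_hash": 10, "answer": "accept"}',
    '{"text": "a", "_input_hash": 1, "_task_hash": 11, "answer": "reject"}',
    '{"text": "b", "_input_hash": 2, "_task_hash": 20, "answer": "accept"}',
]
for input_hash, tasks in find_duplicates(sample).items():
    print(input_hash, [t["_task_hash"] for t in tasks])
```

With a real export, an open file handle such as `open("annotations.jsonl", encoding="utf8")` can be passed in place of `sample`.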

cbjrobertson · September 24, 2019, 6:56pm

Ok, I will check it out and get back to you! A further question: when you are tagging an example with a highlighted seed term, are you supposed to be (i.e. does the model 'assume' you are) 1) providing a tag for the use of the highlighted seed in that particular sentence/span, or 2) tagging whether the text as a whole evinces the label you are building tags for?

cbjrobertson · September 24, 2019, 7:18pm

Ok, I am embarrassed now. I wasn't pressing save! Doh! Maybe I should have slept on it before posting this thread. Sorry to waste your time, and thanks for the response!

ines · September 24, 2019, 8:54pm

No worries, glad you figured it out! (It sounds like one of those classic situations where you do all the "hard parts" right and then the problem ends up being something basic. I can definitely relate!)

To answer your last question:

You're still training a regular text classifier, so you're updating the model with the whole text and the label. In the background, Prodigy is also updating the pattern matcher so it can keep a record of how relevant a certain pattern was, and later prioritise the more relevant patterns. Patterns are only meant to help with the example selection and what they highlight is not specifically used to update the model (but if the matches are certain "trigger" words or phrases, they may still end up being what the text classifier bases its predictions on).
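
As a purely illustrative toy (not Prodigy's internals), the split described above can be pictured like this: the classifier update sees only the full text and the label decision, while a separate tally records how often each pattern's matches were accepted. All names here (`to_training_example`, `update_pattern_scores`, the `pattern` key on spans) are assumptions made up for the sketch.

```python
from collections import defaultdict


def to_training_example(task):
    # The classifier sees the whole text plus the label decision;
    # the highlighted spans are deliberately ignored here.
    accepted = task["answer"] == "accept"
    return task["text"], {task["label"]: accepted}


def update_pattern_scores(scores, task):
    # Separate bookkeeping: count accepts/rejects per matched pattern.
    key = "accept" if task["answer"] == "accept" else "reject"
    for span in task.get("spans", []):
        scores[span["pattern"]][key] += 1


scores = defaultdict(lambda: {"accept": 0, "reject": 0})
task = {
    "text": "Great pizza place",
    "label": "MY_LABEL",
    "spans": [{"start": 6, "end": 11, "pattern": 0}],  # hypothetical pattern id
    "answer": "accept",
}

text, cats = to_training_example(task)
update_pattern_scores(scores, task)
print(text, cats)
print(dict(scores))
```

In this toy, two tasks with the same text and answer but different highlighted matches would produce the same training example and differ only in which pattern's tally gets incremented.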

cbjrobertson · September 25, 2019, 12:43am

Thanks so much:) Much appreciated. I realized as I was tagging that I wasn't entirely sure what I was doing. Cheers.
