Hi Prodigy forum. Newbie here (in more ways than one). I'm just getting started trying to train a text classifier. I'm working with a starting set of maybe 20 seed terms and a large dataset of video transcripts (in 50-word segments). When I fired up textcat.teach, the first three or so segments it showed me each contained one of the seed terms, but none of them were relevant to the type of speech I'm looking for, so I marked them accordingly. From that point onwards, however, it seems to have just proceeded to display all of the segments in the dataset, in order. In other words, although there are many more seed term matches in the dataset, it doesn't seem to be prioritizing them in any way, though it does highlight them in the rare cases when they do appear.
Now, I will say that after looking at about 800 segments, only one matched the kind of speech I'm looking for. Might the problem be that the model needs more positive feedback (i.e. do I need to refine the dataset)?
> From that point onwards, however, it seems to have just proceeded to display all of the segments in the dataset, in order. In other words, although there are many more seed term matches in the dataset, it doesn't seem to be prioritizing them in any way, though it does highlight them in the rare cases when they do appear.
That might be a bug, so I'm wondering if there's a way for me to reproduce your experience. Is it possible for you to send a few examples that demonstrate that the patterns get highlighted but not prioritised?
A small note: you mentioned the word "seed" in your description. You may have seen the video shown in the docs here. To repeat what's listed under the video: when it was created, Prodigy didn't yet support a --patterns option for textcat.teach, only exact string matches with seed terms.
Instead of --seeds, you can now use --patterns with more abstract token descriptions like [{"lemma": "idiot"}] to match all mentions of “idiot”, “idiots”, “IDIOTS” and so on. This might help you match more interesting documents without having to spell out every possible seed word.
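If you want to quickly check what a token pattern will catch before handing it to Prodigy, here's a minimal sketch (assuming spaCy v3, an installed en_core_web_sm model, and a made-up test sentence) that runs the same kind of pattern through spaCy's Matcher, which is essentially what Prodigy's pattern matching builds on:

```python
import spacy
from spacy.matcher import Matcher

# Assumes a model with a lemmatizer is installed, e.g. en_core_web_sm.
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# The same kind of token pattern as in a patterns file; Matcher attribute
# names are conventionally written in uppercase.
matcher.add("SEED", [[{"LEMMA": "idiot"}]])

# A made-up test sentence; swap in a snippet from your own transcripts.
doc = nlp("What an idiot. Those idiots are at it again.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```

Running a few of your own transcript segments through this is a quick way to see whether a lemma pattern actually covers the surface forms you care about.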
Thanks for the response. How should I go about sharing a subset of the material I'm working with? Do you mean just put a sample in a Google Drive or something?
Regarding patterns, I'm looking forward to playing with the patterns file, but for now I'm trying to keep it simple. So I did start with the "training an insults classifier" video and found the note that led me to terms.to-patterns. The resulting file is what I'm working with.
I don't need the entire dataset, just enough to reproduce the behaviour locally; a few rows of a jsonl file could be sufficient! If you're unable to send me that data because of privacy/sensitivity issues, it's also fine to work with another set. The main thing is that I'd like to confirm the behaviour. Could you also share some examples of your patterns? It may be that your patterns file has a typo in it.
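In case it's useful, here's a rough sketch for sanity-checking a patterns file; the patterns.jsonl filename is just a placeholder for your own file. It only checks that each line is valid JSON and has the "label" and "pattern" keys that Prodigy expects:

```python
import json
from pathlib import Path

# Placeholder path; point it at your own patterns file.
patterns_path = Path("patterns.jsonl")

for i, line in enumerate(patterns_path.read_text(encoding="utf8").splitlines(), 1):
    if not line.strip():
        continue
    try:
        entry = json.loads(line)
    except json.JSONDecodeError as err:
        print(f"line {i}: not valid JSON ({err})")
        continue
    if "label" not in entry or "pattern" not in entry:
        print(f"line {i}: missing 'label' or 'pattern' key")
    elif not isinstance(entry["pattern"], (str, list)):
        print(f"line {i}: 'pattern' should be a string or a list of token dicts")
```

If that prints nothing, the file is at least structurally sound, and we can focus on the prioritisation behaviour itself.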
Ok. I've put the first several lines from the patterns file and the dataset in a Pastebin here. The project searches YouTube transcripts for denigrating language about immigrants, and the dataset is a collection of those transcripts in overlapping segments. Given the subject matter, I should warn that there might be offensive content here. In these first several segments I don't think there is, in part because the transcript is poor and barely intelligible. That probably shouldn't matter for the question at hand, though? Thanks.
I'm so sorry! Somehow this thread fell off my radar. My bad!
I just tried the Pastebin link and it seems to be password protected. Could you check?
Reading your original question again, I wonder if there's another route worth considering. I'm assuming you're interested in detecting emotions in the text? If so, you might be able to use a pre-trained sentiment model to filter out a subset that could be of interest. If, for example, you're interested in detecting "anger", you may be able to first filter the examples with negative sentiment scores and use those as a starting point.
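Just to sketch that idea (purely illustrative: the transformers pipeline, its default model, and the example segments are all assumptions on my part, not something tied to Prodigy), you could pre-filter the segments to a smaller candidate set before annotating:

```python
from transformers import pipeline

# Assumes the transformers library is installed; the default sentiment model
# is just an example, not a recommendation, and label names depend on the model.
sentiment = pipeline("sentiment-analysis")

# Placeholder segments; in practice these would come from your transcript jsonl.
segments = [
    "example transcript segment one",
    "example transcript segment two",
]

# Keep only segments scored as negative, to use as a smaller candidate set.
candidates = [
    text
    for text, result in zip(segments, sentiment(segments))
    if result["label"] == "NEGATIVE"
]
print(candidates)
```

You'd still annotate the candidates with Prodigy as before, but starting from a filtered pool should surface relevant segments much more often than 1 in 800.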