My goal is to classify the entire text, not just specific tokens or keyphrases, which doesn't seem to be what Prodigy is doing here (the highlighted words suggest that perhaps I'm labeling specific words? Or something?).
Additionally, for texts that contain many of my seed terms, this means I end up annotating the same example multiple times.
If I exclude the patterns argument, my interface looks like yours in the video, but it seems like it would be a shame to completely skip bootstrapping with "seeds". As an opinionated aside: I like "seeds" much more than "patterns" for textcat, since "patterns" suggests categorizing specific tokens, spans, or entities, while "seeds" more clearly refers to vectors used to classify entire docs.
Hi! The highlighted text is the matched pattern that was used to select that example. (When I recorded my video, Prodigy didn't yet highlight the pattern that was actually matched, which people found a bit confusing. The recipe now does that to make it more transparent that the example was selected based on a specific match in the text.) You're still annotating the text plus label, and when you train your model, you'll be training on the text plus label, too. The highlight is just there so you know what the suggestion is based on.
That's interesting, because I always feel like writing abstract patterns is actually much more useful for text classification than it is for NER. For entities, you often have a pretty specific idea of what the spans should be, so the main token attributes you'd probably want to use are the token text and maybe the lowercase form (to make them case-insensitive). But if you're assigning labels to the whole text, the "trigger words" or phrases are often much more vague and can be stuff like "word with the lemma sell" or "this noun with optional adjective X, Y or Z". That's where token-based patterns make a lot more sense than just more or less exact string matches. But I guess it really depends on the use case.
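To make that concrete, here's a rough sketch of what such abstract patterns could look like in a patterns JSONL file (the label and the terms are made up for illustration): the first line matches any inflection of "sell" via its lemma, and the second matches the noun "car" with an optional adjective from a small set.

```jsonl
{"label": "SALES", "pattern": [{"lemma": "sell"}]}
{"label": "SALES", "pattern": [{"lower": {"in": ["cheap", "used", "new"]}, "op": "?"}, {"lower": "car"}]}
```

Since these token-based patterns are handled by spaCy's `Matcher` under the hood, attributes like `lemma` and operators like `"op": "?"` should behave the same way they do in regular spaCy match patterns.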