My goal is to classify the entire text, not just specific tokens or keyphrases, which doesn't seem to be what Prodigy is doing here (the highlighted words suggest that perhaps I'm labeling specific words? Or something?).
Additionally, for texts that contain many of my seed terms, this means I end up annotating the same example multiple times.
If I exclude the patterns argument, my interface looks like yours in the video, but it seems like it would be a shame to completely skip bootstrapping with "seeds". As an opinionated aside: I like "seeds" much more than "patterns" for textcat, since "patterns" suggests categorizing specific tokens, spans, or entities, while "seeds" more clearly refers to vectors used to classify entire docs.
Hi! The highlighted text is the matched pattern that was used to select that example. (When I recorded my video, Prodigy didn't yet highlight the pattern that was actually matched, which people found a bit confusing. The recipe now does that to make it more transparent that the example was selected based on a specific match in the text.) You're still annotating the text plus label, and when you train your model, you'll be training on the text plus label, too. The highlight is just there so you know what the suggestion is based on.
That's interesting, because I always feel like writing abstract patterns is actually much more useful for text classification than it is for NER. For entities, you often have a pretty specific idea of what the spans should be, so the main token attributes you'd probably want to use are the token text and maybe the lowercase form (to make them case-insensitive). But if you're assigning labels to the whole text, the "trigger words" or phrases are often much more vague and can be stuff like "word with the lemma sell" or "this noun with optional adjective X, Y or Z". That's where token-based patterns make a lot more sense than just more or less exact string matches. But I guess it really depends on the use case.
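To make that concrete, here's a rough sketch of what such abstract patterns could look like in a patterns JSONL file (the label and the terms are made up for illustration): the first line matches any inflection of "sell" via its lemma, and the second matches the noun "car" with an optional adjective from a small set.

```jsonl
{"label": "SALES", "pattern": [{"lemma": "sell"}]}
{"label": "SALES", "pattern": [{"lower": {"in": ["cheap", "used", "new"]}, "op": "?"}, {"lower": "car"}]}
```

Since these token-based patterns are handled by spaCy's `Matcher` under the hood, attributes like `lemma` and operators like `"op": "?"` should behave the same way they do in regular spaCy match patterns.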