Text Classification, Bootstrapping Error

ines · June 7, 2018, 4:00pm

Which version of Prodigy are you using? The error message sounds like you’re still on an older version that only supports string seed terms and not yet match patterns like the ones the NER recipes use.

In the latest version, textcat.teach lets you provide a patterns file and describe the individual tokens you’re looking for. This means you can also handle multi-word terms, case sensitivity vs. insensitivity and even use other token attributes like lemmas or boolean flags. Here are some examples:

{"label": "POLITICS", "pattern": [{"lower": "white"}, {"lower": "house"}]}
{"label": "SALE", "pattern": [{"lemma": "buy"}]}

The first pattern would present texts containing the tokens “white house”, “White House” etc. (matched only on the lowercase form) for the label POLITICS. The second pattern would find texts containing tokens with the lemma "buy" (e.g. “bought”, “buying”) for the label SALE.
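To make the lowercase matching a bit more concrete, here is a rough sketch in plain Python. It is not how Prodigy or spaCy implement matching: the helper names (`load_patterns`, `matching_labels`) are made up for illustration, the tokenizer is a naive whitespace split, and only the "lower" attribute is handled, so the lemma pattern is skipped.

```python
import json

# The two example entries from above, one JSON object per line (JSONL).
PATTERNS_JSONL = """\
{"label": "POLITICS", "pattern": [{"lower": "white"}, {"lower": "house"}]}
{"label": "SALE", "pattern": [{"lemma": "buy"}]}
"""


def load_patterns(jsonl_text):
    """Parse JSONL text into a list of pattern entries (hypothetical helper)."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def matching_labels(text, patterns):
    """Return the labels whose "lower"-only patterns occur in the text.

    Assumption: whitespace tokenization plus punctuation stripping stands in
    for a real tokenizer. Entries using other attributes (e.g. "lemma") are
    skipped, because they need a real language pipeline.
    """
    tokens = [tok.strip(".,!?;:\"'()").lower() for tok in text.split()]
    labels = set()
    for entry in patterns:
        specs = entry["pattern"]
        if any(set(spec) != {"lower"} for spec in specs):
            continue  # not supported in this sketch
        wanted = [spec["lower"] for spec in specs]
        n = len(wanted)
        # Slide a window of n tokens over the text and compare lowercase forms.
        if any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1)):
            labels.add(entry["label"])
    return labels


patterns = load_patterns(PATTERNS_JSONL)
print(matching_labels("She visited the White House.", patterns))
print(matching_labels("They painted the house white.", patterns))
```

The first call matches because the two tokens appear next to each other, in order and regardless of case. The second doesn't, because the pattern describes a sequence of tokens, not a bag of words.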

The previous approach of only matching exact strings was slightly limiting, which is why we’ve replaced it with the more flexible patterns solution.

You can also use the terms.teach recipe to create terminology lists from word vectors and then convert those to match patterns using terms.to-patterns. This might also help with your other question: you can start off with a few seed terms and use word vectors to find other, similar terms you maybe didn’t think of (even misspellings – I’m sometimes surprised how common some of them are, and it’s really difficult to guess how people may misspell some word).
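If it helps to see what converting a terminology list into match patterns roughly amounts to, here is a small sketch in plain Python. It only illustrates the idea and is not the terms.to-patterns implementation: the function name `terms_to_patterns` is hypothetical, and matching each whitespace-separated word on its lowercase form is an assumption, so the output of the real recipe may look different.

```python
import json


def terms_to_patterns(terms, label):
    """Turn a list of terms into JSONL pattern lines (hypothetical helper).

    Assumption: each whitespace-separated word becomes one token description
    that matches on the lowercase form.
    """
    lines = []
    for term in terms:
        pattern = [{"lower": word.lower()} for word in term.split()]
        lines.append(json.dumps({"label": label, "pattern": pattern}))
    return "\n".join(lines)


# Example terms, including a misspelling of the kind word vectors can surface.
print(terms_to_patterns(["White House", "senate", "goverment"], "POLITICS"))
```

Each output line has the same shape as the pattern examples above, so the result could be saved as a patterns file.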
