Text Classification, Bootstrapping Error

Which version of Prodigy are you using? The error message sounds like you’re still on an older version that only supports string seed terms and not yet match patterns of the kind the NER recipes use.

In the latest version, textcat.teach lets you provide a patterns file and describe the individual tokens you’re looking for. This means you can also handle multi-word phrases, case sensitivity vs. insensitivity and even use other token attributes like lemmas or boolean flags. Here are some examples:

{"label": "POLITICS", "pattern": [{"lower": "white"}, {"lower": "house"}]}
{"label": "SALE", "pattern": [{"lemma": "buy"}]}

The first pattern would present texts containing the tokens “white house”, “White House” etc. (matched only on the lowercase form) for the label POLITICS. The second pattern would find texts containing tokens with the lemma “buy” (e.g. “bought”, “buying”) for the label SALE.
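To make the matching logic a bit more concrete, here is a rough sketch in plain Python. It is only an illustration and not how Prodigy does it internally: the real matching is handled by spaCy, with proper tokenization and lemmatization. The names text_matches, token_matches and TOY_LEMMAS are made up for this example, and the lemma table is a tiny hand-written stand-in.

```python
# Toy illustration only: this is NOT Prodigy's or spaCy's implementation.
# TOY_LEMMAS is a hypothetical, hand-written stand-in for a real lemmatizer.
TOY_LEMMAS = {"bought": "buy", "buying": "buy", "buys": "buy"}


def token_matches(token, spec):
    """Check one token against one pattern dict such as {"lower": "white"}."""
    for attr, value in spec.items():
        if attr == "lower":
            actual = token.lower()
        elif attr == "lemma":
            actual = TOY_LEMMAS.get(token.lower(), token.lower())
        else:
            raise ValueError(f"unsupported attribute in this sketch: {attr}")
        if actual != value:
            return False
    return True


def text_matches(text, pattern):
    """Return True if any run of consecutive tokens satisfies the pattern."""
    tokens = text.split()  # naive whitespace tokenization, an assumption
    n = len(pattern)
    return any(
        all(token_matches(tok, spec) for tok, spec in zip(tokens[i:i + n], pattern))
        for i in range(len(tokens) - n + 1)
    )


politics = [{"lower": "white"}, {"lower": "house"}]
sale = [{"lemma": "buy"}]
```

With these definitions, text_matches("The White House issued a statement", politics) is true regardless of capitalization, and text_matches("She bought a car", sale) is true because the toy table maps "bought" to "buy".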

The previous approach of only matching exact strings was slightly limiting, which is why we’ve replaced it with the more flexible patterns solution.

You can also use the terms.teach recipe to create terminology lists from word vectors and then convert those to match patterns using terms.to-patterns. This might also help with your other question: you can start off with a few seed terms and use word vectors to find other, similar terms you might not have thought of. This includes misspellings: I’m sometimes surprised how common some of them are, and it’s really difficult to guess how people may misspell a word.
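If you already have a plain list of seed terms and just want to see what the conversion to patterns roughly looks like, here is a minimal sketch. It assumes one case-insensitive "lower" token per whitespace-separated word, which may differ from what terms.to-patterns actually writes out. The helper names terms_to_patterns and to_jsonl are made up for this example.

```python
import json


def terms_to_patterns(terms, label):
    """Turn seed terms into token patterns, one {"lower": ...} dict per word.

    Assumption: whitespace splitting is good enough for the seed terms.
    """
    patterns = []
    for term in terms:
        tokens = [{"lower": word.lower()} for word in term.split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


def to_jsonl(patterns):
    """Serialize patterns as newline-delimited JSON, one pattern per line."""
    return "\n".join(json.dumps(p) for p in patterns)


seed_patterns = terms_to_patterns(["White House", "senate"], "POLITICS")
print(to_jsonl(seed_patterns))
```

The printed lines can be saved to a file and used as a patterns file, in the same one-JSON-object-per-line format as the examples above.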