Text Classification, Bootstrapping Error

paige · June 7, 2018, 2:06pm

Hi again! After creating a new dataset, I tried bootstrapping it but ran into the following error:

ValueError: Tried to find at least 10 examples containing the 87 seed terms provided, but only found 0 matches. Gave up after searching 10000 examples from the stream.

I tried different methods to yield positive results such as removing punctuation from the terms list and removing terms that had spaces. I even manually searched for some of the terms in the dataset I’m annotating and there are indeed questions that contain the seeds/keywords. Any idea what could be wrong here?

ines · June 7, 2018, 4:00pm

Which version of Prodigy are you using? The error message sounds like you’re still on an older version in which textcat.teach only supports string seed terms and not yet the match patterns that the NER recipes use.

In the latest version, textcat.teach lets you provide a patterns file and describe the individual tokens you’re looking for. This means you can also handle multi-word tokens, case sensitivity vs. insensitivity and even use other token attributes like lemmas or boolean flags. Here are some examples:

{"label": "POLITICS", "pattern": [{"lower": "white"}, {"lower": "house"}]}
{"label": "SALE", "pattern": [{"lemma": "buy"}]}

The first pattern would present texts containing the tokens “white house”, “White House” etc. (matched only on the lowercase form) for the label POLITICS. The second pattern would find texts containing tokens with the lemma “buy” (e.g. “bought”, “buying”) for the label SALE.
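
If you already have a plain list of seed terms, one way to reuse it is to convert each term into a token pattern of the kind shown above. The following is only a rough sketch, not part of Prodigy: the function names `seeds_to_patterns` and `to_jsonl` are made up for illustration, and it assumes that splitting a term on whitespace gives the same tokens the tokenizer would produce, which won't hold for terms containing punctuation or hyphens.

```python
import json


def seeds_to_patterns(seed_terms, label):
    """Turn plain seed strings into token-based match patterns.

    Hypothetical helper: whitespace splitting is only an approximation
    of real tokenization, so check terms with punctuation by hand.
    """
    patterns = []
    for term in seed_terms:
        tokens = term.split()
        if not tokens:
            # Skip empty or whitespace-only entries.
            continue
        # One {"lower": ...} dict per token gives case-insensitive matching.
        token_specs = [{"lower": tok.lower()} for tok in tokens]
        patterns.append({"label": label, "pattern": token_specs})
    return patterns


def to_jsonl(patterns):
    """Serialize the patterns as one JSON object per line."""
    return "\n".join(json.dumps(p) for p in patterns)


patterns = seeds_to_patterns(["White House", "senate"], "POLITICS")
print(to_jsonl(patterns))
```

Saving the printed lines to a .jsonl file should give you something in the same shape as the examples above, with multi-word terms split into one entry per token.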

The previous approach of only matching exact strings was slightly limiting, which is why we’ve replaced it with the more flexible patterns solution.

You can also use the terms.teach recipe to create terminology lists from word vectors and then convert those to match patterns using terms.to-patterns. This might also help with your other question: you can start off with a few seed terms and use word vectors to find other, similar terms you maybe didn’t think of (even misspellings – I’m sometimes surprised how common some of them are, and it’s really difficult to guess how people may misspell some word).
