Hi again! After creating a new dataset, I tried bootstrapping it but ran into the following error:
ValueError: Tried to find at least 10 examples containing the 87 seed terms provided, but only found 0 matches. Gave up after searching 10000 examples from the stream.
I tried different methods to yield positive results such as removing punctuation from the terms list and removing terms that had spaces. I even manually searched for some of the terms in the dataset I’m annotating and there are indeed questions that contain the seeds/keywords. Any idea what could be wrong here?
Which version of Prodigy are you using? The error message sounds like you’re still on an older version that only supports plain string seed terms and not yet match patterns like the ones the NER recipes use.
In the latest version, textcat.teach lets you provide a patterns file and describe the individual tokens you’re looking for. This means you can also handle multi-word terms, case sensitivity vs. insensitivity and even use other token attributes like lemmas or boolean flags. Here are some examples:
The first pattern would present texts containing the tokens “white house”, “White House” etc. (matched only on the lowercase form) for the label POLITICS. The second pattern would find texts containing tokens with the lemma "buy" (e.g. “bought”, “buying”) for the label SALE.
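As a rough sketch of what such a patterns file could contain, reconstructed from the description above: each line is one JSON object with a label and a list of token descriptions. The attribute keys "lower" and "lemma" are my assumption of how the lowercase form and the lemma are spelled in the file, and `to_jsonl` is just a hypothetical helper for this example, so check both against the docs for your version.

```python
import json

# Two patterns matching the description above. The attribute keys
# ("lower", "lemma") are assumed, not copied from the original post.
patterns = [
    # Two consecutive tokens whose lowercase forms are "white" and "house"
    {"label": "POLITICS", "pattern": [{"lower": "white"}, {"lower": "house"}]},
    # A single token whose lemma is "buy"
    {"label": "SALE", "pattern": [{"lemma": "buy"}]},
]


def to_jsonl(entries):
    """Serialize a list of dicts as newline-delimited JSON (one object per line)."""
    return "\n".join(json.dumps(entry) for entry in entries)


patterns_jsonl = to_jsonl(patterns)
print(patterns_jsonl)
```

The printed lines can be saved as a .jsonl file and supplied as the patterns file.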
The previous approach of only matching exact strings was slightly limiting, which is why we’ve replaced it with the more flexible patterns solution.
You can also use the terms.teach recipe to create terminology lists from word vectors and then convert those to match patterns using terms.to-patterns. This might also help with your other question: you can start off with a few seed terms and use word vectors to find other, similar terms you might not have thought of (even misspellings – I’m sometimes surprised how common some of them are, and it’s really difficult to guess how people may misspell a given word).
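If you already have a list of plain seed terms and want to see what they would look like as patterns, here is a minimal sketch of the idea. This is not how terms.to-patterns is implemented: `terms_to_patterns` is a hypothetical helper, the "lower" key is assumed, and splitting on whitespace is a simplification, since the real tokenizer also splits off punctuation.

```python
import json


def terms_to_patterns(terms, label):
    """Turn plain seed terms into case-insensitive token patterns.

    Each whitespace-separated word becomes one token description, so
    multi-word terms like "White House" turn into a two-token pattern.
    """
    patterns = []
    for term in terms:
        tokens = term.split()
        if not tokens:
            # Skip empty or whitespace-only terms
            continue
        patterns.append(
            {"label": label, "pattern": [{"lower": tok.lower()} for tok in tokens]}
        )
    return patterns


# Example seed terms, made up for illustration
seed_terms = ["White House", "senate", "  "]
seed_patterns = terms_to_patterns(seed_terms, "POLITICS")
for pattern in seed_patterns:
    print(json.dumps(pattern))
```

Terms containing punctuation (e.g. hyphens or apostrophes) would likely need to be split the same way the tokenizer splits them, otherwise the pattern will never match.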