Ah, sorry about that! If you're using the latest version of Prodigy, the --seeds argument has been replaced with a --patterns argument – see here for the available recipe arguments, and here for an example of the patterns. You'll also find more info about this in your PRODIGY_README.html. We should probably add a note about this to our text classification video!
The change makes textcat.teach consistent with ner.teach, and gives you more flexibility for the seed terms. Instead of just terms, you can now also specify token patterns, similar to the patterns for spaCy's rule-based Matcher. For example:
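A minimal version of such a pattern entry, with a placeholder label (use whatever label fits your task), would be:

```json
{"label": "RELEVANT", "pattern": [{"lower": "able"}]}
```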
This will match all tokens whose lowercase form equals "able". You can also write more complex rules that take into account part-of-speech tags or dependency labels – which sounds like it might help a lot for your use case.
You then save the patterns as a .jsonl file and load them in via the --patterns argument:
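As a sketch, here's one way to write that file with Python's standard library – the label and file name are just placeholders:

```python
import json

# Each line of a patterns .jsonl file is one pattern entry.
# "INSULT" is a placeholder label here -- use your own.
patterns = [
    {"label": "INSULT", "pattern": [{"lower": "able"}]},
    {"label": "INSULT", "pattern": [{"lemma": "acquire"}]},
]

with open("patterns.jsonl", "w", encoding="utf8") as f:
    for entry in patterns:
        f.write(json.dumps(entry) + "\n")
```

You can then pass the file in via `--patterns patterns.jsonl`.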
textcat.teach names en_core_web_sm /Users/student/Downloads/out.txt --label NAME --seeds names_seed
As you’ve said, the --seeds argument has been renamed, but if I just replace it with --patterns, the command won’t run, since it’s looking for a JSONL file. How can I use the terms dataset within Prodigy?
The patterns are much more flexible, because they allow you to describe multi-word phrases and token attributes, similar to the patterns for spaCy's rule-based matcher.
For example, let's say you're classifying news articles and you want to train a model to detect whether a text is about a company sale. Good indicators of this might be words like "acquired", "acquires", "acquiring" etc. Instead of using all possible forms as seed terms, you could write a pattern that matches all words that have the lemma "acquire":
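As a pattern entry, with a placeholder label:

```json
{"label": "COMPANY_SALE", "pattern": [{"lemma": "acquire"}]}
```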
You can also incorporate other token attributes (lowercase form, shape, alphanumeric vs. numbers), as well as statistical predictions like the part-of-speech tags, dependency labels and entity types. The following pattern will match a token with the lemma "be" ("am", "is", "was" etc.), followed by a determiner (an article, e.g. "the", "a"), followed by a noun:
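In pattern-file form, that could look like this (the label is again a placeholder):

```json
{"label": "COMPANY_SALE", "pattern": [{"lemma": "be"}, {"pos": "DET"}, {"pos": "NOUN"}]}
```

And to answer the question about the terms dataset: you shouldn't have to write the file by hand. If you've collected your seed terms with terms.teach, the terms.to-patterns recipe should let you convert the dataset to a patterns file – for example, something like `prodigy terms.to-patterns names_seed --label NAME > names_patterns.jsonl`. Check your PRODIGY_README.html for the exact arguments in your version.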
My colleague and I are also new to Prodi.gy and following along with the Insults Classifier demo. Your explanation here to replace seeds with patterns helped us keep moving through the demo. Question: in the demo you mention that the “via_seeds” parameter is sometimes on while annotating. Now that we’re using the patterns paradigm instead, is there a way to tell that the patterns/seeds are being used to select stuff to annotate? I ask because it feels like a lot of the annotations I’m doing are not at all related to the original seed terms I created. Thank you!
Yes, it should say “Pattern” and the pattern ID (line number of the pattern in your patterns file) in the bottom right corner now. If you’re only seeing the score and no patterns reference, the suggestion comes from the model.
If you’re not seeing any pattern matches, the most common explanation is that the patterns don’t apply to your data – for example, because the tokens they describe don’t line up with the actual tokenization. A quick way to test your patterns is to use a recipe like ner.match, which will just stream in the matches with no model suggestions. If nothing comes up, the patterns don’t match. You might also find our matcher demo useful (see link above), which lets you build and test match patterns interactively.
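For example, with placeholder names for the dataset and input file, the check could look something like this:

```
prodigy ner.match test_matches en_core_web_sm your_data.jsonl --patterns patterns.jsonl
```

If this streams in no examples at all, the problem is in the patterns themselves rather than in the recipe you were running.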