Hi! I'm not 100% sure I understand the question correctly, because the general suggestion here has always been the same. One idea for bootstrapping a text classifier is to use match patterns to pre-select examples that contain certain words, and mix those in with the model's suggestions. This can often help move the model in the right direction.
The patterns don't have to cover everything because you're still teaching the model to generalise and annotating other examples without matches. But they can help select more relevant examples, especially if you have lots of raw data and very imbalanced classes. It might be less useful for smaller datasets with more balanced classes.
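To make the idea concrete, here's a minimal stdlib-only sketch of that pre-selection step. It uses plain substring matching, which is a simplification: real patterns (e.g. spaCy's `Matcher` or Prodigy's patterns files) match on token attributes, not raw substrings. The function and example texts are made up for illustration.

```python
import random
from itertools import zip_longest

def mix_examples(texts, seed_terms, seed=0):
    """Interleave texts that match any seed term with a shuffled
    sample of the remaining texts, so the annotation stream contains
    both pattern matches and other examples.

    Note: substring matching here is a stand-in for proper
    token-based patterns; it also assumes the texts are unique.
    """
    matched = [t for t in texts if any(term in t.lower() for term in seed_terms)]
    rest = [t for t in texts if t not in matched]
    random.Random(seed).shuffle(rest)
    mixed = []
    for m, r in zip_longest(matched, rest):
        if m is not None:
            mixed.append(m)
        if r is not None:
            mixed.append(r)
    return mixed

texts = [
    "I love this laptop",
    "The weather is nice today",
    "New laptop review is up",
    "Stocks fell sharply",
]
stream = mix_examples(texts, ["laptop"])
```

With very imbalanced classes, this kind of interleaving keeps the rare class visible in the stream without hiding the unmatched examples you still need to annotate.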
There are different ways to create patterns. One is to use word vectors to suggest words similar to some seed terms you come up with (though this will only give you single words). There are different approaches to training vectors for multi-word expressions, for example what we did in sense2vec: https://github.com/explosion/sense2vec (contextually-keyed word vectors). The role of the vectors here is only to help you find similar words. You can also use other resources, like existing terminology lists.
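Just to illustrate what "find similar words" means here, this is a small self-contained sketch of ranking vocabulary entries by cosine similarity to a seed term. The tiny 4-dimensional vectors are made up for the example; in practice you'd query pretrained vectors (e.g. spaCy's `en_core_web_md` or a sense2vec model) instead.

```python
import math

# Toy vectors for illustration only; real word vectors have
# hundreds of dimensions and come from a pretrained model.
VECTORS = {
    "laptop":   [0.9, 0.8, 0.1, 0.0],
    "notebook": [0.85, 0.75, 0.2, 0.05],
    "keyboard": [0.7, 0.9, 0.15, 0.1],
    "banana":   [0.0, 0.1, 0.9, 0.8],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def most_similar(term, n=2):
    """Rank the other vocabulary entries by similarity to a seed term."""
    query = VECTORS[term]
    scored = [(w, cosine(query, v)) for w, v in VECTORS.items() if w != term]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [w for w, _ in scored[:n]]
```

Starting from a seed like "laptop", the top-ranked neighbours become candidate terms for your patterns file, which you'd then review by hand before using.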
The use of word vectors to bootstrap terminology lists shouldn't be confused with using vectors as embeddings to initialise a model. In that case, the idea is just to start with more meaningful token embeddings and boost your accuracy; transformer embeddings like BERT etc. fulfil a similar role.