The terms.train-vectors recipe takes a data source (ideally, lots of text) and trains vectors on that source, reflecting how the words are used in context. It doesn’t really care what those words are – it will simply assign the meaning representations.
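The text source is commonly a newline-delimited JSON file with one object per line and a "text" key. As a minimal sketch (the filename and example texts here are just placeholders), such a file can be produced with Python's standard library:

```python
import json

# Hypothetical raw texts – in practice you'd collect lots of in-domain text.
texts = [
    "Coca Cola reported strong quarterly earnings.",
    "Nike unveiled a new line of running shoes.",
]

# Write one JSON object per line with a "text" key – a format
# Prodigy's loaders commonly accept for plain-text sources.
with open("your_data.jsonl", "w", encoding="utf-8") as f:
    for text in texts:
        f.write(json.dumps({"text": text}) + "\n")
```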
If you’re interested in extracting brand names later on, you probably want to set the
--merge-nps flag when you train the vectors. This will merge noun phrases into one token, so you’ll end up with more meaningful vectors for names that consist of more than one token. For example, you’ll want a vector for “Coca Cola”, not two vectors for “Coca” and “Cola”.
prodigy terms.train-vectors /path/to/brand-model your_data.jsonl --spacy-model en_core_web_sm --merge-nps
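To see what the merging does conceptually, here's a rough pure-Python sketch. This is not Prodigy's actual implementation – Prodigy detects noun phrases with spaCy's parser rather than matching against a fixed list – but it illustrates the effect: multi-word names are collapsed into single tokens before vectors are trained, so “Coca Cola” ends up with one vector instead of two.

```python
def merge_phrases(tokens, phrases):
    """Greedily merge known multi-word phrases into single tokens.

    A toy stand-in for --merge-nps: the real recipe finds noun
    phrases automatically via spaCy's syntactic parse.
    """
    merged = []
    i = 0
    while i < len(tokens):
        for phrase in phrases:
            n = len(phrase)
            if tokens[i:i + n] == phrase:
                merged.append(" ".join(phrase))  # collapse into one token
                i += n
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["I", "drank", "a", "Coca", "Cola", "yesterday"]
print(merge_phrases(tokens, [["Coca", "Cola"]]))
# ['I', 'drank', 'a', 'Coca Cola', 'yesterday']
```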
You can then run
terms.teach using your trained vectors and seed terms, for example:
prodigy terms.teach brand_names /path/to/brand-model --seeds "Coca Cola, Nike, McDonalds"
Prodigy will look at the model’s vocabulary and try to find other terms that are similar to your seed terms “Coca Cola, Nike, McDonalds”. As you click through the examples, accepting and rejecting them, the target vector is updated, so Prodigy can keep suggesting terms similar to the seeds and the ones you’ve accepted (and dissimilar to the ones you’ve rejected). If your vectors were trained on enough representative text, you’ll quickly be able to find other brand names, i.e. vocabulary entries whose representations are close to your target vector.
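The suggestion loop described above can be sketched as simple vector arithmetic: keep a running average of the accepted terms' vectors as the target, and rank the remaining vocabulary by cosine similarity to it. The tiny 2-D vectors below are made up for illustration – real vectors come from the trained model, and this is not Prodigy's actual internals:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Made-up toy vectors; in reality these come from the trained model.
vocab = {
    "Coca Cola": [0.9, 0.1],
    "Pepsi":     [0.85, 0.2],
    "running":   [0.1, 0.95],
}

accepted = ["Coca Cola"]

# Target vector = average of the accepted terms' vectors.
target = [sum(vals) / len(accepted)
          for vals in zip(*(vocab[t] for t in accepted))]

# Rank the unseen vocabulary entries by similarity to the target.
candidates = sorted(
    (t for t in vocab if t not in accepted),
    key=lambda t: cosine(vocab[t], target),
    reverse=True,
)
print(candidates[0])  # "Pepsi" – closest to the accepted brand vector
```

Each accept would add another vector to the average (and a reject could be tracked to penalise similar entries), which is why the suggestions keep improving as you annotate.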