Hi! I have a long list of seeds that I tried to use with sense2vec.teach. However, many of my terms didn't come up (even in the large reddit dataset). The current approach bonks at the first term not found. So I remove it from my list, then run again. Rinse & repeat until I either get tired and just take what I've got or until I make it to the end of my term list.
It would be very handy if instead of stopping at the first term not found, all terms were checked and then reported on. For example, if I use --seeds "A, B, C, D, E, F" and B-E aren't found, instead of me running 5 times, with a message in 4 of those about a single seed, I could run just twice. The first time, I'd be told ✘ Can't find seed terms: 'B', 'C', 'D', 'E'.
Thanks, that's a good point! I think the intention here was to exit as early as possible, but I can see how this is really inconvenient in cases like this. Looking at the code again, I even wonder if we should make this a warning instead and just skip all terms that are not in the vectors, and maybe only raise and exit if there are none left.
In the meantime (and in case others come across this thread later), here's a quick script to prune a longer list of seed terms:
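from sense2vec import Sense2Vec

seeds = ["A", "B", "C", "D", "E", "F"]
s2v = Sense2Vec().from_disk("/path/to/s2v")

# Keep only the seeds that have at least one sense in the vectors
pruned_seeds = []
for seed in seeds:
    # get_best_sense returns None if no matching key is found
    key = s2v.get_best_sense(seed)
    if key is not None:
        pruned_seeds.append(seed)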
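For anyone who wants the behaviour discussed above (check every seed, report all the missing ones at once, and only fail if nothing is left), here is a rough sketch. The helper names `split_seeds` and `prune_seeds` are made up for illustration and are not part of sense2vec or the recipe. The `lookup` argument is assumed to be any callable that returns None for an unknown term, which is how `s2v.get_best_sense` is used in the script above.

```python
import warnings
from typing import Callable, Iterable, List, Optional, Tuple


def split_seeds(
    seeds: Iterable[str], lookup: Callable[[str], Optional[str]]
) -> Tuple[List[str], List[str]]:
    """Check every seed and return (found, missing), preserving input order."""
    found: List[str] = []
    missing: List[str] = []
    for seed in seeds:
        # lookup is assumed to return None when the term is not in the vectors
        if lookup(seed) is not None:
            found.append(seed)
        else:
            missing.append(seed)
    return found, missing


def prune_seeds(
    seeds: Iterable[str], lookup: Callable[[str], Optional[str]]
) -> List[str]:
    """Warn once about all missing seeds; raise only if no seed is usable."""
    found, missing = split_seeds(seeds, lookup)
    if missing:
        listed = ", ".join(repr(s) for s in missing)
        if not found:
            raise ValueError(f"Can't find any seed terms: {listed}")
        warnings.warn(f"Skipping seed terms not in vectors: {listed}")
    return found
```

With real vectors this would be called as `prune_seeds(seeds, s2v.get_best_sense)`, so a single run lists every missing term instead of stopping at the first one.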
Thanks, Ines! I agree that a warning here as long as there’s at least one seed left is probably the way to go. I’m definitely going to take advantage of your workaround in the meantime.