Hi folks,
I’m trying to wrap my head around how Prodigy uses word vectors, and whether it’s worth it (or even advisable) to use custom vectors.
I’ve created my own model based on en_core_web_sm, but with the word vectors replaced by my own gensim vectors (see here). This seems to work fine for the terms.teach task, but based on what I’ve been reading on these boards, I’m becoming doubtful that it’s useful for other tasks (e.g. textcat.teach). Am I correct that replacing the word vectors alone is not sufficient, since my gensim model almost certainly preprocessed the data differently? I tried my model out on textcat.teach and got really strange results, even after several hundred annotations: most scores were very close to either 0.0 or 1.0, and there didn’t seem to be any logical reason for this.
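One quick way to test the preprocessing suspicion is to measure how many of the tokens the pipeline produces actually have a vector, with and without a normalization step such as lowercasing. The snippet below is only a minimal sketch: `vector_coverage` is a hypothetical helper, and the two small lists are made-up stand-ins for the output of the spaCy tokenizer and the keys of the gensim vectors. A big jump in coverage after normalization would suggest the vectors were trained on differently preprocessed text.

```python
from collections import Counter


def vector_coverage(tokens, vector_keys, normalize=None):
    """Return (fraction of tokens that have a vector, most common misses).

    `normalize` is an optional function applied to each token before lookup,
    e.g. str.lower, to mimic the preprocessing used when training the vectors.
    """
    keys = set(vector_keys)
    missing = Counter()
    hits = 0
    for tok in tokens:
        key = normalize(tok) if normalize else tok
        if key in keys:
            hits += 1
        else:
            missing[key] += 1
    coverage = hits / len(tokens) if tokens else 0.0
    return coverage, missing.most_common(5)


# Made-up stand-ins (assumptions): in practice `tokens` would come from the
# spaCy tokenizer and `vector_keys` from the gensim model's vocabulary.
vector_keys = ["aspirin", "dose", "mg", "patient", "the"]
tokens = ["The", "patient", "received", "Aspirin", "500", "mg"]

raw_cov, raw_missing = vector_coverage(tokens, vector_keys)
low_cov, low_missing = vector_coverage(tokens, vector_keys, normalize=str.lower)

print(f"coverage as tokenized: {raw_cov:.2f}, missing: {raw_missing}")
print(f"coverage lowercased:   {low_cov:.2f}, missing: {low_missing}")
```

This only checks vocabulary overlap. It says nothing about whether the vectors are any good for textcat.teach, so it is a sanity check rather than a diagnosis of the odd scores.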
Secondly, I’ve been following several recent threads about terms.train-vectors being broken in various ways. Is it even possible to train custom vectors with Prodigy at the moment?
The text I’m working with is very domain-specific, with lots of medical and scientific jargon, so I’m concerned that one of the pre-built spaCy models would be a bad fit. At this point, though, I’m not convinced I’d really be better off building my own.
I’m looking for advice here: should I build my own models with my own domain-specific word vectors? If so, what’s the least error-prone method for doing so? If I should use pre-built models instead, which of the 4 different en models should I use?