Questions on terms.teach

Hi!

I have been actively using the "terms.teach" recipe to generate more related terms from the vector space.

However, your recipe page notes that "this recipe will only iterate over the vocabulary entries and their word vectors, which are typically single tokens." Is this the case for Chinese terms as well?

For sense2vec - does it work on Chinese terms?

Also, I saw on the sense2vec recipe page that "We provide pretrained vectors on Reddit comments (2015 and 2019), as well as scripts to train your own." How am I able to train on my own?

Thanks! Sorry for the multiple questions :stuck_out_tongue:

Hi, any advice on this? :slight_smile:

Thanks!

Hi @jsnleong ,

Is this the case for Chinese terms as well?

That depends on which tokenization the spaCy pipeline provided to the recipe uses. spaCy's Chinese pipelines support three word segmentation options: character, jieba and pkuseg.
You can read more about Chinese language support in the spaCy docs.
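
For example, with spaCy v3 you can choose the segmenter when creating a blank Chinese pipeline (this follows the snippet in the spaCy docs; jieba and pkuseg require their respective packages to be installed):

```python
from spacy.lang.zh import Chinese

# Character segmentation (the default)
nlp = Chinese()

# Jieba word segmentation
cfg = {"nlp": {"tokenizer": {"segmenter": "jieba"}}}
nlp = Chinese.from_config(cfg)

# PKUSeg word segmentation, initialized with e.g. the "mixed" model
cfg = {"nlp": {"tokenizer": {"segmenter": "pkuseg"}}}
nlp = Chinese.from_config(cfg)
nlp.tokenizer.initialize(pkuseg_model="mixed")
```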

The latest release of zh_core_web_lg appears to use pkuseg segmentation, judging by the multi-character tokens it produces. You can also inspect the details of the pipeline by looking at its config.
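
For example:

```python
import spacy

nlp = spacy.load("zh_core_web_lg")
# The tokenizer block of the config records the segmenter, e.g. something like
# {'@tokenizers': 'spacy.zh.ChineseTokenizer', 'segmenter': 'pkuseg'}
print(nlp.config["nlp"]["tokenizer"])
```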

For sense2vec - does it work on Chinese terms?

Are you asking about the sense2vec vectors released by us or about the algorithm? In either case the answer is probably no (at least not out of the box).
The released vectors were trained on Reddit which is predominantly English. If you try querying Chinese terms in the interactive demo, you'll see there are no results.
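
You can also check this programmatically with the standalone sense2vec package (the path below is just a placeholder for wherever you unpacked the released vectors):

```python
from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")
# Keys are "phrase|SENSE" strings; Chinese phrases won't be present
# in the Reddit-trained vectors, so this should print False.
print("自然语言处理|NOUN" in s2v)
print(s2v.most_similar("natural_language_processing|NOUN", n=3))
```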

As for the algorithm: again, it was designed mostly for English and leverages syntactic and lexical phenomena typical of English, so I can't really say how feasible it is to adapt it to Chinese. We can't really offer guidance on that. I recommend getting to know the algorithm a bit better via our blog, the source paper and our implementation, so you can evaluate the feasibility of the project. This advice also applies to your last question:

" How am I able to train on my own?

The repository specifies the steps required to recreate the training process: GitHub - explosion/sense2vec: 🦆 Contextually-keyed word vectors
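
To give you a rough idea of what the preprocessing produces (this is a simplified sketch of the general approach, not the repo's actual scripts): the corpus is annotated with spaCy, noun phrases are merged into single tokens, and every token is written out as text|SENSE, so a standard word2vec/GloVe implementation can then be trained on the resulting lines. Shown here with the English pipeline:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def to_sense2vec_line(text: str) -> str:
    doc = nlp(text)
    # Merge noun phrases into single tokens
    with doc.retokenize() as retokenizer:
        for chunk in doc.noun_chunks:
            retokenizer.merge(chunk)
    # Emit "text|SENSE" per token, preferring the entity type if one is set
    return " ".join(
        f"{token.text.replace(' ', '_')}|{token.ent_type_ or token.pos_}"
        for token in doc
    )

print(to_sense2vec_line("Natural language processing is fun."))
# e.g. 'Natural_language_processing|NOUN is|AUX fun|ADJ .|PUNCT'
```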
Be mindful, though, that all the preprocessing that adds linguistic annotations would have to be adapted to Chinese. One of the linguistic annotations the sense2vec algorithm requires is an iterator over noun chunks, which the spaCy Chinese pipeline does not provide, so you would have to implement your own (see the sketch below).
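
In spaCy v3, such an iterator is a generator that yields (start, end, label) token offsets, registered via the language's syntax_iterators. Here's a hypothetical skeleton, modeled on spaCy's English implementation; the POS and dependency conditions are purely illustrative, and real Chinese noun phrase detection would need proper linguistic work:

```python
from spacy.lang.zh import Chinese
from spacy.symbols import NOUN, PROPN, PRON

def zh_noun_chunks(doclike):
    doc = doclike.doc
    if not doc.has_annotation("DEP"):
        raise ValueError("noun_chunks requires a dependency parse")
    np_label = doc.vocab.strings.add("NP")
    for word in doclike:
        # Illustrative condition only: nominal words in core argument roles
        if word.pos in (NOUN, PROPN, PRON) and word.dep_ in ("nsubj", "obj", "dobj"):
            yield word.left_edge.i, word.i + 1, np_label

# Register before creating/loading the pipeline so Doc.noun_chunks picks it up
Chinese.Defaults.syntax_iterators = {"noun_chunks": zh_noun_chunks}
```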
Another non-trivial aspect to keep in mind is the amount of data you'll need: to train meaningful vectors, a corpus of at least 1 billion words is recommended.