Input pattern file to terms.teach

The idea behind the sense2vec trick might carry over, but unfortunately the current pre-trained sense2vec model only supports English.

I'm about to propose another trick that might work, but I want to be careful not to over-promise anything since I don't speak Chinese. One thing you might do is write a spaCy script that fetches "chunks of tokens that might form a noun phrase". For English this is supported directly via Doc.noun_chunks, but I believe it isn't supported for Chinese. You might, however, be able to construct something similar by hand.

Here's how I might construct it for English using noun_chunks.

import spacy 

nlp = spacy.load("en_core_web_md")

doc = nlp("Pepperoni pizzas are an amazing Italian dish.")
for chunk in doc.noun_chunks:
    print(chunk)

# Pepperoni pizzas
# an amazing Italian dish

Here's another way of doing something similar without using the .noun_chunks property. First, let's render the dependency parse to see what we have to work with.

from spacy import displacy 

displacy.render(doc)

We could find chunks manually by looking for noun tokens that act as a "root". That is to say, we're looking for nouns that have children in the dependency graph.

Here's a little script that can do that.

for tok in doc:
    if tok.pos_ == "NOUN":
        children = list(tok.children)
        if children:
            # Slice from the leftmost to the rightmost token involved,
            # so the chunk comes out as one contiguous span.
            token_idx = [tok.i] + [t.i for t in children]
            print(doc[min(token_idx): max(token_idx) + 1])
# Pepperoni pizzas
# an amazing Italian dish

There are variants of this script you might consider (one is sketched below), but this is a way to fetch multi-token chunks from your corpus. And the same approach might also work for Chinese.
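One such variant, as a minimal sketch: spaCy's Token.left_edge and Token.right_edge point to the leftmost and rightmost tokens of a token's subtree, so you could grab each root noun's full subtree instead of just its direct children.

import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("Pepperoni pizzas are an amazing Italian dish.")

# Instead of combining a noun with its direct children, slice out
# the noun's entire subtree via its left and right edges.
for tok in doc:
    if tok.pos_ == "NOUN" and list(tok.children):
        print(doc[tok.left_edge.i : tok.right_edge.i + 1])

# Should print the same two chunks as before.

Because the subtree also includes grandchildren in the parse, this variant may give longer chunks on more complex sentences.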

Here's an example I made using Google Translate.

import spacy

nlp = spacy.load("zh_core_web_sm")
doc = nlp("意大利辣香肠比萨饼是一道很棒的意大利菜")

for tok in doc:
    if tok.pos_ == "NOUN":
        children = list(tok.children)
        if children:
            token_idx = [tok.i] + [t.i for t in children]
            print(doc[min(token_idx): max(token_idx) + 1])
# 意大利辣香肠
# 是一道很棒的意大利菜

From here, you might even be able to attach vectors from the spaCy pipeline to these phrases.

import spacy

nlp = spacy.load("zh_core_web_sm")
doc = nlp("意大利辣香肠比萨饼是一道很棒的意大利菜")

phrases = {}
for tok in doc:
    if tok.pos_ == "NOUN":
        children = list(tok.children)
        if children:
            token_idx = [tok.i] + [t.i for t in children]
            phrase = doc[min(token_idx): max(token_idx) + 1]
            # Span.vector averages the vectors of the tokens in the phrase.
            phrases[phrase.text] = phrase.vector

This will give a dictionary, phrases, that maps each phrase text to its vector (shown here truncated):

{
   '意大利辣香肠': array([-0.217975  , -1.4146296 ,  1.3613806 , -0.09676328, -0.1946054 , ...], dtype=float32),
   '是一道很棒的意大利菜': array([-0.19973822,  0.54612845,  0.09049363, -0.17543283,  0.31415954, ...], dtype=float32)
}

These vectors can then be used to find phrases that are similar to one another. One caveat worth knowing: the sm pipelines don't ship with static word vectors, so you may get more meaningful vectors out of zh_core_web_md. Finding the similar phrases would involve custom code, but might be worth a try.
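To give an idea of what that custom code might look like, here's a minimal sketch that re-uses the phrases dictionary from above and compares phrase pairs by cosine similarity. Note that cosine_similarity is a hypothetical helper written with numpy, not something spaCy provides.

import numpy as np

def cosine_similarity(v1, v2):
    # Cosine similarity between two vectors; guard against zero vectors.
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / norm) if norm else 0.0

# Compare every pair of phrases found in the corpus.
texts = list(phrases)
for i, t1 in enumerate(texts):
    for t2 in texts[i + 1:]:
        print(t1, t2, cosine_similarity(phrases[t1], phrases[t2]))

On a real corpus you'd collect phrases from many documents first, and you'd probably only keep the top matches per phrase.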

Again, I really want to stress that I can't judge whether this will work for Chinese, and a lot of the utility will depend on the spaCy pipeline. But I'm mentioning it because the exercise does seem worth a try. Could you let me know whether or not this direction works for you? I'm very much interested in hearing your reply!