The idea behind the sense2vec trick might work, but unfortunately the current pre-trained model only supports English.
I'm about to propose another trick that might work, but I want to be careful not to over-promise anything since I don't speak Chinese. One thing you might do is write a spaCy script that fetches "chunks of tokens that might form a noun". For English, this is supported directly via Doc.noun_chunks, but I believe this isn't supported for Chinese. You might, however, try to construct something similar by hand.
Here's how I might construct it for English using noun_chunks.
import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("Pepperoni pizzas are an amazing Italian dish.")
for chunk in doc.noun_chunks:
    print(chunk)
# Pepperoni pizzas
# an amazing Italian dish
Here's another way of doing something similar without using the .noun_chunks property.
from spacy import displacy

# Visualizes the dependency parse; use displacy.serve(doc) if you're not in a notebook.
displacy.render(doc)
We could find chunks manually by looking for noun tokens that act as a local "root". That is to say, we're looking for nouns that have children in the dependency graph.
Here's a little script that can do that.
for tok in doc:
    if tok.pos_ == "NOUN":
        children = list(tok.children)
        if children:
            token_idx = [tok.i] + [t.i for t in children]
            print(doc[min(token_idx): max(token_idx) + 1])
# Pepperoni pizzas
# an amazing Italian dish
There are variants of this script you might consider (one is sketched below), but this is a way to fetch multi-token chunks from your corpus. And it might also work for Chinese.
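For example, one variant (just a rough sketch, I haven't tested it beyond this sentence) is to take a noun's whole subtree instead of only its direct children, which can capture nested modifiers in longer sentences. For this example it should give the same chunks.

import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("Pepperoni pizzas are an amazing Italian dish.")

for tok in doc:
    if tok.pos_ == "NOUN":
        # tok.subtree yields the token together with all of its descendants.
        subtree = list(tok.subtree)
        if len(subtree) > 1:
            token_idx = [t.i for t in subtree]
            print(doc[min(token_idx): max(token_idx) + 1])

# Pepperoni pizzas
# an amazing Italian dish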
Here's an example for Chinese that I made using Google Translate.
import spacy
nlp = spacy.load("zh_core_web_sm")
# "Pepperoni pizzas are an amazing Italian dish.", translated via Google Translate
doc = nlp("意大利辣香肠比萨饼是一道很棒的意大利菜")

for tok in doc:
    if tok.pos_ == "NOUN":
        children = list(tok.children)
        if children:
            token_idx = [tok.i] + [t.i for t in children]
            print(doc[min(token_idx): max(token_idx) + 1])

# 意大利辣香肠 ("pepperoni")
# 是一道很棒的意大利菜 ("is a great Italian dish")
From here, you might even be able to construct phrases with vectors from the spaCy pipeline.
import spacy
nlp = spacy.load("zh_core_web_sm")
doc = nlp("意大利辣香肠比萨饼是一道很棒的意大利菜")
phrases = {}

for tok in doc:
    if tok.pos_ == "NOUN":
        children = list(tok.children)
        if children:
            token_idx = [tok.i] + [t.i for t in children]
            phrase = doc[min(token_idx): max(token_idx) + 1]
            # Only the first 5 dimensions are stored here, to keep the printout short.
            phrases[phrase.text] = phrase.vector[:5]
This will give a dictionary, phrases, that maps each phrase's text to a vector.
{
    '意大利辣香肠': array([-0.217975, -1.4146296, 1.3613806, -0.09676328, -0.1946054, ...], dtype=float32),
    '是一道很棒的意大利菜': array([-0.19973822, 0.54612845, 0.09049363, -0.17543283, 0.31415954, ...], dtype=float32)
}
These vectors can then be used to find phrases that are similar. This would involve custom code, but might be worth a try.
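Just to make that concrete, here's a minimal sketch of what such custom code could look like, assuming you store the full phrase.vector (rather than the truncated one above) in the phrases dictionary and compare every pair with cosine similarity. I haven't verified how meaningful these scores are for Chinese.

import numpy as np

def cosine_similarity(v1, v2):
    # Standard cosine similarity; higher means "more similar".
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Compare every pair of phrases collected above.
texts = list(phrases)
for i, t1 in enumerate(texts):
    for t2 in texts[i + 1:]:
        print(t1, t2, cosine_similarity(phrases[t1], phrases[t2]))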
Again, I really want to stress that I cannot judge whether this will work for Chinese, and a lot of the utility will depend on the spaCy pipeline. But I'm mentioning it because the exercise does seem worth a try. Could you let me know whether or not this direction works for you? I'm very much interested in hearing your reply!