Applying terms.teach for Chinese

Hi! For the recipe "terms.teach", does it matter whether the initial seed list is in traditional or simplified Chinese? Will the results differ?

Also, when I reject a term suggested by the model, does the model apply any "negative" scoring to that term, so that subsequent suggestions take those irrelevant terms into account?

Thanks!

Welcome back @jsnleong :slight_smile:

If you use a spaCy Chinese model (e.g. zh_core_web_lg), the results will differ: a simplified Chinese seed will most likely give you simplified Chinese suggestions, and a traditional Chinese seed will give you traditional ones. The overall suggestions should still belong to the same semantic space regardless of the variant.

This is because the simplified and traditional tokens are treated as separate tokens in training and, consequently, they are represented by separate word vectors.

To illustrate, you can run this small experiment:

import spacy

nlp = spacy.load("zh_core_web_lg")

simplified = nlp("书") # book 
traditional = nlp("書") # book

print(f"Simplified has vector: {simplified.has_vector}")
print(f"Traditional has vector: {traditional.has_vector}")
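# Compare the vectors element-wise (comparing only vector_norm would compare their lengths, not the vectors)
print(f"The vectors are the same: {(simplified.vector == traditional.vector).all()}")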

# Output
# Simplified has vector: True
# Traditional has vector: True
# The vectors are the same: False

Now, if you compared the output of terms.teach for these terms, you'd see that the exact suggestions are different, but the semantic space is the same for the simplified seed and the traditional seed:

Simplified result   Translation     Traditional result   Translation
本书                book            書裡                 book
书来                book comes      書籍                 books
此书                this book       本書                 book
书上                book            書當                 Shudang
书后                back of book    書中                 book

Regarding rejected terms: yes, the model takes them into account. It iteratively updates the negative_vector, and the similarity between a candidate term and that negative_vector gives the reject_score. The reject_score is then used to compute the final score for a term with this formula:

score = accept_score / (accept_score + reject_score + 0.2)
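
To get a feel for how the formula behaves, here is a minimal sketch. It is not Prodigy's actual code: the helper names (cosine, term_score), the use of cosine similarity, and the clipping of negative similarities to zero are assumptions made for illustration. Only the final formula comes from the explanation above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def term_score(candidate, accept_vector, reject_vector):
    """Score a candidate term against the accepted and rejected vectors.

    Assumption: similarities are clipped at zero so the score stays in [0, 1].
    """
    accept_score = max(cosine(candidate, accept_vector), 0.0)
    reject_score = max(cosine(candidate, reject_vector), 0.0)
    # Formula from the explanation above
    return accept_score / (accept_score + reject_score + 0.2)


# Toy 2-d vectors (hypothetical, not real word vectors)
accepted = [1.0, 0.0]

# Candidate matches the accepted terms and is unrelated to the rejected ones
print(term_score([1.0, 0.0], accepted, [0.0, 1.0]))  # about 0.83

# Same candidate, but it is also close to the rejected terms
print(term_score([1.0, 0.0], accepted, [1.0, 0.0]))  # about 0.45
```

The second call shows the effect of rejecting terms: the closer a candidate sits to what you rejected, the larger the denominator and the lower its score. The constant 0.2 keeps the score below 1 even when the reject_score is zero.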

Hi!

Thanks! You gave a very clear and thorough explanation :slight_smile:

Following up on the previous point about simplified vs. traditional Chinese: can I assume that by using traditional Chinese in my initial seed list, I would be focusing more on the data sources (within spaCy's language model) that are in traditional Chinese?

That's correct, yes. (Glad I could help :slight_smile: )