Applying terms.teach for Chinese

jsnleong · January 10, 2024, 12:40am

Hi! For the recipe "terms.teach", does it matter whether the initial seed list is in traditional or simplified Chinese? Will the results differ?

Also, when I reject a term suggested by the model, does the model apply any "negative" scoring to that term, so that subsequent suggestions take the rejected terms into account?

Thanks!

magdaaniol · January 10, 2024, 1:14pm

Welcome back @jsnleong

If you use a spaCy Chinese model (e.g. zh_core_web_lg), the results will differ: with simplified Chinese seeds you'll most likely receive simplified Chinese suggestions, and vice versa. The overall suggestions should still belong to the same semantic space regardless of the variant.

This is because the simplified and traditional tokens are treated as separate tokens in training and, consequently, they are represented by separate word vectors.

To illustrate, you can run this small experiment:

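import numpy as np
import spacy

# Requires the model to be installed: python -m spacy download zh_core_web_lg
nlp = spacy.load("zh_core_web_lg")

simplified = nlp("书")  # "book" in simplified Chinese
traditional = nlp("書")  # "book" in traditional Chinese

print(f"Simplified has vector: {simplified.has_vector}")
print(f"Traditional has vector: {traditional.has_vector}")
# Compare the vectors element-wise rather than only their norms
print(f"The vectors are the same: {np.array_equal(simplified.vector, traditional.vector)}")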

# Output
# Simplified has vector: True
# Traditional has vector: True
# The vectors are the same: False

Now, if you compared the output of terms.teach for these terms, you'd see that the exact suggestions are different, but the semantic space is the same for the simplified seed and the traditional seed:

Simplified result   Translation     Traditional result   Translation
本书                book            書裡                 book
书来                book comes      書籍                 books
此书                this book       本書                 book
书上                book            書當                 Shudang
书后                back of book    書中                 book
Regarding the rejected terms: yes, the model takes them into account. Each rejection iteratively updates the negative_vector, and the similarity between a candidate term and the negative_vector gives that candidate's reject_score. The reject_score then enters the final score for a term through this formula:

score = accept_score / (accept_score + reject_score + 0.2)
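
To make the effect of the formula concrete, here is a minimal sketch with toy numbers. Only the formula itself comes from the explanation above; the `cosine` and `term_score` helpers, the 2-d vectors, the candidate names, and the clamping of negative similarities to 0 are assumptions made for illustration and are not taken from Prodigy's source.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def term_score(accept_score, reject_score):
    """Final score for a candidate term, using the formula quoted above."""
    return accept_score / (accept_score + reject_score + 0.2)


# Hypothetical 2-d stand-ins for the vectors built from accepted and rejected terms.
accept_vector = [1.0, 0.0]
negative_vector = [0.0, 1.0]

# Hypothetical candidate terms, named by where they sit relative to the two vectors.
candidates = {
    "close_to_accepted": [0.9, 0.1],
    "in_between": [1.0, 1.0],
    "close_to_rejected": [0.1, 0.9],
}

scores = {}
for name, vector in candidates.items():
    # Assumption: negative similarities are clamped to 0 so the denominator stays positive.
    accept_score = max(cosine(vector, accept_vector), 0.0)
    reject_score = max(cosine(vector, negative_vector), 0.0)
    scores[name] = term_score(accept_score, reject_score)
    print(f"{name}: {scores[name]:.3f}")
```

In this toy setup, a candidate that sits closer to the rejected terms has a larger reject_score and so ends up with a lower final score. The constant 0.2 in the denominator also means the score stays below 1 even with no rejections: an accept_score of 1 and a reject_score of 0 gives 1 / 1.2, roughly 0.83.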

jsnleong · January 11, 2024, 3:32am

Hi!

Thanks! You gave a very clear and thorough explanation.

Following up on the simplified vs. traditional Chinese point: can I assume that by using traditional Chinese in my initial seed list, I would be focusing more on the data sources (behind spaCy's language model) that are in traditional Chinese?

magdaaniol · January 11, 2024, 7:13pm

That's correct, yes. (Glad I could help!)
