Hi! Can I check, for the recipe "terms.teach", does it matter whether the initial seed list is in traditional or simplified Chinese? Will the results differ?
Also, when I reject a term suggested by the model, does the model perform any "negative" scoring on that term, so that subsequent suggestions take those irrelevant terms into account?
If you use the spaCy Chinese model (e.g. zh_core_web_lg), the results will differ: with simplified Chinese seeds you'll most likely receive simplified Chinese suggestions, and vice versa. The overall suggestions should still belong to the same semantic space regardless of the variant.
This is because the simplified and traditional tokens are treated as separate tokens in training and, consequently, they are represented by separate word vectors.
To illustrate, you can run this small experiment:
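import numpy
import spacy

nlp = spacy.load("zh_core_web_lg")
simplified = nlp("书")  # book (simplified)
traditional = nlp("書")  # book (traditional)
print(f"Simplified has vector: {simplified.has_vector}")
print(f"Traditional has vector: {traditional.has_vector}")
# Compare the vectors element by element; equal norms alone would not prove the vectors match
same = numpy.array_equal(simplified.vector, traditional.vector)
print(f"The vectors are the same: {same}")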
# Output
# Simplified has vector: True
# Traditional has vector: True
# The vectors are the same: False
Now, if you compared the output of terms.teach for these terms, you'd see that the exact suggestions are different, but the semantic space is the same for the simplified seed and the traditional seed:
Simplified result | Translation  | Traditional result | Translation
本书              | book         | 書裡               | book
书来              | book comes   | 書籍               | books
此书              | this book    | 本書               | book
书上              | book         | 書當               | Shudang
书后              | back of book | 書中               | book
Regarding the rejected terms: yes, the model takes them into account. It iteratively updates the negative_vector and then computes the similarity between each candidate term and the negative_vector; that similarity is the reject_score. The reject_score is then used in computing the final score for a term using this formula:
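To make the idea concrete, here is a minimal sketch of how similarity to accepted terms and similarity to rejected terms could be combined into a single score. It is an illustration only: the helper names (cosine, mean_vector, term_score), the use of a plain average as the negative vector, the ratio itself, and the 0.2 smoothing constant are all assumptions, not necessarily what terms.teach does internally. The two-dimensional vectors are made-up toy values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def term_score(candidate, accepted, rejected, smoothing=0.2):
    """Hypothetical score: high when close to accepted terms and far from rejected ones."""
    accept_score = max(cosine(candidate, mean_vector(accepted)), 0.0)
    # With no rejections yet there is no negative vector, so the penalty is zero.
    reject_score = max(cosine(candidate, mean_vector(rejected)), 0.0) if rejected else 0.0
    return accept_score / (accept_score + reject_score + smoothing)


# Toy vectors standing in for real word vectors (assumed values).
accepted = [[1.0, 0.0]]
rejected = [[0.0, 1.0]]

print(term_score([1.0, 0.1], accepted, rejected))  # candidate near the accepted term
print(term_score([0.1, 1.0], accepted, rejected))  # candidate near the rejected term
```

In this sketch, every rejection shifts the averaged negative vector, so a candidate that resembles previously rejected terms gets a larger reject_score and drops down the ranking.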
Thanks! You gave a very clear and thorough explanation.
Following on from the previous point about simplified vs. traditional Chinese, can I assume that by using traditional Chinese in my initial seed list, I would be focusing more on the data sources (within spaCy's language model) that were in traditional Chinese?