Using terms.teach with Japanese

When I use terms.teach with English words, it works very well and quickly suggests relevant words in the web app.

For example, the following commands start the web server, and I get many food-related terms:

python -m prodigy dataset food_seeds_en "seeds for foods english"
python -m prodigy terms.teach food_seeds_en en_core_web_lg --seeds "carrot, spinach, pasta, soba, udon, pizza, Bibimbap, gyudon"

However, when I do the same with the Japanese language model and Japanese seeds, all the terms shown in the web app are in Latin letters, not Japanese. Why am I getting these results? Is the language model not being loaded properly?

python -m prodigy dataset food_seeds "seeds for foods"
python -m prodigy terms.teach food_seeds ja_core_news_lg --seeds "人参, ほうれん草, パスタ, そば, うどん, ピア, ビビンバ, 牛丼"

Hi! It looks like the problem is that the terms.teach recipe filters the vocabulary entries it suggests like this:

lexemes = [lex for lex in stream if lex.is_alpha and lex.is_lower]

In Japanese, it turns out that all entries return False for is_lower, which delegates to Python's built-in string method islower(). Japanese scripts have no upper or lower case, and islower() returns False for any string that contains no cased characters. Because no Japanese entry passes the filter, you only end up seeing a few fairly random lowercase entries in Latin characters.
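To make this concrete, here is a minimal sketch that reproduces the behaviour with plain Python strings rather than spaCy lexemes. The two helper functions are hypothetical stand-ins for the recipe's condition, assuming is_alpha and is_lower behave like str.isalpha() and str.islower():

```python
# Stand-in words: two Latin-script entries and three Japanese entries.
words = ["carrot", "Bibimbap", "人参", "うどん", "パスタ"]


def passes_original_filter(text: str) -> bool:
    # Mirrors `lex.is_alpha and lex.is_lower` on a plain string (assumption).
    return text.isalpha() and text.islower()


def passes_relaxed_filter(text: str) -> bool:
    # Same condition with the lowercase check removed.
    return text.isalpha()


kept_original = [w for w in words if passes_original_filter(w)]
kept_relaxed = [w for w in words if passes_relaxed_filter(w)]

print(kept_original)  # only the lowercase Latin-script word survives
print(kept_relaxed)   # every alphabetic word survives, Japanese included
```

The Japanese words are alphabetic but never count as lowercase, so the original condition drops all of them.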

As a quick workaround, you can run prodigy stats to find the location of your Prodigy installation and edit that line in recipe/terms.py. If you remove the "and lex.is_lower" part of the condition, you should see Japanese suggestions from the vectors.
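If you would still like to skip capitalised Latin-script entries after that change, one possible alternative (my own suggestion, not something the recipe does) is to compare the text with its lowercased form instead of calling islower(). The sketch below uses plain strings and a hypothetical keep_term helper; on spaCy lexemes the equivalent check would presumably be lex.text == lex.lower_:

```python
def keep_term(text: str) -> bool:
    # Hypothetical replacement condition: alphabetic, and unchanged by lower().
    # Caseless scripts such as Japanese are unchanged by lower(), so they pass.
    return text.isalpha() and text == text.lower()


# Illustrative candidates, not real vocabulary output.
candidates = ["pizza", "Pizza", "牛丼", "そば", "abc123"]
kept = [t for t in candidates if keep_term(t)]

print(kept)
```

This keeps the Japanese entries and lowercase Latin words while still dropping capitalised and non-alphabetic ones.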


That is super helpful, thanks!
