terms.teach bigrams returning noisy results

neal · September 28, 2018, 5:36pm

Hi!

I’m trying to use terms.teach with a model I built with FastText. Below are the steps:

Ran Gensim Phrases on text to produce bi/trigrams and output with “_” in between tokens, so “new york city” -> “new_york_city”
output all word vectors (uni/bi/trigrams)
used python -m spacy init-model datasetname /path/to/newmodel -v /path/to/all_terms.ft.vec to initialize a spacy model
Run prodigy terms.teach dataset /path/to/newmodel --seeds seeds.txt, where seeds.txt contains a single term that is definitely in /path/to/all_terms.ft.vec
Runs great, but doesn’t return very many bi/trigrams

So, I want to see how the bi/trigrams perform on similarity, so I do the following:

grep "_" /path/to/all_terms.ft.vec > /path/to/ngrams.vec # only keep the bi/trigrams
get the line count of /path/to/ngrams.vec and prepend it to the file for proper word2vec format
run python -m spacy init-model en /path/to/en_ngram_sm -v /path/to/ngrams.vec
everything runs fine, and I can verify that the seed terms are located in the vocab when loading the model and doing nlp.vocab.has_vector()
Run prodigy terms.teach datasetname /path/to/en_ngram_sm --seeds seeds.txt, and now the results are noise. I keep receiving unigrams (which shouldn’t even be in the model), like ‘ought’.
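
The filtering and header steps above can also be done in one pass in Python. This is only a minimal sketch: it assumes the input is a text-format .vec file whose first line is a "count dim" header (as FastText writes it), and the function name is hypothetical. It writes both the row count and the vector dimension as the new first line, which is what the word2vec text format expects.

```python
def write_ngram_vectors(src, dst, sep="_"):
    """Copy only the rows whose term contains `sep`, then write a fresh header."""
    rows = []
    with open(src, encoding="utf8") as f:
        header = f.readline().split()
        # Assumption: the first line is "<count> <dim>", as in FastText .vec files.
        dim = int(header[1])
        for line in f:
            # The term is everything before the first space on the row.
            term = line.split(" ", 1)[0]
            if sep in term:
                rows.append(line.rstrip("\n"))
    with open(dst, "w", encoding="utf8") as out:
        # word2vec text format: first line is "<number of rows> <dimension>".
        out.write(f"{len(rows)} {dim}\n")
        for row in rows:
            out.write(row + "\n")
    return len(rows)
```

Calling write_ngram_vectors("/path/to/all_terms.ft.vec", "/path/to/ngrams.vec") would then replace the grep and the manual line count. The sketch keeps all matching rows in memory, which may not suit a very large vectors file.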

I checked the model’s strings.json file and there are a lot of vocabulary terms that I didn’t have in the initial ngrams.vec file.

A couple of thoughts occurred to me as to why this would happen:

Using “_” in the seed terms (although this seems highly unlikely)
Inclusion of the default unigram terms in the vocab affects the results somehow

So, essentially, I created a lot of OOV vectors, but can’t get them to return in Prodigy.

honnibal · October 2, 2018, 11:14am

Hm! You’ve really done everything right here, so there’s definitely a usability problem here. Somewhere inside spaCy we’re adding default terms, although I can’t see exactly where this would be. This is obviously messing things up for you.

As a mitigation, perhaps you could edit the prodigy.terms.teach recipe to filter the questions there? This way you could force it to ask you only about phrases, or perhaps encode a length bias into the score, to give it a softer encouragement to ask you phrase questions.
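
To make the two options concrete, here is a rough sketch of both. It assumes the candidates are available as (score, term) pairs, which is a simplification for illustration; the actual stream inside terms.py may be shaped differently, and both function names are hypothetical.

```python
def phrases_only(scored_terms, sep="_"):
    """Hard filter: yield only (score, term) pairs whose term is a merged phrase."""
    for score, term in scored_terms:
        if sep in term:
            yield score, term


def with_length_bias(scored_terms, bonus=0.05, sep="_"):
    """Soft bias: add `bonus` per extra token so phrases rank higher.

    Unigrams are kept; they just receive no bonus.
    """
    for score, term in scored_terms:
        n_tokens = term.count(sep) + 1
        yield score + bonus * (n_tokens - 1), term


candidates = [(0.9, "ought"), (0.7, "new_york"), (0.6, "new_york_city")]
only_phrases = list(phrases_only(candidates))
reranked = sorted(with_length_bias(candidates, bonus=0.2), reverse=True)
```

The hard filter never asks about unigrams at all, while the bias only reorders the questions, so the bonus value would need tuning against the scale of the similarity scores.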

neal · October 2, 2018, 4:05pm

Thank you, Matthew. Could you clarify how I would edit terms.teach? Would I edit it in /usr/local/lib/ files?

neal · October 2, 2018, 5:10pm

Also, is there a way that I can just incorporate gensim’s KeyedVectors most_similar function if this is a spaCy issue?

ines · October 3, 2018, 9:27am

The source of the built-in recipes is shipped with Prodigy, so to edit terms.teach, you can look at prodigy/recipes/terms.py.

The easiest way to find the location of your Prodigy installation is to run the following command:

python -c "import prodigy; print(prodigy.__file__)"
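
If you want the path of the recipe file itself, the same lookup can be sketched as a small helper. The function name is hypothetical, and it assumes the recipe lives at recipes/terms.py inside the package, as described above.

```python
import importlib.util
from pathlib import Path


def package_file(package, *parts):
    """Return the path of a file inside an installed top-level package."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"{package} is not installed in this environment")
    # spec.origin points at the package's __init__ file; its parent is the package directory.
    return Path(spec.origin).parent.joinpath(*parts)


# Example (requires Prodigy to be installed):
# print(package_file("prodigy", "recipes", "terms.py"))
```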

neal · October 4, 2018, 9:48am

Thank you for your help.

The problem was coming from the lex.is_alpha condition when gathering terms in the stream_scored function of terms.py. This will strip out any merged terms with _ binding them. Because I had nothing but terms with _ in them, they all got dropped. I think this will also drop terms that contain numbers, correct?
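
A quick way to check this without loading a model: spaCy documents is_alpha as equivalent to calling isalpha() on the text, so plain Python strings show the same behaviour. The sketch below is only an illustration; allows_merged is a hypothetical replacement check, not something that exists in terms.py.

```python
def passes_is_alpha(term):
    """Mirror of the is_alpha check: True only if every character is alphabetic."""
    return term.isalpha()


def allows_merged(term, sep="_"):
    """Hypothetical looser check: accept merged terms whose parts are alphanumeric."""
    return all(part.isalnum() for part in term.split(sep))


terms = ["ought", "new_york_city", "route66", "route_66"]
kept_by_is_alpha = [t for t in terms if passes_is_alpha(t)]
kept_by_looser_check = [t for t in terms if allows_merged(t)]
```

Under the is_alpha check only "ought" survives from this list, so terms with digits are dropped as well as terms with underscores. The looser check keeps all four.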

Matthew mentioned the sense2vec to me. Perhaps I should try the phrase maker instead?

honnibal · October 5, 2018, 10:51am

@neal Thanks for figuring out that asterisk problem!

You should probably make a custom recipe for yourself based on the terms.teach example, so you can version your code properly etc. We’ve been thinking about how to get the right balance in our tutorials between getting people started on the built-in recipes, and when to advise them to switch over to their own recipe files.

The thing is, we’ll never be able to add enough options to the built-in recipes to cover every requirement. At some point, the code itself becomes a much better API. I know this is how I like to use tools myself, a lot of the time. At some point the tool is complex enough that I’m learning a mini configuration language… And at that point, I’d rather just be working in Python. spaCy follows a similar philosophy as well, which I sometimes flippantly state as “Let them ~~eat cake~~ write code”. Hopefully it’s not unpopular enough to get me beheaded.
