terms.teach bigrams returning noisy results


(Neal Lewis) #1


I’m trying to use terms.teach with a model I built with FastText. Below are the steps:

  1. Ran Gensim Phrases on text to produce bi/trigrams and output with “_” in between tokens, so “new york city” -> “new_york_city”
  2. Output all word vectors (uni/bi/trigrams)
  3. Used python -m spacy init-model datasetname /path/to/newmodel -v /path/to/all_terms.ft.vec to initialize a spaCy model
  4. Ran prodigy terms.teach dataset /path/to/newmodel --seeds seeds.txt, where seeds.txt contains a single term that is definitely in /path/to/all_terms.ft.vec
  5. Runs great, but doesn’t return very many bi/trigrams

So, I want to see how the bi/trigrams perform on similarity, so I do the following:

  1. grep "_" /path/to/all_terms.ft.vec > /path/to/ngrams.vec # keep only the bi/trigrams
  2. Get the line count of /path/to/ngrams.vec and prepend it to the file for proper word2vec format
  3. Run python -m spacy init-model en /path/to/en_ngram_sm -v /path/to/ngrams.vec
  4. Everything runs fine, and I can verify that the seed terms are in the vocab by loading the model and calling nlp.vocab.has_vector()
  5. Run prodigy terms.teach datasetname /path/to/en_ngram_sm --seed seeds.txt, and now the results are noise. I keep receiving unigrams (which shouldn’t even be in the model), like ‘ought’
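
The grep-and-prepend steps above can also be sketched in plain Python. This is only an illustration, assuming a word2vec text file with one space-separated "term v1 v2 ..." row per line; the function name keep_ngram_vectors and the sample rows are hypothetical, not part of the original workflow.

```python
def keep_ngram_vectors(lines, sep="_"):
    """Keep only rows whose term contains `sep` and prepend a word2vec header.

    Assumes each row is "term v1 v2 ..." separated by whitespace.
    Rows with two or fewer fields (blank lines, an existing
    "<count> <dim>" header) are skipped.
    """
    rows = []
    for line in lines:
        parts = line.split()  # also drops the trailing space some .vec files have
        if len(parts) <= 2:
            continue
        if sep in parts[0]:
            rows.append(parts)
    if not rows:
        return []
    dim = len(rows[0]) - 1  # number of vector components per row
    header = f"{len(rows)} {dim}"
    return [header] + [" ".join(row) for row in rows]


# Toy input standing in for all_terms.ft.vec (values are made up).
sample = [
    "3 2",
    "new_york 0.1 0.2",
    "ought 0.3 0.4",
    "new_york_city 0.5 0.6",
]
for row in keep_ngram_vectors(sample):
    print(row)
```

The header written here carries both the row count and the vector dimension, which is what the word2vec text format expects on its first line.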

I checked the model’s strings.json file, and there are a lot of vocabulary terms that weren’t in the initial ngrams.vec file.

A couple of thoughts occurred to me as to why this would happen:

  1. Using “_” in the seed terms (although this seems highly unlikely)
  2. Inclusion of the default unigram terms in the vocab affect the results somehow

So, essentially, I created a lot of OOV vectors but can’t get Prodigy to return them :confused:

(Matthew Honnibal) #2

Hm! You’ve really done everything right here, so there’s definitely a usability problem here. Somewhere inside spaCy we’re adding default terms, although I can’t see exactly where this would be. This is obviously messing things up for you.

As a mitigation, perhaps you could edit the prodigy.terms.teach recipe to filter the questions there? This way you could force it to ask you only about phrases, or perhaps encode a length bias into the score, to give it a softer encouragement to ask you phrase questions.
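
Both mitigations can be sketched as small generator wrappers. This is a rough sketch only: it assumes the recipe produces (score, term) pairs and that phrases are joined with “_”, neither of which is confirmed here for the actual recipe code. The names phrases_only and with_length_bias and the bonus value are hypothetical.

```python
def phrases_only(scored_terms, sep="_"):
    """Hard filter: yield only (score, term) pairs whose term is a merged phrase."""
    for score, term in scored_terms:
        if sep in term:
            yield score, term


def with_length_bias(scored_terms, bonus=0.1, sep="_"):
    """Soft bias: add `bonus` to the score for every token beyond the first."""
    for score, term in scored_terms:
        n_tokens = term.count(sep) + 1
        yield score + bonus * (n_tokens - 1), term


# Made-up scores, purely for illustration.
candidates = [(0.9, "ought"), (0.5, "new_york"), (0.4, "new_york_city")]

print(list(phrases_only(candidates)))
print(sorted(with_length_bias(candidates, bonus=0.25), reverse=True))
```

The hard filter never asks about unigrams, while the bias only reorders candidates, so a strongly related unigram can still surface. The biased score is no longer bounded by the original score range.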

(Neal Lewis) #3

Thank you, Matthew. Could you clarify how I would edit terms.teach? Would I edit it in /usr/local/lib/ files?

(Neal Lewis) #4

Also, is there a way that I can just incorporate gensim’s KeyedVectors most_similar function if this is a spaCy issue?
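
For reference, gensim’s KeyedVectors.load_word2vec_format and most_similar would do this lookup on the real .vec file. The pure-Python sketch below only illustrates the underlying idea, a cosine-similarity ranking against a seed term; the toy vectors and the function names are hypothetical, and it is not gensim’s implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(vectors, seed, topn=10):
    """Rank every term except `seed` by cosine similarity to the seed's vector."""
    target = vectors[seed]
    scored = [
        (term, cosine(target, vec))
        for term, vec in vectors.items()
        if term != seed
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors standing in for rows of ngrams.vec (values are made up).
vectors = {
    "new_york": [1.0, 0.0],
    "new_york_city": [0.9, 0.1],
    "ought": [0.0, 1.0],
}
print(most_similar(vectors, "new_york", topn=2))
```

Because the ranking only sees the vectors it is given, it cannot return terms that were never in the file, which makes it a useful sanity check on the n-gram vectors themselves.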

(Ines Montani) #5

The source of the built-in recipes is shipped with Prodigy, so to edit terms.teach, you can look at prodigy/recipes/terms.py.

The easiest way to find the location of your Prodigy installation is to run the following command:

python -c "import prodigy; print(prodigy.__file__)"

(Neal Lewis) #6

Thank you for your help.

The problem was coming from the lex.is_alpha condition used when gathering terms in the stream_scored function of terms.py. This strips out any merged terms that have _ binding them. Because I had nothing but terms with _ in them, they all got dropped. I think this will also drop terms that contain numbers, correct?
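
A quick way to check this outside Prodigy: spaCy documents is_alpha as equivalent to calling isalpha() on the text, so plain Python strings show the same behaviour. The sketch below also includes a hypothetical replacement predicate, is_phrase_or_alpha, that accepts “_”-joined phrases; it is an illustration of one possible relaxed check, not the recipe’s actual code.

```python
def is_phrase_or_alpha(text, sep="_"):
    """True if every `sep`-separated piece of `text` is purely alphabetic."""
    return all(piece.isalpha() for piece in text.split(sep))


# Columns: term, str.isalpha(), relaxed check.
for term in ["ought", "new_york_city", "covid19", "route_66"]:
    print(term, term.isalpha(), is_phrase_or_alpha(term))
```

With str.isalpha(), both the underscore terms and the terms containing digits come back False, so a filter on that condition drops them. The relaxed check keeps “new_york_city” but still rejects anything with digits or an empty piece (such as a leading “_”).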

Matthew mentioned the sense2vec to me. Perhaps I should try the phrase maker instead?

(Matthew Honnibal) #7

@neal Thanks for figuring out that is_alpha problem!

You should probably make a custom recipe for yourself based on the terms.teach example, so you can version your code properly etc. We’ve been thinking about how to get the right balance in our tutorials between getting people started on the built-in recipes, and when to advise them to switch over to their own recipe files.

The thing is, we’ll never be able to add enough options to the built-in recipes to cover every requirement. At some point, the code itself becomes a much better API. I know this is how I like to use tools myself, a lot of the time. At some point the tool is complex enough that I’m learning a mini configuration language… And at that point, I’d rather just be working in Python. spaCy follows a similar philosophy as well, which I sometimes flippantly state as “Let them ~~eat cake~~ write code”. Hopefully it’s not unpopular enough to get me beheaded :stuck_out_tongue:.