Equal Weights for token variation

spacy
usage

(Madhu Jahagirdar) #1

In my corpus, x-ray, xray and x ray occur with different distribution patterns, but they all mean the same thing. How can I let batch.train know this, so that they get the same weighting in the vector space and end up near each other?


(Justin Du Jardin) #2

Perhaps this doc on customizing vectors will help. I’m not sure if it does exactly what you want, but it merges similar tokens together to reduce the size of the vocabulary. From the site:

Vocab.prune_vectors reduces the current vector table to a given number of unique entries, and returns a dictionary containing the removed words, mapped to (string, score) tuples, where string is the entry the removed word was mapped to, and score the similarity score between the two words.

{
    'Shore': ('coast', 0.732257),
    'Precautionary': ('caution', 0.490973),
    'hopelessness': ('sadness', 0.742366),
    'Continous': ('continuous', 0.732549),
    'Disemboweled': ('corpse', 0.499432),
    'biostatistician': ('scientist', 0.339724),
    'somewheres': ('somewheres', 0.402736),
    'observing': ('observe', 0.823096),
    'Leaving': ('leaving', 1.0)
}

If you want to look at the code, it’s here on GitHub.
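
As a rough illustration of how the returned dictionary could be put to use, here is a minimal sketch in plain Python that keeps only the remappings above a similarity cutoff. The helper name confident_remaps and the 0.7 threshold are hypothetical choices for this example, not part of spaCy; the data is the sample output quoted above.

```python
# Sample of the dictionary returned by Vocab.prune_vectors (copied from the docs excerpt above).
removed = {
    "Shore": ("coast", 0.732257),
    "Precautionary": ("caution", 0.490973),
    "hopelessness": ("sadness", 0.742366),
    "Continous": ("continuous", 0.732549),
    "Disemboweled": ("corpse", 0.499432),
    "biostatistician": ("scientist", 0.339724),
    "somewheres": ("somewheres", 0.402736),
    "observing": ("observe", 0.823096),
    "Leaving": ("leaving", 1.0),
}


def confident_remaps(removed_words, min_score=0.7):
    """Return {removed word: replacement} for pairs scoring at least min_score.

    min_score is an arbitrary cutoff chosen for illustration (an assumption).
    Entries that map a word to itself are skipped.
    """
    return {
        word: target
        for word, (target, score) in removed_words.items()
        if score >= min_score and word != target
    }


remaps = confident_remaps(removed)
for word, target in sorted(remaps.items()):
    print(word, "->", target)
```

Low-scoring pairs such as 'biostatistician' and 'scientist' are dropped, which suggests the remapped list probably needs a manual check before it is used for anything beyond shrinking the vector table.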


(Madhu Jahagirdar) #3

Would sense2vec be useful in any way? I checked the online demo of sense2vec, entered x-ray, x ray and xray, and it found them near each other, which looks very promising.


(Matthew Honnibal) #4

All of the models use the NORM attribute to represent the text of the word, along with the PREFIX, SUFFIX and SHAPE subword features. If you have these variations, you can map them to the same normalized form of the string through this attribute, which doesn’t need to match the literal input text. See here: https://spacy.io/usage/adding-languages#norm-exceptions

Making x ray have the same string requires an extra step, because you’ll need to merge the two tokens into one. You can do this by adding a Matcher as the first component of your pipeline, with an on_match callback that merges the tokens into one.

Mapping the words to the same vector as @justindujardin suggests is also a great step, because the pre-trained word vectors are keyed by the ORTH attribute, which is going to remain distinct (as this reflects the original text, you can’t change it). However, the vector mapping only affects the pre-trained vectors. To really get the same features for the word, you want to make sure they have the same NORM attribute.
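
The idea of giving every variant one shared normalized form, with the extra merge step for the two-token case, can be sketched in plain Python without spaCy. Everything named here (NORMS, MERGES, normalize) is hypothetical and only illustrates the logic; in a real pipeline the same mapping would be applied through the NORM attribute and a Matcher-based merge as described above. The lowercase fallback is also an assumption made for this sketch.

```python
# Hypothetical lookup tables: single-token variants and two-token variants
# that should all end up with the same normalized form.
NORMS = {"x-ray": "x-ray", "xray": "x-ray", "x-rays": "x-ray", "xrays": "x-ray"}
MERGES = {("x", "ray"): "x-ray", ("x", "rays"): "x-ray"}


def normalize(tokens):
    """Return (text, norm) pairs, merging known two-token variants into one."""
    result = []
    i = 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if pair in MERGES:
            # Merge the two tokens: the text keeps the original words,
            # the norm becomes the shared canonical form.
            result.append((" ".join(tokens[i:i + 2]), MERGES[pair]))
            i += 2
        else:
            token = tokens[i]
            # Fall back to the lowercased text when no variant is listed.
            result.append((token, NORMS.get(token.lower(), token.lower())))
            i += 1
    return result


pairs = normalize(["An", "x", "ray", "and", "an", "Xray"])
print(pairs)
```

The original text of each token stays untouched, mirroring the point that ORTH cannot change while the normalized form can.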


(Madhu Jahagirdar) #5

I tried prune_vectors and ran into the following issue:

Vocab size: 1071545

import spacy

nlp = spacy.load('/Users/philips/Development/BigData/RS/annotation/Prodigy/followup_report_3M_model/')
print(len(nlp.vocab))

nlp.vocab.prune_vectors(102400)

nlp.to_disk('/Users/philips/Development/BigData/RS/annotation/Prodigy/followup_report_3M_model_pruned')
/Users/philips/PycharmProjects/FollowupDetection/PruneVectors.py:8: RuntimeWarning: invalid value encountered in true_divide
  nlp.vocab.prune_vectors(102400)
/Users/philips/Development/BigData/RS/annotation/venv/lib/python3.5/site-packages/numpy/core/_methods.py:26: RuntimeWarning: invalid value encountered in reduce
  return umr_maximum(a, axis, None, out, keepdims)

Am I missing anything?


(Matthew Honnibal) #6

@madhujahagirdar Seems like a bug in spaCy, sorry! I didn’t add an epsilon for numeric stability to the division here: https://github.com/explosion/spaCy/blob/master/spacy/vectors.pyx#L250

You could do nlp.vocab.vectors.data += 1e-12 to work around this until the bug is fixed.
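
To show why a missing epsilon matters, here is a minimal plain-Python sketch of a cosine similarity, assuming the invalid values come from rows whose norm is zero (for example all-zero vectors). The function cosine and its eps parameter are hypothetical and not spaCy's implementation. Note two differences from the thread: NumPy returns NaN with a RuntimeWarning for 0/0 where plain Python raises ZeroDivisionError, and this sketch adds the epsilon to the norms, whereas the workaround above adds it to the vector data itself.

```python
import math


def cosine(u, v, eps=0.0):
    """Cosine similarity of two vectors; eps is added to each norm for stability."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u)) + eps
    norm_v = math.sqrt(sum(b * b for b in v)) + eps
    return dot / (norm_u * norm_v)


zero_row = [0.0, 0.0, 0.0]  # stands in for a word whose vector is all zeros
query = [1.0, 2.0, 3.0]

# Without an epsilon, the zero norm makes the division undefined.
try:
    cosine(zero_row, query)
    failed = False
except ZeroDivisionError:
    failed = True

# With a tiny epsilon, the similarity is simply 0.0.
stable = cosine(zero_row, query, eps=1e-12)
print(failed, stable)
```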


(Madhu Jahagirdar) #7

I got it working. However, it raises an interesting problem, the one you mentioned earlier and some more. Pruning works for the pre-trained vectors (as you mentioned), and for values without pre-trained vectors we need to map the NORM attribute during classification. Now, how do we choose the NORM value to map to? See the example below. Additionally, I found that because the classification takes other surrounding values such as . and \n into account, it becomes a hard choice to make. Is there a process that uses the remapped list to arrive at the right NORM value, or do I need to follow a different path than the one below?

import spacy
from spacy.symbols import ORTH, NORM

nlp = spacy.load('/Users/philips/Development/BigData/RS/annotation/Prodigy/followup_report_3M_model_pruned/')

# Keep the original text in ORTH and map every variant to the same NORM.
nlp.tokenizer.add_special_case('x-ray', [{ORTH: 'x-ray', NORM: 'x-ray'}])
nlp.tokenizer.add_special_case('xrays', [{ORTH: 'xrays', NORM: 'x-ray'}])
nlp.tokenizer.add_special_case('x-rays', [{ORTH: 'x-rays', NORM: 'x-ray'}])