Similarity > 1? How to debug?

I am trying to debug a textcat misclassification. I am starting with en_core_web_md and then training textcat with my own data. Here is the problem:

  1. Repurchased 352,100 shares of our common stock for $4.3 million. (gets misclassified)
  2. Repurchase 352,100 shares of our common stock for $4.3 million. (just removing the "d" gets it classified correctly)

I removed the "d" from "Repurchased" and it was classified correctly. Then I checked the similarity of the two sentences and got something larger than 1?

Out[6]: 1.0000000004917455

Something is off. How should I go about debugging it? I also looked at the similarity of "Repurchased" vs. "Repurchase" and it is 0.9999999467980901, which makes sense.

Hi! Which version of spaCy are you running? It can sometimes happen that the number is greater than 1 (floating point imprecision), but if I remember correctly, the similarity methods should have a condition that just makes it return 1.0 in those cases so the result is less confusing.
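To illustrate the floating-point point, here is a minimal sketch in plain Python. It is not spaCy's actual implementation; the function name `cosine_similarity` and the `clamp` parameter are made up for this example. It computes cosine similarity by hand and optionally clamps the result into [-1, 1], which is the kind of guard described above.

```python
import math


def cosine_similarity(a, b, clamp=True):
    """Cosine similarity of two equal-length sequences of floats.

    Hypothetical helper for illustration only, not spaCy's code.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # No direction to compare against, so report no similarity.
        return 0.0
    sim = dot / (norm_a * norm_b)
    if clamp:
        # Rounding in the sums and square roots can push the ratio
        # marginally outside [-1, 1]; clamp it back into range.
        sim = max(-1.0, min(1.0, sim))
    return sim


# Arbitrary example vector compared with itself.
vec = [0.1, 0.2, 0.3, 0.7]
raw = cosine_similarity(vec, vec, clamp=False)
safe = cosine_similarity(vec, vec, clamp=True)
print(raw, safe)
```

Whether the unclamped value lands slightly above or below 1 depends on the input and the order of the floating-point operations, so a value like 1.0000000004917455 is not by itself a sign that the vectors or the model are broken.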

I am running version 2.2.3 of spaCy. I checked the word vectors for "Repurchased" and "Repurchase" and they are the same. I don't get why the textcat produces such different probabilities for basically the same sentence.

So what are the text classification scores you get for the two sentences?

It went from under 0.9% to over 99% just by changing "Repurchased" to "Repurchase".

I tried to debug it a little. I noticed that thinc.neural._classes.feature_extracter.FeatureExtracter produces a slightly different array for the two sentences. Only the second element of the array is different after the FeatureExtracter runs in the predict function for textcat. Not sure if that helps with debugging the issue.
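One way to pin down exactly which feature values differ is to compare the two arrays cell by cell. The following is only a sketch in plain Python: `diff_positions` is a hypothetical helper, and the numbers are placeholder values, not real spaCy hashes. With real NumPy arrays, calling `.tolist()` on each one first should produce the nested lists this helper expects.

```python
def diff_positions(rows_a, rows_b):
    """Return (row, col, value_a, value_b) for every cell that differs.

    Hypothetical helper: rows are tokens, columns are extracted features.
    """
    if len(rows_a) != len(rows_b):
        raise ValueError("arrays have different numbers of rows")
    diffs = []
    for r, (row_a, row_b) in enumerate(zip(rows_a, rows_b)):
        if len(row_a) != len(row_b):
            raise ValueError("row %d has a different number of columns" % r)
        for c, (val_a, val_b) in enumerate(zip(row_a, row_b)):
            if val_a != val_b:
                diffs.append((r, c, val_a, val_b))
    return diffs


# Placeholder feature arrays for the two sentences (values are made up).
features_a = [[11, 21, 31], [12, 22, 32]]
features_b = [[11, 99, 31], [12, 22, 32]]

diffs = diff_positions(features_a, features_b)
print(diffs)
```

Knowing the row (token) and column (feature) of each difference would show which token attribute the model sees differently for the two sentences, even though their word vectors are identical.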