Similarity > 1? How to debug?

pl6306 · February 26, 2020, 3:50pm

I am trying to debug a textcat misclassification. I am starting with en_core_web_md and then training textcat with my own data. Here is the problem:

Repurchased 352,100 shares of our common stock for $4.3 million. (gets mis-classified)
Repurchase 352,100 shares of our common stock for $4.3 million. (just removing the d gets it classified correctly)

I removed the d from Repurchased and it was properly classified. Then I checked the similarity of the two sentences and got something larger than 1?

doc_classified.similarity(doc_mis_classified)
Out[6]: 1.0000000004917455

Something is off. How should I go about debugging it? I also looked at the similarity of Repurchased vs Repurchase and it is 0.9999999467980901, which makes sense.

ines · February 27, 2020, 2:09pm

Hi! Which version of spaCy are you running? It can sometimes happen that the number is greater than 1 (floating point imprecision), but if I remember correctly, the similarity methods should have a condition that just makes it return 1.0 in those cases so the result is less confusing.
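To illustrate the clamping idea, here is a minimal sketch in plain Python. It is not spaCy's actual implementation; `cosine_similarity` and `clamped_similarity` are hypothetical helpers written only for this example. The value reported above differs from 1.0 by roughly 5e-10, which is well within the rounding error you would expect from 32-bit float vectors.

```python
import math


def cosine_similarity(a, b):
    """Plain cosine similarity. Rounding can push the result slightly past 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # No direction to compare, so report 0.0 instead of dividing by zero.
        return 0.0
    return dot / (norm_a * norm_b)


def clamped_similarity(a, b):
    """Same as above, but forced into the valid range [-1.0, 1.0]."""
    return max(-1.0, min(1.0, cosine_similarity(a, b)))


# The value reported in this thread, clamped the same way.
reported = 1.0000000004917455
clamped = max(-1.0, min(1.0, reported))
print(clamped)  # 1.0

# The overshoot is tiny, so the two docs are effectively identical in vector space.
print(math.isclose(reported, 1.0, abs_tol=1e-6))  # True
```

So a result like this can be read as "similarity is 1.0", and it does not by itself explain the different textcat scores.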

pl6306 · February 27, 2020, 7:55pm

I am running version 2.2.3 of spaCy. I checked the word vectors for Repurchased and Repurchase and they are the same. I don't get why textcat produces such different probabilities for basically the same sentence.

ines · February 28, 2020, 9:06am

So what are the text classification scores you get for the two sentences?

pl6306 · February 28, 2020, 9:58pm

It went from under 0.9% to over 99% just by changing Repurchased to Repurchase. (Screenshot attached: spacy_screenshot)

pl6306 · March 2, 2020, 4:57am

I tried to debug it a little. I noticed that thinc.neural._classes.feature_extracter.FeatureExtracter produces a slightly different array for the two sentences. Only the second element of the array differs after the FeatureExtracter runs in the predict function for textcat. Not sure if that helps with debugging the issue.
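One possible reading of that difference, offered as an assumption rather than something confirmed in this thread: if the textcat model's feature extractor looks at lexical attributes (lower-cased form, prefix, suffix) in addition to the static vectors, then two words with identical vectors can still produce different inputs. The sketch below uses a hypothetical `lexical_features` helper that only mimics spaCy's default conventions (prefix = first character, suffix = last three characters); it does not call spaCy or thinc.

```python
def lexical_features(word):
    """Hypothetical stand-in for per-token lexical attributes.

    Mimics spaCy's default conventions: prefix is the first character,
    suffix is the last three characters. Not the real FeatureExtracter.
    """
    return {
        "lower": word.lower(),
        "prefix": word[:1],
        "suffix": word[-3:],
    }


a = lexical_features("Repurchased")
b = lexical_features("Repurchase")

# List the attributes on which the two words disagree.
differing = [name for name in a if a[name] != b[name]]
print(differing)  # ['lower', 'suffix']
print(a["suffix"], b["suffix"])  # sed ase
```

If the model does learn embeddings for such attributes, a change in one token's lower-cased form and suffix could be enough to move the prediction, even though the document-level vector similarity is effectively 1.0.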
