Hi there,
I’m using `.similarity` in spaCy to try to tag entities against a category list. I have a 1,500-item list of category strings like:
Rap, Music, Music & Arts
Bakery & Sweets, Food, Food & Drink
...
I’m then doing something like `nlp('Kanye West').similarity(nlp(category_string))` for each category option, and choosing the category string with the highest similarity score as my category.
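For reference, the selection step boils down to an argmax over cosine similarities, which is what `Doc.similarity` computes over the averaged word vectors. Here’s a self-contained sketch with toy stand-in vectors (in my real code the vectors come from `nlp(text)` with `en_core_web_lg`; the numbers below are made up):

```python
import math

# Toy stand-ins for spaCy document vectors. In reality these come from
# nlp(text).vector with en_core_web_lg; the values here are hypothetical.
vectors = {
    "Kanye West":                          [0.9, 0.1, 0.0],
    "Rap, Music, Music & Arts":            [0.8, 0.2, 0.1],
    "Bakery & Sweets, Food, Food & Drink": [0.1, 0.9, 0.3],
}

def cosine(a, b):
    # Doc.similarity() is cosine similarity over the averaged word vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def rank_categories(text, categories):
    # Score every category and sort best-first; I take the first entry as
    # the match, and the top 10 of this list are the "other options" I
    # mention further down.
    scored = [(cosine(vectors[text], vectors[c]), c) for c in categories]
    return sorted(scored, reverse=True)

ranked = rank_categories(
    "Kanye West",
    ["Rap, Music, Music & Arts", "Bakery & Sweets, Food, Food & Drink"],
)
print(ranked[0][1])  # -> Rap, Music, Music & Arts
```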
It actually works pretty well about 80–90% of the time: `en_core_web_lg` usually knows that Kanye West is most certainly associated with Rap, Music, Music & Arts, etc.
However, there are a number of times where it doesn’t do such a great job, for example:
- Matched data: `Blood In Blood Out - TV, Movie | Blood In Blood Out` to value: `Disease, Health, Health & Sports` (easy mistake!)
- Matched data: `Catcher In The Rye | Catcher In The Rye` to value: `Antique, Home & Garden, Home & Community`
- Matched data: `CSI |` to value: `Video Games, Entertainment, Music & Arts` (so close!)
I have two thoughts as to how to correct this with Prodigy:

- Oftentimes, it looks like one of the other options in the top 10 “highest similarity” results is the correct one. Could I create a Prodigy task to display these options, annotate which one is correct, and then “extend” the word-vector knowledge of `en_core_web_lg`? (Am I even explaining this correctly?)
- Could I create a Prodigy task to build out my own dictionary of “enrichment” tags, via an interface where the annotator is shown ‘Catcher in the Rye’ and asked to manually add additional strings like “book reading literature”? These strings would then be appended just prior to the similarity calculation (e.g. the similarity calc would see `Catcher in the Rye book reading literature`) to help guide it.
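Concretely, for that second idea I picture something like the following (the `enrichment` dict and the `enrich` helper are hypothetical names; the dict is what the Prodigy annotators would build up, and the enriched string is what would get passed to `nlp()` for the similarity calculation):

```python
# Hypothetical dictionary of annotator-supplied enrichment tags
# (this is what the Prodigy task would populate).
enrichment = {
    "Catcher in the Rye": "book reading literature",
}

def enrich(text):
    # Append any enrichment tags just prior to the similarity calculation,
    # so nlp() sees the extra context words too.
    extra = enrichment.get(text)
    return f"{text} {extra}" if extra else text

print(enrich("Catcher in the Rye"))  # -> Catcher in the Rye book reading literature
print(enrich("Kanye West"))          # unchanged: Kanye West
```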
Anyway, I’m certain there’s a smarter approach entirely, but this one has been very productive so far; if only I could push its accuracy a little bit further!