Extending `en_core_web_lg` word vectors with Prodigy

Hi there,

I’m using `.similarity` in spaCy to try to tag entities using a category list. I have a 1,500-item list of category strings like:

```
Rap, Music, Music & Arts
Bakery & Sweets, Food, Food & Drink
...
```

I’m then doing something like `nlp('Kanye West').similarity(nlp(category_string))` for each category option and choosing the category string with the highest similarity score as my category.
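
Concretely, the matching loop is roughly this (with the category docs precomputed so each string is only parsed once):

```python
import spacy

nlp = spacy.load("en_core_web_lg")
categories = ["Rap, Music, Music & Arts", "Bakery & Sweets, Food, Food & Drink"]  # ... all 1,500
category_docs = [nlp(c) for c in categories]

def best_category(text):
    doc = nlp(text)
    # Pick the category whose averaged word vector is closest to the text's
    return max(zip(categories, category_docs), key=lambda pair: doc.similarity(pair[1]))[0]
```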

It actually works pretty well about 80-90% of the time: `en_core_web_lg` usually knows that Kanye West is most certainly associated with Rap, Music, Music & Arts, etc.

However, there are a number of times where it doesn’t do such a great job, for example:

- Matched data: `Blood In Blood Out - TV, Movie | Blood In Blood Out` to value: `Disease, Health, Health & Sports` (easy mistake!)
- Matched data: `Catcher In The Rye | Catcher In The Rye` to value: `Antique, Home & Garden, Home & Community`
- Matched data: `CSI |` to value: `Video Games, Entertainment, Music & Arts` (so close!)

I have two thoughts as to how to correct this with Prodigy:

  1. Oftentimes, it looks like one of the other options in the top 10 “highest similarity” list is the correct one. Could I create a Prodigy task to display these options, annotate which one is correct, and “extend” the word vector knowledge of `en_core_web_lg`…? Am I even explaining this correctly?

  2. Could I create a Prodigy task to build out my own dictionary of “enrichment” tags, via an interface where the annotator is shown ‘Catcher in the Rye’ and asked to manually add strings like “book reading literature”? These would then be appended just before the similarity calculation (so it would see ‘Catcher in the Rye book reading literature’) to help guide the result.

Anyway, I’m certain there’s a smarter approach entirely, but this one seems very productive, if only I could push its accuracy a little bit further!

It sounds like you’re off to a great start. I think what you want to do is use this word-vector-based initial model to train a more powerful text classification system. The word vector model you’re using at the moment averages the word vectors over the document, so it will treat Kanye West the same as West Kanye; it’s not sensitive to word order, which limits the approach. You should be able to train a model that’s sensitive to longer phrases pretty easily, though.
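
You can see this order-insensitivity for yourself:

```python
import spacy

nlp = spacy.load("en_core_web_lg")
# The doc vector is an average of the token vectors, so word order is ignored
print(nlp("Kanye West").similarity(nlp("West Kanye")))  # ~1.0
```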

You can definitely do what you suggest in 1): build a recipe where you use the model to suggest some options. All you need to do is add an `options` entry to the tasks. You can compute these options in your stream generator, so it’s easy to adapt them on the fly.
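
Here’s a minimal sketch of such a recipe. The recipe name `cat.choice`, the `categories.txt` file, and the top-10 cutoff are all placeholders, not fixed conventions:

```python
import prodigy
from prodigy.components.loaders import JSONL
import spacy

@prodigy.recipe(
    "cat.choice",  # hypothetical recipe name
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("JSONL file of {'text': ...} examples", "positional", None, str),
)
def cat_choice(dataset, source):
    nlp = spacy.load("en_core_web_lg")
    categories = [line.strip() for line in open("categories.txt")]  # your 1,500 strings
    category_docs = [nlp(c) for c in categories]

    def get_stream():
        for eg in JSONL(source):
            doc = nlp(eg["text"])
            # Keep the 10 most similar categories as multiple-choice options
            scored = sorted(zip(categories, category_docs),
                            key=lambda pair: doc.similarity(pair[1]), reverse=True)[:10]
            eg["options"] = [{"id": cat, "text": cat} for cat, _ in scored]
            yield eg

    return {"dataset": dataset, "stream": get_stream(), "view_id": "choice"}
```

You’d run it with something like `prodigy cat.choice my_dataset my_data.jsonl -F recipe.py`.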

What you suggest in 2) is a bit more difficult to implement. We opted not to make this an easy, obvious solution because it’s usually not a great approach: you’ll get a lot of variation in what people type, and people’s choices will be heavily influenced by the specific run of examples they happen to see. The set of categories you classify into is very important. You need a classification scheme that divides up the concept space you’re working in as neatly as possible. You’ll always have a few boundary cases, but some schemes will divide things up better than others.

My recommendation would be to start with a recipe that scores the texts by similarity to one category and presents them highest-scoring first, as in the sketch below. Click accept/reject through those, possibly stopping when your acceptance rate gets too low. This should be pretty quick: you should be able to spam “accept” at the start of annotation and spam “reject” towards the end. At the end, when everything’s a reject, I normally click so fast that it all flashes by, and if something looks wrong I can undo the 5 or 6 items I’ve clicked past to fix the error.
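
A sketch of that recipe, again with hypothetical names, and sorting the whole stream up front rather than using Prodigy’s built-in sorters:

```python
import prodigy
from prodigy.components.loaders import JSONL
import spacy

@prodigy.recipe(
    "cat.sorted",  # hypothetical recipe name
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("JSONL file of {'text': ...} examples", "positional", None, str),
    label=("Category to annotate", "positional", None, str),
)
def cat_sorted(dataset, source, label):
    nlp = spacy.load("en_core_web_lg")
    cat_doc = nlp(label)
    # Score every example once, then serve them highest-similarity first
    scored = [(nlp(eg["text"]).similarity(cat_doc), eg) for eg in JSONL(source)]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    def get_stream():
        for score, eg in scored:
            eg["label"] = label  # the classification view shows the text with this label
            eg["meta"] = {"score": round(score, 3)}
            yield eg

    return {"dataset": dataset, "stream": get_stream(), "view_id": "classification"}
```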

Once you’ve got your accept/reject/ignore annotations, you can use the `textcat.batch-train` recipe to train a more powerful text classifier. You can do this on just one category and see how accurate it is, and possibly use `textcat.teach` to add some more examples from the data you didn’t annotate before.
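
On the command line that looks something like this; the dataset name and output path are placeholders, and the exact options can differ by version, so check `prodigy textcat.batch-train --help`:

```
prodigy textcat.batch-train music_dataset en_core_web_lg --output /tmp/textcat-model
prodigy textcat.teach music_dataset /tmp/textcat-model my_data.jsonl --label "Rap, Music, Music & Arts"
```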

After you’ve done this for a number of categories, you can merge the datasets together, so that you can train a single text classification model that predicts all the categories you’re interested in.
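
If your Prodigy version includes the `db-merge` command, merging can be as simple as the following, with placeholder dataset names:

```
prodigy db-merge music_dataset,food_dataset,health_dataset all_categories
```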