Entity resolution with Prodigy

Hi there, I’m interested in using Prodigy for entity resolution on a table of 200k rows of product details from 5 different sources, which includes many duplicates. The dataset is quite similar to this example data set here.

There’s a neat article that describes an approach to combining Prodigy with the dedupe package. I’ve also gone through this tutorial which explains how to use dedupe directly.

However, I’m interested in a Prodigy + spaCy only approach. One simplistic idea would be to combine the relevant columns into one string (in my case: brand name, product name, product description, price, category, subcategory), use SimilarityMatcher to compare each entity with every other entity, and save the relevant matches. A custom Prodigy recipe would then show the pairs with the highest similarity scores and ask the annotator to confirm whether they are the same entity. This might be quite slow, and it’s probably not the most intelligent approach… can anyone think of a more direct approach?
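
To make the idea concrete, here is a rough sketch of the pairwise step. It is only an illustration under assumptions: Python's built-in difflib.SequenceMatcher stands in for spaCy's vector similarity so the snippet runs without a model, and the field names, the 0.8 threshold, and the sample rows are all hypothetical. The output is a list of dicts with "text" and "meta" keys, which is the general shape a custom Prodigy recipe could stream for accept/reject annotation.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical column names; adjust to the real table.
FIELDS = ["brand", "name", "description", "price", "category", "subcategory"]


def combine(row):
    """Join the relevant columns into one normalized string."""
    return " | ".join(str(row.get(field, "")).strip().lower() for field in FIELDS)


def candidate_pairs(rows, threshold=0.8):
    """Score every pair of rows and return likely duplicates, best first.

    SequenceMatcher is a stand-in for a spaCy similarity score.
    """
    texts = [combine(row) for row in rows]
    scored = []
    for i, j in combinations(range(len(texts)), 2):
        score = SequenceMatcher(None, texts[i], texts[j]).ratio()
        if score >= threshold:
            scored.append((score, i, j))
    scored.sort(reverse=True)  # highest similarity first
    return [
        {
            "text": f"{texts[i]}\n{texts[j]}",
            "meta": {"score": round(score, 3), "rows": [i, j]},
        }
        for score, i, j in scored
    ]


# Made-up sample rows, purely for illustration.
rows = [
    {"brand": "Acme", "name": "Steel Water Bottle 500ml",
     "description": "Insulated stainless steel bottle", "price": "19.99",
     "category": "Kitchen", "subcategory": "Drinkware"},
    {"brand": "ACME", "name": "Steel Water Bottle 500 ml",
     "description": "Insulated stainless steel bottle", "price": "19.99",
     "category": "Kitchen", "subcategory": "Drinkware"},
    {"brand": "Globex", "name": "Wireless Mouse",
     "description": "Ergonomic optical mouse with USB receiver", "price": "24.50",
     "category": "Electronics", "subcategory": "Accessories"},
]

tasks = candidate_pairs(rows)
```

The loop over all pairs is what makes this slow: 200k rows give roughly 2 × 10^10 comparisons, so in practice the candidates would need to be narrowed down first (for example, by only comparing rows within the same category).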

Hope everything’s been well at spaCy IRL … I hope I can attend the next one!

EDIT: This gist looks like a pretty good place to start with integrating the best of dedupe and the best of Prodigy, I will try to implement this over the next month for our scenario.

EDIT EDIT: This recipe from prodigy-contrib is pretty close to what we’re after!

Hi! I honestly think that the dedupe approach sounds the most promising – it’s functionality that spaCy doesn’t natively have, so it does make sense to use a third-party library for that.

The similarity approach could work, but it does seem more like a hack. And if you go for that approach, you’d probably want to train word vectors on your raw text, which introduces another step. And it’s really difficult to say whether it’s going to work or not.
