Hi there, I’m interested in using Prodigy for entity resolution on a table of 200k rows of product details from 5 different sources, which includes many duplicates. The dataset is quite similar to this example data set here.
There’s a neat article that describes an approach to combining Prodigy with the dedupe package. I’ve also gone through this tutorial, which explains how to use dedupe directly.
However, I’m interested in a Prodigy + spaCy only approach. One simplistic idea would be to combine the relevant columns into one string (in my case: brand name, product name, product description, price, category, subcategory), use SimilarityMatcher to compare each entity with every other entity, save the relevant matches, and then create a custom Prodigy recipe that shows the highest-scoring pairs and asks me to confirm whether they are the same entity. With 200k rows that is roughly 20 billion pairwise comparisons, so this might be quite slow, and it’s probably not the most intelligent approach either… can anyone think of a more direct approach?
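To make the pairwise idea concrete, here is a minimal sketch, not a tested recipe. It assumes the rows are plain dicts with hypothetical column names, uses the standard library’s difflib.SequenceMatcher as a stand-in for a spaCy similarity score (something like nlp(a).similarity(nlp(b)) could be swapped in), and only compares rows that share a brand, which is a simple blocking step to avoid comparing every row with every other row. The output is a list of dicts with "text" and "meta" keys, the general shape a custom Prodigy recipe could stream for accept/reject annotation.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical column names; adjust to match the real table.
FIELDS = ("brand", "name", "description", "price", "category", "subcategory")


def combine(row):
    """Join the relevant columns into one lowercase string."""
    return " ".join(str(row.get(field, "")) for field in FIELDS).lower().strip()


def candidate_pairs(rows, block_key="brand", threshold=0.8):
    """Yield (score, row_a, row_b) for similar rows that share a block key."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[str(row.get(block_key, "")).lower()].append(row)
    for block in blocks.values():
        # Pairwise comparison only inside a block, not across the whole table.
        for a, b in combinations(block, 2):
            # Stand-in similarity; a spaCy-based score could replace this line.
            score = SequenceMatcher(None, combine(a), combine(b)).ratio()
            if score >= threshold:
                yield score, a, b


def to_tasks(rows, **kwargs):
    """Build annotation tasks, highest similarity first."""
    pairs = sorted(candidate_pairs(rows, **kwargs), key=lambda p: p[0], reverse=True)
    return [
        {
            "text": f"{combine(a)}\n---\n{combine(b)}",
            "meta": {"score": round(score, 3)},
        }
        for score, a, b in pairs
    ]
```

Blocking on an exact brand match will miss duplicates whose brand is spelled differently across sources, so the block key is a trade-off between speed and recall and probably needs tuning for real data.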
Hope everything’s been well at spaCy IRL … I hope I can attend the next one!
EDIT: This gist looks like a pretty good place to start with integrating the best of dedupe and the best of Prodigy. I will try to implement this over the next month for our scenario.
EDIT EDIT: This recipe from prodigy-contrib is pretty close to what we’re after!