Hi there, I’m interested in using Prodigy for entity resolution on a table of 200k rows of product details from 5 different sources that includes many duplicates. The dataset is quite similar to this example dataset here.
There’s a neat article that describes an approach to combining Prodigy with the dedupe package. I’ve also gone through this tutorial which explains how to use dedupe directly.
However, I’m interested in a Prodigy + spaCy-only approach. One simplistic idea would be to combine the relevant columns into one string (in my case: brand name, product name, product description, price, category and subcategory), run SimilarityMatcher between each entity and every other entity, save the relevant matches, and then create a custom Prodigy recipe that shows the pairs with the highest similarity scores and asks me to confirm whether they are the same entity or not. This might be quite slow, and it’s probably not the most intelligent approach either… can anyone think of a more direct approach?
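To make that concrete, here’s a rough sketch of the brute-force version of that idea, using spaCy’s built-in `Doc.similarity` (from a model with word vectors) rather than a dedicated matcher. The file name, column names and threshold are just placeholders for my data:

```python
# Rough sketch of the brute-force similarity idea. File/column names and the
# en_core_web_md model are only examples; en_core_web_md ships word vectors,
# which .similarity() needs to give meaningful scores.
import itertools
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_md")

df = pd.read_csv("products.csv")  # placeholder input file
columns = ["brand_name", "product_name", "product_description",
           "price", "category", "subcategory"]

# Combine the relevant columns into one string per row
texts = df[columns].astype(str).apply(" ".join, axis=1).tolist()
docs = list(nlp.pipe(texts))

# Compare every record with every other record -- O(n^2), so this will be
# very slow for 200k rows; fine for a small sample to test the idea.
candidates = []
for (i, doc_a), (j, doc_b) in itertools.combinations(enumerate(docs), 2):
    score = doc_a.similarity(doc_b)
    if score > 0.9:  # arbitrary threshold
        candidates.append({
            "text": f"{doc_a.text}\n---\n{doc_b.text}",
            "meta": {"row_a": i, "row_b": j, "score": float(score)},
        })

# `candidates` could then be written out as JSONL and fed into a custom
# Prodigy recipe that asks accept/reject for each pair.
```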
Hope everything’s been well at spaCy IRL … I hope I can attend the next one!
EDIT: This gist looks like a pretty good place to start with integrating the best of dedupe and the best of Prodigy; I’ll try to implement this for our scenario over the next month.
EDIT EDIT: This recipe from prodigy-contrib is pretty close to what we’re after!
Hi! I honestly think that the dedupe approach sounds the most promising – it's functionality that spaCy doesn't natively have, so it does make sense to use a third-party library for that.
The similarity approach could work, but it does seem more like a hack. And if you go for that approach, you'd probably want to train word vectors on your raw text, which introduces another step. And it's really difficult to say whether it's going to work or not.
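Just to make that extra step concrete, here’s a rough sketch of training custom vectors on the raw product texts with gensim and converting them for spaCy v3. The file names are placeholders:

```python
# Sketch only: train word2vec vectors on the raw product texts with gensim,
# then convert them into a spaCy pipeline. File names are placeholders.
from gensim.models import Word2Vec

with open("product_texts.txt", encoding="utf8") as f:
    sentences = [line.lower().split() for line in f]

model = Word2Vec(sentences=sentences, vector_size=100, window=5,
                 min_count=2, workers=4)
model.wv.save_word2vec_format("custom_vectors.txt", binary=False)

# Then, on the command line (spaCy v3):
#   python -m spacy init vectors en custom_vectors.txt ./custom_vectors_model
# and load the result with spacy.load("./custom_vectors_model").
```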
I know this is an old thread, but I wanted to clarify one thing: yes, you can use Prodigy for entity resolution. And no, it’s not a hack.
Here is a high-level recipe:
1. Serialize your records into "sentences" for BERT, and optionally do text preprocessing with spaCy so you end up with "sentences" that are easier for a BERT model to digest.
2. (Outside of spaCy/Prodigy) use SBERT and KNN/ANN for candidate pair selection (sketched after this list).
3. Use a pre-trained BERT and fine-tune it with examples of matches and non-matches. Note that there is the spacy-transformers integration (see the training-data sketch below).
4. Configure your Prodigy front-end to make the labeling a smooth experience (see the recipe sketch below).
5. (Outside spaCy/Prodigy) apply one of the common graph clustering approaches to resolve conflicts and build cross-reference tables out of the pair-wise predictions from your spacy-transformers model (see the clustering sketch below).
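For step 2, a minimal sketch of candidate pair selection with sentence-transformers and scikit-learn. The model name, field names, k and the distance cut-off are illustrative assumptions, not part of any spaCy/Prodigy API:

```python
# Sketch for step 2: serialize records into "sentences", embed them with SBERT,
# and pick nearest-neighbour candidate pairs. Model name, fields, k and the
# threshold are examples only.
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

df = pd.read_csv("products.csv")  # placeholder input

def serialize(row):
    # Step 1: turn a record into a BERT-friendly "sentence"
    return (f"brand: {row['brand_name']} | name: {row['product_name']} | "
            f"description: {row['product_description']} | price: {row['price']}")

sentences = df.apply(serialize, axis=1).tolist()

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences, batch_size=64, show_progress_bar=True)

# k-nearest neighbours as candidate pairs (an ANN library such as faiss or
# hnswlib would scale better for 200k rows)
knn = NearestNeighbors(n_neighbors=6, metric="cosine").fit(embeddings)
distances, indices = knn.kneighbors(embeddings)

pairs = set()
for i, (dists, neighbours) in enumerate(zip(distances, indices)):
    for dist, j in zip(dists, neighbours):
        if i != j and dist < 0.3:  # arbitrary cosine-distance cut-off
            pairs.add(tuple(sorted((i, int(j)))))
```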
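For step 3, one way to use the spacy-transformers integration is to frame each candidate pair as a single text and train a textcat component on MATCH / NO_MATCH labels. A rough sketch of preparing the training data (the labels, the "[SEP]" framing and the file names are my assumptions; the transformer-backed config itself would be generated with `spacy init config`):

```python
# Sketch for step 3: turn labelled pairs into spaCy training data for a
# binary MATCH / NO_MATCH textcat. Labels and paths are placeholders.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()

# labelled_pairs: [(sentence_a, sentence_b, is_match), ...] from your annotations
labelled_pairs = [
    ("brand: Acme | name: Super Widget 2000",
     "brand: ACME | name: SuperWidget 2000", True),
    ("brand: Acme | name: Super Widget 2000",
     "brand: Foo | name: Gizmo", False),
]

for sent_a, sent_b, is_match in labelled_pairs:
    doc = nlp.make_doc(f"{sent_a} [SEP] {sent_b}")
    doc.cats = {"MATCH": float(is_match), "NO_MATCH": float(not is_match)}
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")

# A transformer-backed textcat config can then be created and trained with, e.g.:
#   python -m spacy init config config.cfg --lang en --pipeline textcat --optimize accuracy --gpu
#   python -m spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy
```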
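For step 4, a bare-bones custom recipe sketch that streams the candidate pairs into Prodigy’s classification interface. The recipe name, label and JSONL path are placeholders, and this is just one way to set up the front-end:

```python
# Sketch for step 4: a minimal custom recipe that shows candidate pairs as
# binary accept/reject tasks. Names and paths are placeholders.
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "products.dedupe",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("JSONL file with candidate pairs", "positional", None, str),
)
def products_dedupe(dataset, source):
    # each line: {"text": "record A\n---\nrecord B", "meta": {...}}
    stream = JSONL(source)

    def add_label(examples):
        for eg in examples:
            eg["label"] = "SAME_PRODUCT"
            yield eg

    return {
        "dataset": dataset,
        "stream": add_label(stream),
        "view_id": "classification",
    }

# Run with something like:
#   prodigy products.dedupe product_pairs candidate_pairs.jsonl -F recipe.py
```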
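And for step 5, a simple example of resolving the pair-wise predictions into clusters with connected components in networkx (fancier graph clustering, e.g. correlation clustering, would handle conflicting edges better):

```python
# Sketch for step 5: turn pair-wise MATCH predictions into entity clusters
# via connected components; each cluster gets one canonical id.
import networkx as nx

# predicted_matches: [(row_id_a, row_id_b), ...] where the model said MATCH
predicted_matches = [(0, 1), (1, 2), (5, 7)]

graph = nx.Graph()
graph.add_edges_from(predicted_matches)

cross_reference = {}  # row id -> cluster id
for cluster_id, component in enumerate(nx.connected_components(graph)):
    for row_id in component:
        cross_reference[row_id] = cluster_id

print(cross_reference)  # e.g. {0: 0, 1: 0, 2: 0, 5: 1, 7: 1}
```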
Best,
DedupeDude
p.s.: using BERT and other LLMs for entity resolution has been extensively studied in the entity resolution literature. It typically beats "classic" approaches (e.g. dedupe) in terms of precision and recall by a big margin, especially in domains where the records have a free-text-ish nature, for instance, deduplicating product records from product descriptions.