Hi all, I'm a Prodigy n00b and a proud owner and supporter of Prodigy, with a little background in NLP.
My first business problem is company name matching. Let's assume we have 100,000 companies from two separate datasets, from different industries, so canonicalisation is unlikely to work.
Is Prodigy the tool for this, or should I look at other methods? My concern is that, through the lens of entity matching, this would require a very large number of classes (one per company), and that may not be practical.
Hi! So just to make sure I understand the use case correctly: Do you already have labelled examples or word lists for the company names you're looking for, or do you also need to find and detect what's a company name, and then resolve them to a distinct ID? For instance, detect that apple and Apple, Inc. in certain contexts are both mentions of the company Apple, Inc.?
It sounds like you have several problems to solve here that might make sense to tackle as separate machine learning tasks:
Finding all company names. This is a more generic named entity recognition task and you can probably take advantage of existing pretrained models or match patterns to help you with the annotation. Workflows you can use in Prodigy are ner.manual (with --patterns to pre-highlight terms from a list for you), or ner.correct with a pretrained model that highlights ORG entities for you so you only need to correct them. At the end of it, you can train a model that should hopefully be able to find all mentions of company names with decent accuracy.
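To make the --patterns idea a bit more concrete, here's a minimal sketch of how a list of known company names could be turned into match patterns, one JSON object per line. It assumes the token-based pattern format with a "label" and a list of token attribute dicts, and it splits names on whitespace, which won't always line up with the model's tokenizer (for example around commas or dots). The names and the make_patterns helper are made up for illustration.

```python
import json


def make_patterns(names, label="ORG"):
    """Build one case-insensitive token pattern per company name.

    Assumption: names are split on whitespace only, which is a rough
    stand-in for real tokenization.
    """
    patterns = []
    for name in names:
        tokens = [{"lower": tok.lower()} for tok in name.split()]
        if tokens:  # skip empty or whitespace-only names
            patterns.append({"label": label, "pattern": tokens})
    return patterns


# Hypothetical example names, not taken from any real dataset
company_names = ["Apple Inc", "Acme Widgets Ltd"]
patterns = make_patterns(company_names)

# One JSON object per line, ready to be saved as a .jsonl file
for entry in patterns:
    print(json.dumps(entry))
```

If you save the printed lines to a .jsonl file, that file is what you'd pass in via --patterns. Matching on lowercase tokens means that both apple inc and Apple Inc would be pre-highlighted.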
Named entity disambiguation and/or entity linking. Given a company name, you want to know the instance of the real-world object it maps to, for instance an ID in your knowledge base. As you already noted in your post, solving this as a sequence tagging problem with one label per company isn't feasible. You typically want to predict IDs for given entity spans predicted by your entity recognizer. You can also create training data for this in Prodigy, by going over the ORG entities predicted by the model and selecting the correct knowledge base entry from a set of candidates.
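As a rough illustration of the candidate selection step, here's a sketch that proposes knowledge base entries for a mention using plain string similarity from Python's standard library, and wraps them into a multiple-choice task. The knowledge base, the IDs, the helper names and the "NONE" option are all hypothetical, and the task layout (a "text" plus a list of "options" with "id" and "text") is an assumption about what a choice-style annotation interface expects, so double-check it against the docs. A real candidate generator would likely use something better than difflib.

```python
import difflib


def normalise(name):
    """Lowercase and drop commas and dots so 'Apple, Inc.' matches 'apple inc'."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())


def candidates(mention, kb, n=3, cutoff=0.5):
    """Return up to n knowledge base IDs whose names resemble the mention."""
    norm_to_ids = {}
    for kb_id, name in kb.items():
        norm_to_ids.setdefault(normalise(name), []).append(kb_id)
    close = difflib.get_close_matches(
        normalise(mention), list(norm_to_ids), n=n, cutoff=cutoff
    )
    # get_close_matches returns the best match first
    return [kb_id for key in close for kb_id in norm_to_ids[key]]


def make_choice_task(text, mention, kb):
    """Build one annotation task: the sentence plus candidate entries to pick from."""
    options = [{"id": kb_id, "text": kb[kb_id]} for kb_id in candidates(mention, kb)]
    # Always offer a way out if none of the candidates is correct
    options.append({"id": "NONE", "text": "None of these"})
    return {"text": text, "options": options}


# Hypothetical knowledge base: ID -> canonical company name
kb = {"C001": "Apple, Inc.", "C002": "Applied Materials", "C003": "Acme Widgets Ltd"}
task = make_choice_task("We bought shares in apple inc last year.", "apple inc", kb)
print(task)
```

The nice thing about framing it this way is that the annotator only ever chooses among a handful of candidates per mention, so the number of companies in the knowledge base doesn't turn into the number of labels.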