I purchased Prodigy this morning and have excitedly watched most of the videos, read the docs, and combed through related support posts. I haven’t been so thrilled about a new framework in a while!
I’ve started to structure what I think our pipeline will be, and I wanted to see if I could get some feedback on my approach.
We have a large (40 million rows) corpus of data that looks something like this (where the first line is the headers):
index, Description
1, iPhoneX by Apple free $ off with coupon This new iPhone is the best one you could imagine and we have it here for free it includes 48mb of RAM and connects to the internet with 30mb/s speeds
2, Samsung Edge 34 <$NAME> by Samsung smart phone
3, IPHONE 8 BATTERY PACK official designed by APPLE
4, google PIXEL 3 2016 model includes headphones 32gb ram
5, smartphone with wi-fi capabilities, $400 in store only,
6, FREE FREE FREE
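As a rough sketch of how rows like these could be streamed into annotation tasks (a JSON object per row with a "text" key, plus the row index under "meta"), something like the following might work. The function name rows_to_tasks is made up for illustration, and the sketch assumes the file really is unquoted, so it splits on the first comma only because descriptions such as row 5 contain commas themselves.

```python
import json
from typing import Iterable, Iterator


def rows_to_tasks(lines: Iterable[str]) -> Iterator[dict]:
    """Lazily yield one task dict per data row (hypothetical helper)."""
    it = iter(lines)
    next(it, None)  # skip the header row ("index, Description")
    for line in it:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        # Split on the first comma only: descriptions contain commas too.
        index, _, description = line.partition(",")
        yield {"text": description.strip(), "meta": {"index": index.strip()}}


sample = [
    "index, Description\n",
    "3, IPHONE 8 BATTERY PACK official designed by APPLE\n",
    "5, smartphone with wi-fi capabilities, $400 in store only,\n",
]
tasks = list(rows_to_tasks(sample))
# One JSON object per line, i.e. JSONL.
jsonl = "\n".join(json.dumps(task) for task in tasks)
```

Because the function is a generator, the 40 million rows never need to be held in memory at once.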
Our preliminary goal is to extract brand/company names (Apple); our secondary goal is to extract product names (iPhone X). Many lines, like 5 above, describe an ‘unbranded’ product, and we’d like to count them as unbranded. Others, like 6, are junk or impossible to identify.
For our given scenario, we know that there are approximately 3000 unique brands and 40,000 unique products.
We have a starting annotated dataset of 14,000 brand name variations and keywords (e.g. “APPLE”, “by Apple”, “Apple”, “iPhone”) mapped to the canonical brand name (“Apple, Inc.”). This is not a complete list, however, and we know there are more brands to be discovered.
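To make the variation-to-canonical mapping concrete, here is a minimal sketch of how such a lexicon might be used to tag descriptions with the brands we already know. The LEXICON entries, build_matcher and find_brands are all illustrative names and data, not our real dataset, and the case-insensitive whole-word regex is just one simple way to do the lookup.

```python
import re

# Hypothetical slice of the variation -> canonical brand mapping.
LEXICON = {
    "APPLE": "Apple, Inc.",
    "by Apple": "Apple, Inc.",
    "iPhone": "Apple, Inc.",
    "Samsung": "Samsung, Inc.",
}


def build_matcher(lexicon):
    """Compile one case-insensitive regex over all known variations."""
    # Longest variations first so "by Apple" is tried before "Apple".
    variations = sorted(lexicon, key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in variations)
    # Lookarounds keep matches on word boundaries ("iPhoneX" is not "iPhone").
    pattern = re.compile(r"(?<!\w)(" + alternation + r")(?!\w)", re.IGNORECASE)
    lookup = {v.lower(): canonical for v, canonical in lexicon.items()}
    return pattern, lookup


def find_brands(text, pattern, lookup):
    """Return the set of canonical brands whose variations occur in text."""
    return {lookup[m.group(1).lower()] for m in pattern.finditer(text)}


pattern, lookup = build_matcher(LEXICON)
branded = find_brands("IPHONE 8 BATTERY PACK official designed by APPLE", pattern, lookup)
unbranded = find_brands("smartphone with wi-fi capabilities, $400 in store only,", pattern, lookup)
```

An empty result only means that no known variation matched, so it cannot by itself distinguish an unbranded product from a brand we have not discovered yet.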
We have a starting lossy dataset of 40,000 or so product names, although these are not mapped to brand names and there are many near-duplicate entries that are not exact matches (I’m considering the helpful Python dedupe library, but realize there’s potential to use Prodigy here as well).
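To illustrate what I mean by near-duplicates, here is a small sketch that groups product names by string similarity using only the standard library's difflib. This is a stand-in for illustration and not the dedupe library's approach; the sample names, the group_near_duplicates helper and the 0.85 threshold are all assumptions.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def group_near_duplicates(names, threshold=0.85):
    """Greedily assign each name to the first group whose representative
    (the group's first member) is similar enough, else start a new group."""
    groups = []
    for name in names:
        key = normalize(name)
        for group in groups:
            ratio = SequenceMatcher(None, key, normalize(group[0])).ratio()
            if ratio >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


names = ["iPhone X", "iphone  x", "IPhone X.", "Pixel 3", "Galaxy Edge"]
groups = group_near_duplicates(names)
```

The comparison is quadratic in the worst case, so for 40,000 names some form of blocking (for example, only comparing names that share a first token) would probably be needed.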
To start with, I am thinking roughly 3 phases:
- annotation task to discover brands that we don’t know about and train a ‘brand finder’ model
- annotation task to further build upon our “lexicon” of available brand names and identify their variations (again, “APPLE Inc.” being a keyword that points to the canonical “Apple, Inc.”)
- subset the larger dataset into smaller groupings by brand name and repeat steps 1 and 2, but for product names (e.g. “iPhone” or “Pixel”)
I have thought of bootstrapping Phase 1 with our existing annotated keyword/brand name pairs, using a dataset like so:
{
"pattern": [{"orth": "APPLE Inc."}],
"label": "BRAND"
}
{
"pattern": [{"orth": "SAMSUNG"}],
"label": "BRAND"
}
{
"pattern": [{"orth": "Samsung"}],
"label": "BRAND"
}
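Since we have 14,000 variations, I would generate these entries rather than write them by hand. Below is a minimal sketch of that idea. Two caveats: in token-based match patterns each dict describes a single token, so a multi-word variation like "APPLE Inc." needs one dict per token, and the whitespace split used here is only an approximation of how spaCy would actually tokenize the text. Using the "lower" attribute instead of "orth" is my own choice to collapse case variants; variation_to_pattern is a hypothetical helper.

```python
import json


def variation_to_pattern(variation, label="BRAND"):
    """Build one token-based pattern entry from a known brand variation.

    Assumption: splitting on whitespace approximates tokenization; the
    real tokenizer may split punctuation differently.
    """
    tokens = [{"lower": token.lower()} for token in variation.split()]
    return {"label": label, "pattern": tokens}


variations = ["APPLE Inc.", "SAMSUNG", "Samsung"]

seen = set()
lines = []
for variation in variations:
    line = json.dumps(variation_to_pattern(variation))
    if line not in seen:  # "SAMSUNG" and "Samsung" collapse to one entry
        seen.add(line)
        lines.append(line)

# One JSON object per line, ready to be written to a .jsonl file.
patterns_jsonl = "\n".join(lines)
```

Matching on "lower" means the separate SAMSUNG and Samsung entries become a single pattern, at the cost of no longer distinguishing casing.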
And then for Phase 2, like so:
{
"text": "SAMSUNG",
"label": "Samsung, Inc.",
"spans": [{"start": 0, "end": 7, "label": "BRAND"}],
"answer": "accept"
}
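Records in this shape could be generated directly from the keyword/canonical pairs. The sketch below assumes the span always covers the whole keyword, so the end offset is simply the length of the text; make_phase2_example is a hypothetical helper name, and whether "label" is the right place for the canonical name is part of what I am asking.

```python
def make_phase2_example(keyword, canonical, span_label="BRAND"):
    """Build one pre-accepted record mapping a keyword to its canonical brand."""
    return {
        "text": keyword,
        "label": canonical,
        # The span covers the full keyword: character offsets 0..len(keyword).
        "spans": [{"start": 0, "end": len(keyword), "label": span_label}],
        "answer": "accept",
    }


pairs = [("SAMSUNG", "Samsung, Inc."), ("by Apple", "Apple, Inc.")]
examples = [make_phase2_example(keyword, canonical) for keyword, canonical in pairs]
```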
I’m not sure if my approach is the best, just wrapping my head around the possibilities and looking for more expert opinions or alternative approaches.
Additionally, it seems like using an NER model for step 1 (and 2??) and using ORG instead of creating our own construct, BRAND, may be advantageous, but I’m not sure.
Thanks for any replies!