Determining the best annotation pipeline for our scenario

I purchased Prodigy this morning and have excitedly watched most of the videos, read the docs, and combed through related support posts. I haven’t been so thrilled about a new framework in a while!

I’ve started to structure what I think our pipeline will be, and I wanted to see if I could get some feedback on my approach.

We have a large (40 million rows) corpus of data that looks something like this (where the first line is the headers):

index, Description
1, iPhoneX by Apple free $ off with coupon This new iPhone is the best one you could imagine and we have it here for free it includes 48mb of RAM and connects to the internet with 30mb/s speeds
2, Samsung Edge 34 <$NAME> by Samsung smart phone
3, IPHONE 8 BATTERY PACK official designed by APPLE
4, google PIXEL 3 2016 model includes headphones 32gb ram
5, smartphone with wi-fi capabilities, $400 in store only,
6, FREE FREE FREE

Our preliminary goal is to extract brand/company names (Apple), our secondary goal is to extract product names (iPhone X). There are many lines like 5 above which are an ‘unbranded’ product and we’d like to count them as unbranded, and then some lines like 6 which are junk / impossible to identify.

For our given scenario, we know that there are approximately 3000 unique brands and 40,000 unique products.

We have a starting annotated dataset of 14,000 brand name variations and keywords (i.e. APPLE, by Apple, Apple, iPhone) mapped to the canonical brand name (Apple, Inc.). This is not a complete list, however, and we know there are more brands to be discovered.

We have a starting lossy dataset of 40,000 or so product names, although these are not mapped to brand names and there are many not-exact-match duplicates (considering using the helpful python dedupe library, but realize there’s potential to use prodigy here as well).
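As a minimal sketch of the near-duplicate problem (the file name and column are hypothetical), grouping product names by a normalised key catches the near-exact duplicates; anything fuzzier would need something like the dedupe library or a review pass in Prodigy:

import csv
import re
from collections import defaultdict

def normalise(name):
    # Lowercase, strip punctuation, collapse whitespace to build a grouping key
    return " ".join(re.sub(r"[^\w\s]", " ", name.lower()).split())

# product_names.csv with a "product_name" column is a hypothetical stand-in
groups = defaultdict(set)
with open("product_names.csv", newline="", encoding="utf8") as f:
    for row in csv.DictReader(f, skipinitialspace=True):
        groups[normalise(row["product_name"])].add(row["product_name"].strip())

# Keys with more than one surface form are near-exact duplicates worth reviewing
for key, variants in groups.items():
    if len(variants) > 1:
        print(key, "->", sorted(variants))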


To start with, I am thinking roughly 3 phases:

  1. annotation task to discover brands that we don’t know about and train a ‘brand finder’ model
  2. annotation task to further build upon our “lexicon” of available brand names and identify their variations (again APPLE Inc. being a keyword that points to the canonical Apple, Inc.)
  3. subset the larger dataset into smaller groupings by brand name and repeat steps 1. and 2. but for product names (i.e. iPhone or Pixel)

I have thought of bootstrapping phase 1 with our existing annotated keyword/brand name pairs, with a dataset like so:

{
    "pattern": [{"orth": "APPLE Inc."}],
    "label": "BRAND"
}
{
    "pattern": [{"orth": "SAMSUNG"}],
    "label": "BRAND"
}
{
    "pattern": [{"orth": "Samsung"}],
    "label": "BRAND"
}
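As a rough sketch, the existing 14,000-entry lexicon could be converted into such a patterns file programmatically (the input file name and its columns are hypothetical; using "lower" instead of "orth" makes matches case-insensitive, and whitespace splitting only approximates spaCy's tokenization, so multi-token entries should be spot-checked):

import csv
import json

# brand_lexicon.csv is a hypothetical file with "variation" and "canonical" columns,
# e.g. variation="APPLE", canonical="Apple, Inc."
seen = set()
with open("brand_lexicon.csv", newline="", encoding="utf8") as f_in, \
        open("brand_patterns.jsonl", "w", encoding="utf8") as f_out:
    for row in csv.DictReader(f_in, skipinitialspace=True):
        # One token pattern per whitespace-separated piece of the surface form
        pattern = [{"lower": tok.lower()} for tok in row["variation"].split()]
        key = json.dumps(pattern)
        if key not in seen:
            seen.add(key)
            f_out.write(json.dumps({"label": "BRAND", "pattern": pattern}) + "\n")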

And then for Phase 2, like so:

{
    "text": "SAMSUNG",
    "label": "Samsung, Inc.",
    "spans": [{"start": 0, "end": 7, "label": "BRAND"}],
    "answer": "accept"
}

I’m not sure if my approach is the best, just wrapping my head around the possibilities and looking for more expert opinions or alternative approaches.

Additionally, it seems like using an NER model for step 1 (and maybe step 2?) with the built-in ORG label, instead of creating our own BRAND label, may be advantageous, but I'm not sure.

Thanks for any replies!

Hey,

Glad to hear the tool resonated with you! I think your project sounds like it’s off to a good start with the types of data you have, and what you want to do should be very achievable. So I hope things can progress well.

I think you should spend some time doing analysis before jumping in to try to train a particular model. For the sake of discussion, let’s name your different resources like this:

  • descriptions.csv: The corpus you gave a snippet of, with 40 million rows.
  • knowledge_base: The table of brands and product names. This will have the canonical names of the companies and brands, maybe some links between entities, etc.
  • company_names: A list of phrases extracted from knowledge_base that should usually refer to a company, e.g. Apple, Amazon, Apple, Inc, etc.
  • brand_names: A list of phrases extracted from knowledge_base that should usually refer to a brand name.
  • patterns.jsonl: A file with patterns to find potential entities and/or products. These can be more abstract than company_names or brand_names, e.g. you might exclude the most ambiguous names, include extra features like part-of-speech restrictions, etc.
  • auxiliary_text: Some collection of arbitrary other text, to use in addition to your target text. Text from Reddit might be useful for this, as the Reddit corpus is large and easy to work with.

Here are some examples of questions you'll want a rough answer to before you start making more detailed plans:

  • How many entries in company_names and brand_names actually occur in descriptions.csv? More precisely, you want to know something about P(name_text) for the names in your lists.
  • Which entries in company_names and brand_names are particularly ambiguous? For instance, in a previous project where we were linking text to Wikipedia, we first looked for phrases that exactly matched a page title. This is usually good, but then you have an entity page for the novel It. That’s not such a great phrase to be matching against. More precisely, you want to know which phrases have low P(entity | mention).
  • Are there many company or product mentions in your descriptions.csv that aren’t in knowledge_base? How should these be handled? Even if for the purposes of your downstream logic it’s not interesting to tag these missing entities (because you can’t action the taggings), for the purpose of your pipeline they need to be handled somehow. If there are few of these cases, it’s no problem.
  • Have another look at the process that created descriptions.csv, and check whether you’re discarding any features that could be useful contextual clues. Time/date stamps, source information, author tags etc can all be useful features in preparing training data. You might not want to train your production model on these features, as they might not generalise to the run-time text. But while you’re preparing your project, they can sometimes help you bootstrap the annotation process, for instance if the same entity is mentioned many times on the same day, you may want to annotate those texts together.

I hope that gives you some ideas for first steps. I think you should be able to develop some initial rule-based process, and then use the rules to suggest annotations on the text. This can be a very efficient way to get initial annotations completed.
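For a rough sketch of that kind of coverage check and rule-based pass (the file names, column handling, and sample size are assumptions; the matcher.add() call uses the spaCy v2 API current at the time of this thread):

import csv
from collections import Counter

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])

# company_names.txt: one phrase per line (hypothetical file)
with open("company_names.txt", encoding="utf8") as f:
    names = [line.strip() for line in f if line.strip()]

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("COMPANY", None, *[nlp.make_doc(name) for name in names])  # spaCy v2-style add()

hits = Counter()
with open("descriptions.csv", newline="", encoding="utf8") as f:
    for i, row in enumerate(csv.DictReader(f, skipinitialspace=True)):
        if i >= 100_000:  # a sample is enough for a first estimate
            break
        doc = nlp.make_doc(row["Description"])
        for match_id, start, end in matcher(doc):
            hits[doc[start:end].text.lower()] += 1

print(f"{len(hits)} of {len(set(n.lower() for n in names))} names seen in the sample")
# Very frequent, generic-looking hits are candidates for the "ambiguous phrase" list
print(hits.most_common(20))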

Another suggestion: you probably want to avoid trying to have a model discriminate between company and brand. Mentions that reference the two simultaneously are very common, e.g. Uber the organisation is not strictly the same as the Uber smart-phone application… but many mentions in text will not draw this distinction. Similarly, even the smart-phone application is not just one thing — the Android product is different from the iPhone product. Reality has a surprising amount of detail, and language blurs these irrelevant distinctions.

Thanks greatly for your detailed reply -- in particular, the terminology you've provided for structuring the project will be super helpful in designing and documenting the pipeline intelligently.

I think this is exactly my plan -- I'm not clear on where to start from a code perspective, although I'm only one day into the documentation so I likely have more experimentation ahead before that's clear to me.

If you just want to get something on the screen quickly, a good place to start could be the ner.match recipe. It takes your data and a patterns.jsonl file with examples of the entities you're looking for and will show you all pattern matches so you can accept or reject them. It should be very quick to collect a few hundred to a thousand annotations like this, and it'll give you a good feeling for what's in your data and how effective your existing rules are.

Once you have a small dataset, you can start running some experiments with ner.batch-train and see if the results look promising. They might not be that conclusive, but if there's a general "up" trend, it's often a good indicator that the model is able to learn the distinction from the examples you've labelled. Once you have a model artifact, you can start experimenting with it some more – for example, load it into spaCy for some playtesting, or see if you can improve it with an active learning-powered recipe like ner.teach that puts the model in the loop and suggests entities based on what the model is most uncertain about.
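For the playtesting step, a minimal sketch might look like this (the model path is a placeholder for whatever you pass to --output, and the example text is taken from the corpus snippet above):

import spacy

# Placeholder path: the directory exported by ner.batch-train --output
nlp = spacy.load("/path/to/brand-model")

doc = nlp("IPHONE 8 BATTERY PACK official designed by APPLE")
for ent in doc.ents:
    print(ent.text, ent.label_)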


Trained my first model today :partying_face:

First, I bootstrapped a model via prodigy ner.teach brand_tagging en_core_web_sm data_1018.jsonl --label BRAND --patterns brand_patterns.jsonl, where brand_patterns.jsonl looks something like:

{"label": "BRAND", "pattern": [{"orth": "APPLE"}]}
{"label": "BRAND", "pattern": [{"orth": "Apple"}, {"orth": "Inc."}]}
{"label": "BRAND", "pattern": [{"orth": "AppleInc."}]}

I annotated for about an hour.

Then, I tested my model to see how it was doing, with prodigy ner.batch-train brand_tagging en_core_web_sm --output brand_tag_alpha_2019_04_25 --label BRAND, resulting in


         
#          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY
01         44.355     66         72         1265       0          0.478
02         30.801     84         54         1488       0          0.609
03         26.514     91         47         1549       0          0.659
04         24.054     96         42         1694       0          0.696
05         20.421     102        36         1793       0          0.739
06         18.976     106        32         1758       0          0.768
07         17.107     109        29         1811       0          0.790
08         15.809     109        29         1679       0          0.790
09         14.831     114        24         1599       0          0.826
10         14.087     115        23         1590       0          0.833

which, all-told, looks pretty good to me!

Now, I'm wondering about augmenting the annotation data I've just created with "annotations" generated by my lookup table.

Let me explain: by simply iterating through my data and doing string matching via my 17,000 example lookup table, I end up with decent coverage (about 42% of total data gets tagged, and I know it's 95%+ accurate).

Is it a good idea to munge this "annotated" data into shape like so (per your docs):

Annotation task JSONL
{
    "text": "Apple",
    "label": "BRAND",
    "spans": [{"start": 0, "end": 5, "label": "BRAND"}],
    "answer": "accept"
}

and add it to my manual annotations before training again / moving on to other parts of the pipeline? This would greatly increase the size of my "annotated" data by 500k or more examples, but I'm not sure this is the right approach to training our model. What do you think?
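For what it's worth, a rough sketch of that conversion might look like the following (the lookup table and file names are hypothetical). One caveat: plain substring matching can hit mid-word and produce offsets that don't line up with token boundaries, so a token-aware matcher such as spaCy's PhraseMatcher is safer in practice:

import json

# Hypothetical slice of the lookup table: surface form -> canonical brand
lookup = {"APPLE": "Apple, Inc.", "SAMSUNG": "Samsung, Inc."}

def make_task(text):
    """Return a Prodigy-style NER task with a span for every exact lookup hit."""
    spans = []
    for surface in lookup:
        start = text.find(surface)
        while start != -1:
            spans.append({"start": start, "end": start + len(surface), "label": "BRAND"})
            start = text.find(surface, start + len(surface))
    if not spans:
        return None
    return {"text": text, "spans": spans, "answer": "accept"}

# descriptions.txt: one description per line (hypothetical export of the Description column)
with open("descriptions.txt", encoding="utf8") as f_in, \
        open("lookup_annotations.jsonl", "w", encoding="utf8") as f_out:
    for line in f_in:
        task = make_task(line.strip())
        if task:
            f_out.write(json.dumps(task) + "\n")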

Have another look at the process that created descriptions.csv, and check whether you’re discarding any features that could be useful contextual clues. Time/date stamps, source information, author tags etc can all be useful features in preparing training data.

I'm very curious about this @honnibal! What is the recommended way for preparing additional features like this and feeding them to Prodigy/spaCy? I have many other columns besides the 'Description' that could be useful features, and I assume there's a smarter way to prepare them for the model than simply tacking them on to the end of each row of text?

Once you have a model artifact, you can start experimenting with it some more – for example, load it into spaCy for some playtesting, or see if you can improve it with an active learning-powered recipe like ner.teach that puts the model in the loop and suggests entities based on what the model is most uncertain about.

It's happening! Thank you so much @ines!!

  • auxiliary_text: Some collection of arbitrary other text, to use in addition to your target text. Text from Reddit might be useful for this, as the Reddit corpus is large and easy to work with.

How might this be used, @honnibal? It makes sense in general that if I fed some additional domain-specific text to the model it would become better at making sense of it, but I'm not sure where in the Prodigy process to do so, or how?

That could definitely be a good thing to try. You might want to do a quick annotation pass over it, though, to see if you can get that 95% up to 100% quickly. You might find a good approach for this is to sort the questions by confidence, so you can just spam "Accept" through the first ones. When I do this I really rely on the undo -- by the time it registers that something was wrong, I've already clicked like 5 forward... but you can just undo those and go back and fix the mistake.

spaCy's current textcat model is pretty weak at incorporating non-text auxiliary features. The textcat model could be adapted for it, as it's an ensemble already, but it's not implemented currently. I would suggest trying a non-spaCy solution such as the linear model in scikit-learn, or perhaps something like XGBoost. For a quick test you can add the extra feature tokens to the input as you suggest, but it's not a very satisfying solution.
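A minimal sketch of that kind of non-spaCy setup, assuming a pandas DataFrame with the real Description column plus hypothetical "source" and "label" columns:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame: the Description text, an extra "source" column, and a label to predict
df = pd.read_csv("descriptions.csv", skipinitialspace=True)

features = ColumnTransformer([
    ("text", TfidfVectorizer(ngram_range=(1, 2)), "Description"),    # word/bigram features from the text
    ("source", OneHotEncoder(handle_unknown="ignore"), ["source"]),  # categorical auxiliary feature
])
model = Pipeline([("features", features), ("clf", LogisticRegression(max_iter=1000))])
model.fit(df[["Description", "source"]], df["label"])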

Well, there are a few ways. One thing is just word vectors --- there, more text is better, and it doesn't matter that much whether it's exactly the same domain. The new version will also let you use the spacy pretrain command, which lets you pretrain the CNN, not just the word vectors.

Another way that raw text can be useful is in the sort of data augmentation process you were doing to get more positive examples. If you can come up with a decent process that gets you sentences likely to contain no entities, that can be useful to add to your data. You do need negative examples too, after all. As an example, you might get some texts about, like... apples, or the Amazon rainforest, or something -- just to teach the model that there are senses of these words that aren't the products or companies.
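A rough sketch of harvesting such negative candidates from auxiliary text (the file names, word list, and filter heuristic are all illustrative, and how empty spans are interpreted at training time depends on your settings, so treat the output as input for a quick review pass rather than finished training data):

import json
import re

# Illustrative word list: surface forms whose non-entity senses the model should see
AMBIGUOUS = re.compile(r"\b(apples?|amazon|pixel)\b")

def looks_like_negative(sentence):
    # Crude filter: mentions an ambiguous word but doesn't look like a product listing
    return bool(AMBIGUOUS.search(sentence.lower())) and "$" not in sentence

# auxiliary_text.txt: one sentence per line (hypothetical)
with open("auxiliary_text.txt", encoding="utf8") as f_in, \
        open("negative_candidates.jsonl", "w", encoding="utf8") as f_out:
    for line in f_in:
        sent = line.strip()
        if sent and looks_like_negative(sent):
            # Empty spans = a candidate "no entities here" example, pending review
            f_out.write(json.dumps({"text": sent, "spans": []}) + "\n")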
