NER Training for Corporate Names

I’m working on a project in which we ingest news article content from a variety of sources from the web. We want to apply NER to the plain-text to extract the names of companies found in that text. So I have begun using Prodigy + spaCy to train an entity recognizer.

From a previous NER effort, I have about 400+ documents marked up in a format that I can convert to something that can be bulk-loaded into Prodigy. I also have plenty of news content that can be pumped into the ner.teach recipe (and a team of people to help annotate.) I have a couple of questions about the best way to go about training this NER model.

First, is it better to bulk-load the 400 tagged documents first, then run ner.teach, or vice versa, or does it not matter either way?

Second, company names often show up in the news like “Acme Adventures International Ltd.”. I’ve noticed that spaCy will often tag just “Acme” or “Acme Adventures” as an organization, but not the full name of the company. In the ner.teach workflow my team can only accept or reject suggestions, and having watched the best-practices video Ines put on YouTube, it sounds like the right approach is to reject these instances when they occur? If so, what’s the best way to train the entity recognizer to tag the entire company name as an organization?

It seems there are two approaches: one is to bulk-load more news content and use ner.manual to mark up more corporate names. The other is to use patterns, but I’m not sure I can encapsulate the variety of corporate naming styles (Corp., Co., LLC, etc.) in patterns that are flexible enough without exploding into 1,000+ examples.

Any suggestions are greatly appreciated!

Just on this, we've had an (almost identical) use case in our work, and found that pre-baking a set of patterns from a gazetteer of company names was super useful for this. I've copied some code below:

import json

# articles is a list of text samples - we only used first paragraphs as these had more relevant content:
with open('text_data.jsonl', 'w') as f:
    f.write('\n'.join(json.dumps({'text': r}) for r in articles))

# we're using a MongoDB database; this ends up like [{'Company Name':'Blabla Ltd'}]
company_names = list(db.companies.find({},{'Company Name':1}))

# function to convert company names to patterns. We found that ORTH patterns were too restrictive for this context.
def build_co_pattern(s='', label='COMPANY'):
    p = [{"lower": a.lower()} for a in s.split(' ')]
    return json.dumps({'label': label, 'pattern': p})

# write the patterns to company_patterns.jsonl
with open('company_patterns.jsonl','w') as f:
    f.write('\n'.join(build_co_pattern(x['Company Name'], 'COMPANY') for x in company_names))

And then, we ran

prodigy dataset company_ner
prodigy ner.match company_ner en_core_web_lg text_data.jsonl --patterns company_patterns.jsonl --label COMPANY
We ran this for about 2000 paragraphs of text.

Then:

prodigy ner.batch-train company_ner en_core_web_lg text_data.jsonl --patterns company_patterns.jsonl --label COMPANY -o init_model
This produces a fairly good initial model, to which you can add further samples using ner.teach.

You could of course use the ORG label within the prebuilt models to bootstrap, but we went with a custom label because a) we had a plethora of company names that were easy enough to bootstrap in and b) it was easier to make the model recognise 'Inc.' this way (the model doesn't start with the preconceptions of the prebuilt models). In spaCy 2.1 you could also use a rule-based Matcher that matches one or more ORG tokens followed by a company suffix, i.e. Inc, LLC, etc.
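For instance, a rough sketch of that suffix idea with spaCy 2.1's Matcher (not something we actually ran - the example sentence and suffix list are just illustrative, and it relies on the pretrained NER having already labelled the preceding tokens as ORG):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)
# one or more tokens the pretrained NER tagged as ORG, followed by a company suffix
matcher.add("COMPANY", None, [
    {"ENT_TYPE": "ORG", "OP": "+"},
    {"LOWER": {"IN": ["inc", "inc.", "llc", "ltd", "ltd.", "corp", "corp."]}},
])

doc = nlp("Acme Adventures International Ltd. announced a merger today.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)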

Not sure if that helps but would be happy to give further clarity! (We're getting good results here on an almost identical use case).


Wow, this is great! Thanks for the reply.

I thought about dumping the companies we have in our system to create patterns like you did, but I’m wondering a) how much is too much (we have 5.5MM company records) and b) if you create patterns from a sub-set, will the model eventually be trained to recognize similar patterns for other company names?

Also, that’s a very interesting insight about tagging with COMPANY instead of ORGANIZATION. I thought about that too, but wasn’t sure if I was losing much by not starting with a pre-trained model. Based on your experience, it sounds like a custom tag is the way to go.

Nice, a 5.5MM company database? Sounds like a pretty helpful asset to have! We’re training with patterns in the low tens of thousands (we’re focussed geographically in this context) and it’s working fine. In principle there’s a detriment to the matcher’s speed as you add more patterns, so you might want to prioritise significant companies that are ‘making waves’ in the news; but I’d give it a go and see (I imagine you’re using SEC filings or similar for company names - in that case, there are likely to be a lot of shell companies). I imagine there may be a way of increasing performance with additional RAM/CPU if needed as well - you can always throw that at it and see.

One point on that - many companies are stupidly named (in the UK there’s a company called ‘Very’) - you may want to get rid of those as match patterns so you don’t spend your life on false positives.

As I understand it (and @ines/@honnibal would have to confirm - I’m just an amateur chipping in) - spaCy models shouldn’t ever just learn the match patterns themselves, but rather things like the context vectors/POS tags surrounding an entity; as such you don’t need to worry about the patterns you start with causing overfitting. That said, after your first ner.batch-train, you’ll have a model that thinks too many things are companies - but that’s fixed during the ner.teach phase.

Regarding the span issue, you might also want to consider this: in our experience, company names in news articles rarely include the ‘Ltd’ or ‘LLC’ - so you might want to make match patterns with optional suffixes (otherwise you’ll miss samples that don’t end with a suffix). You can train with ‘reject’ where there’s a suffix in the text and the match doesn’t capture it.
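For example, an optional-suffix pattern might look something like this (purely illustrative - note the uppercase "OP" key, which marks the suffix token as optional):

{"label": "COMPANY", "pattern": [{"lower": "acme"}, {"lower": "adventures"}, {"lower": "ltd", "OP": "?"}]}

With a pattern like that the matcher will suggest both the bare name and the name with the suffix, so you can accept the longer span and reject the shorter one whenever the suffix is present in the text.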

One final point - as news outlets come out with ‘data’ propositions, they’re doing lots of article tagging. You might be able to scrape the tags as well (if that’s what you’re doing, as I suspect) and use them to create synthetic datasets. Then you can pipe them in using db-in. Just make sure you create negative samples as well! (On a side note, I’m fond of synthetic data augmentation - if you’ve got a company database you can double your sample size by choosing company names at random to replace other company names with: if ‘Google’ is a company in ‘Google has released a new product’, then ‘Acme’ is a company in ‘Acme has released a new product’.)
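A minimal sketch of that swapping idea (the data shapes and names here are just illustrative, assuming annotations in Prodigy's "text" + "spans" format):

import random

def swap_company(example, company_names):
    # Replace each company span with a randomly chosen name and fix up the offsets
    text = example["text"]
    new_spans = []
    offset = 0
    for span in sorted(example.get("spans", []), key=lambda s: s["start"]):
        start, end = span["start"] + offset, span["end"] + offset
        replacement = random.choice(company_names)
        text = text[:start] + replacement + text[end:]
        new_spans.append({"start": start, "end": start + len(replacement), "label": span["label"]})
        offset += len(replacement) - (end - start)
    return {"text": text, "spans": new_spans}

example = {"text": "Google has released a new product.",
           "spans": [{"start": 0, "end": 6, "label": "COMPANY"}]}
print(swap_company(example, ["Acme Adventures Ltd"]))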

Wow, thanks for sharing your workflows and tips – I really enjoyed reading this thread and glad to hear things have been working well so far :smiley:

And yes, the idea is that things like company names usually occur in similar-ish contexts, so if we show the model some examples of it, it'll be able to generalise and predict other company names in unseen texts that weren't in the initial data. But in order to do this, we need examples in context. You could go through a bunch of texts and label all company names by hand, but this is often unnecessarily tedious. So with Prodigy and patterns, you can load in an existing dictionary and pre-select the spans so you only have to say yes or no, or correct the mistakes. Even if your patterns only cover 30% of the entities in the examples, that's still 30% less work for you! :tada:

If you already have a large dictionary, I think a good place to start could be to see how far a purely rule-based approach gets you – and what it misses and what's hard to cover with only rules. For example, if your documents actually talk about the official "Acme Adventures International Ltd.", that's something rules can easily cover. However, if people also refer to that company as "Adventures", that's where it gets tricky and where you'd benefit from a statistical model that can predict whether it's a company based on the context.

So, in more practical terms: You could start by labelling a representative sample of your data by hand so you have something you can evaluate on. In your first experiment, you'd only use matcher rules based on the 5m company names, see what accuracy you get and look at the entities the rules missed. Next, you can train a model on examples of companies in context and see how it compares.
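To make that concrete, here's a rough sketch of how you could score a purely rule-based baseline against a hand-labelled sample (gold_examples and company_names are placeholders for your own data; spans use Prodigy-style character offsets):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("COMPANY", None, *[nlp.make_doc(name) for name in company_names])

tp = fp = fn = 0
for eg in gold_examples:  # [{"text": ..., "spans": [{"start": ..., "end": ..., "label": "COMPANY"}]}]
    doc = nlp.make_doc(eg["text"])
    predicted = {(doc[start:end].start_char, doc[start:end].end_char)
                 for match_id, start, end in matcher(doc)}
    gold = {(span["start"], span["end"]) for span in eg["spans"]}
    tp += len(predicted & gold)
    fp += len(predicted - gold)
    fn += len(gold - predicted)

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print("Precision: {:.2f}  Recall: {:.2f}".format(precision, recall))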

You might also want to check out spaCy's new EntityRuler, which takes pattern files to label named entities and can be used in combination with a statistical named entity recognizer. If the entity ruler runs first in the pipeline, the entities it sets define the constraints for the model. So basically, if your rules already label "Acme Adventures International Ltd.", the entity recognizer will accept that and "predict around it". So in an ideal scenario, this lets you get the best of both worlds: use rules for the unambiguous cases and get improved statistical predictions for the remaining tokens.
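In code, that setup might look something like this (a sketch only - the pattern and example text are just for illustration):

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_lg")
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "ORG", "pattern": "Acme Adventures International Ltd."}])
nlp.add_pipe(ruler, before="ner")  # run the rules before the statistical NER

doc = nlp("Analysts expect Acme Adventures International Ltd. to report earnings next week.")
print([(ent.text, ent.label_) for ent in doc.ents])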

This is great. Thanks to both @ines and @htebmal for the pointers!

One last question is whether it’s better to do the initial ner.match task with full article content, single sentences, or the leading paragraph (as @htebmal suggests).

Ooh, that’s just a pre-processing step for us. We’re interested in extracting content about the main event of an article (in our case, transactions). It tends to be that journalists/press relations will say what we need to know in the first paragraph. The rest tends to provide surplus information that we care less about (due to poor coverage).

I think in the same way that the EntityRuler can make sure you don’t miss anything, for us this step makes sure we don’t include anything irrelevant. But if you care about information in the rest of the article, definitely include it.

So I spent some time trying to replicate @htebmal’s suggestions, but I think I must be doing something incorrectly. Here are the steps I took:

Step 1: Dump company names

I took a subset of company name records (about 45k), ran them through the spaCy tokenizer and created a patterns file that looked like this:

{"label": "COMPANY", "pattern": [{"lower": "Ratemyagent.com"}, {"lower": "Pty.,"}, {"lower": "Ltd."}]}
{"label": "COMPANY", "pattern": [{"lower": "Groupe"}, {"lower": "Qualiconsult"}]}
{"label": "COMPANY", "pattern": [{"lower": "JEG's"}, {"lower": "Automotive,"}, {"lower": "Inc.,"}]}
{"label": "COMPANY", "pattern": [{"lower": "D"}, {"lower": "and"}, {"lower": "F"}, {"lower": "Equipment"}, {"lower": "Sales,"}, {"lower": "Inc."}]}
{"label": "COMPANY", "pattern": [{"lower": "Quil"}, {"lower": "Health"}]}

This file is called company_name_patterns.jsonl

Step 2: Collect News Paragraphs

Next, I grabbed a random sample of 2000 news articles from our news corpus, extracting the headline paragraph. This file is called text/first-paragraphs-20190328.jsonl

Step 3: Create the new dataset

prodigy dataset company_ner

Step 4: Run ner.match

prodigy ner.match company_ner en_core_web_lg text/first-paragraphs-20190328.jsonl --patterns company_name_patterns.jsonl

This eventually launched the webserver and I was able to start annotating. After 16 examples, the UI gave me the “No Tasks Available” message. So I killed that process and moved on.

Step 5: Run batch-train

prodigy ner.batch-train company_ner en_core_web_lg --label COMPANY -o init_model

This returned with this result, which looks odd to me:

Using 1 labels: COMPANY

Loaded model en_core_web_lg
Using 50% of accept/reject examples (0) for evaluation
Dropout: 0.2  Batch size: 4  Iterations: 10


BEFORE     0.000
Correct    0
Incorrect  0
Entities   0
Unknown    0


01         0.000      0          0          0          0          0.000
02         0.000      0          0          0          0          0.000
03         0.000      0          0          0          0          0.000
04         0.000      0          0          0          0          0.000
05         0.000      0          0          0          0          0.000
06         0.000      0          0          0          0          0.000
07         0.000      0          0          0          0          0.000
08         0.000      0          0          0          0          0.000
09         0.000      0          0          0          0          0.000
10         0.000      0          0          0          0          0.000

Correct    0
Incorrect  0
Baseline   0.000
Accuracy   0.000

Model: /Users/avollmer/Development/spacy-ner/init_model
Training data: /Users/avollmer/Development/spacy-ner/init_model/training.jsonl

The fact that it’s all zeros in the output makes me think I didn’t do something correctly. Note that I tried running @htebmal’s suggested command-line examples, but the CLI tool rejected several of the arguments I tried for ner.match and ner.batch-train. For those I just removed unrecognized arguments until the process launched.

Any ideas?

…and of course, not two minutes after I posted this, I realized that my company_name_patterns.jsonl didn’t include lower-case tokens, despite the declaration of LOWER in the file. I fixed that, ended up with way more annotations and am actually getting something useful out of ner.batch-train.
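For anyone following along, the fix amounted to lower-casing the token text when building the patterns, roughly like this (company_names here is the 45k-name subset mentioned above):

import json
import spacy

nlp = spacy.load("en_core_web_lg")  # only the tokenizer is used here

with open("company_name_patterns.jsonl", "w") as f:
    for name in company_names:
        pattern = [{"lower": token.lower_} for token in nlp.make_doc(name)]
        f.write(json.dumps({"label": "COMPANY", "pattern": pattern}) + "\n")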

However I do have a couple of follow-up questions:

  1. The text we process will often (but not always) include the full company name, including what we call “corporate stop words” (corp, llc, ltd, etc.). Strictly speaking, matching on “Acme Adventures, Inc.” is more correct than just “Acme Adventures”. But if spaCy is matching on the latter, that works for us too. If so, should I be accepting those matches during the training phase? (my instincts tell me, yes)

  2. How easy is it to mix a model trained on this new COMPANY entity with the default one? We would also like to be able to match on a handful of other standard entity types like PERSON, DATE and MONEY.

So definitely reject those with the wrong boundary - you want a consistent training set and uniform treatment where possible (plus, this stops the model suggesting both ‘Acme’ and ‘Acme Corp’ as NEs). As @ines says in almost every talk, make sure you don’t go accepting anything you wouldn’t want back! If you’re concerned about too many negative samples, you can always skip some samples to maintain a ratio (this is imperfect but shouldn’t harm your model).

Re your question 2, this is all to do with catastrophic forgetting - I’ve not had to deal with this before but there’s loads in the forums about this to look at.

I’ve also noticed some full stops/commas inside your sample patterns - it might be worth checking whether those patterns actually match anything. My code assumed no punctuation, and tokenization around punctuation can be inconsistent (so try to do what you can to stop it affecting your model).
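A quick way to sanity-check a pattern is to build it with the same tokenizer that runs at match time and confirm it matches the raw company name - something like:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
name = "JEG's Automotive, Inc."

pattern = [{"lower": t.lower_} for t in nlp.make_doc(name)]
matcher = Matcher(nlp.vocab)
matcher.add("COMPANY", None, pattern)

print(matcher(nlp.make_doc(name)))  # should contain a match spanning the full name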

Another idea: why not try using some match patterns like so:

{"label": "COMPANY", "pattern": [{"is_title":true}, {"orth": "Corp"}]}
{"label": "COMPANY", "pattern": [{"is_title":true},{"is_title":true}, {"orth": "Corp"}]}

This will capture one or two title-cased words followed by ‘Corp’; you could extend this to, say, three tokens, then iterate over the relevant suffixes and pipe the results into ner.match. This may pick up more matches than you’d get from your wordlist if it’s giving you few matches. You can always bundle them together.

I've been digging through posts and the docs to find an example of how to use EntityRuler with Prodigy. I've seen you mention approaches like the one above, but I can't find any code examples of how to implement it with Prodigy.

For example, I can start with:

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "APPLE_OS", "pattern":[{"text": "iOS"},{"LIKE_NUM": True}]}]
            
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

# See the results and entity tagging
doc = nlp(row_text)
print([token.ent_type_ for token in doc])

And then I know how to play with this nlp and see how it works, but how do I save the model (or do I just save the patterns, and how do I export them?) and then train it further / correct it with Prodigy annotations in the loop?

EDIT:

I seem to have solved my own problem: it looks like all I need to add to the above code is something like: nlp.to_disk('custom_model') and then run something like prodigy ner.teach os_tagging custom_model test_data.jsonl --label APPLE_OS

Feel free to correct me or extend this idea if you stumble on this @ines !


Yes, calling nlp.to_disk is all you need to save the model with the entity ruler – the patterns will be serialized as JSON automatically and stored with the model, and then loaded back in. That was part of the concept of the entity ruler, to make it super convenient to use :slightly_smiling_face:

While you can do this, keep in mind that this isn't actually using the entity ruler. The entity ruler is fully rule-based and only sets the doc.ents manually based on your patterns. It can't learn anything, and you also can't update it in the loop. So the suggestions you see when running ner.teach are only the ones made by the statistical model, and only the statistical model will be updated in the loop.

However, you could run ner.make-gold, which will pre-highlight whatever it finds in the doc.ents. This will also include the entities set by your entity ruler. You can then correct them and create gold-standard data. Ideally, this is much faster than labelling everything from scratch. (Even if your entity ruler only gets 50% of the entities, that's still 50% less work for you!)
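Using the names from your example above, that could look something like this (os_tagging_gold is just a placeholder for a fresh dataset):

prodigy ner.make-gold os_tagging_gold custom_model test_data.jsonl --label APPLE_OS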

So, I experimented with this today, but in the end, I'm a bit lost in the ideal pipeline and which components to use.

Right now, my dataset is something like 40 million records of text, each from one sentence to one paragraph in length.

I have a list of 15,000 patterns created which identify brand names as 'BRAND'. This data lives in a Google Sheet that is added to daily by a number of people on our team as new brands are discovered.

I have a model, brand_tag_alpha_2019_06_04, which has been been trained via ner.teach with about 10,000 annotations.

I'm pulling the Google Sheet, tokenizing it into the patterns format like so: {'label': 'BRAND', 'pattern': [{'orth': 'Coca'}, {'orth': '-'}, {'orth': 'Cola'}]},, and then adding an EntityRuler which uses those patterns, then exporting the model:

from spacy.pipeline import EntityRuler

nlp = spacy.load('brand_tag_alpha_2019_06_04')

ruler = EntityRuler(nlp)
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, before='ner')

nlp.to_disk('ner_model_alpha_2019_06_06')

So, now I have a trained model, with the new EntityRuler in the pipeline.

My question is: How do I most efficiently annotate and improve the model from here while taking advantage of EntityRuler ?

I've now read that ner.teach ignores EntityRuler -- is there a method by which the model can continue to be trained by an annotator that doesn't ignore EntityRuler, where we can get the benefit of EntityRuler 'learning the constraints of the entities' and thus don't double-duty on annotations that should already be caught by exact pattern matching?

It's completely possible I'm missing the ideal workflow here w.r.t. EntityRuler and Prodigy at large, but I completely appreciate your continued illumination @ines! Thanks again.

Ah, I think that might be a small misunderstanding: The entity ruler itself doesn't learn anything, just like the statistical entity recognizer doesn't "learn" anything from the entity ruler. But if entities are already set in a previous pipeline step and the statistical entity recognizer encounters them, it will "predict around them" and use them as constraints for its predictions. This means that a pre-trained statistical NER model may produce better results and make fewer mistakes, because some of the wrong predictions it would have made otherwise are now impossible or very unlikely.

For a quick overview of how the entity labels are predicted and what the BILUO scheme (e.g. B-PERSON etc.) means, see my comment here: EntityRuler causes NER entities to go missing · Issue #3775 · explosion/spaCy · GitHub

To give you an example, let's say you have a sentence like: "He works at John Doe's ACME Inc.". Your model may analyse it like this and incorrectly predict "John Doe's ACME Inc." as a company (which isn't even so far-fetched, but it's obviously wrong):

["He", "works", "at", "John", "Doe", "'s", "ACME", "Inc."]  # Tokens
["O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "L-ORG"]  # Predicted entity tags

Now imagine you have the entity ruler in the pipeline before the named entity recognizer and "ACME Inc." is covered by a pattern. It'll assign the entity for those tokens (B-ORG, i.e. beginning of an entity, and L-ORG, i.e. last token of an entity):

["He", "works", "at", "John", "Doe", "'s", "ACME", "Inc."]  # Tokens
["?", "?", "?", "?", "?", "?", "B-ORG", "L-ORG"]  # Entity tags added by the EntityRuler

Next, the statistical entity recognizer is applied and encounters this state. Predicting "John Doe's ACME Inc." as a company is now impossible, because we already know that "ACME" is a B-ORG, i.e. the beginning of an ORG entity span. So it can't also be inside an entity span. The entity recognizer will only fill in the gaps and predict the labels for the other tokens – and if you're lucky, the correct analysis where "John Doe" is a person will now be a lot more likely in this context:

["He", "works", "at", "John", "Doe", "'s", "ACME", "Inc."]  # Tokens
["?", "?", "?", "?", "?", "?", "B-ORG", "L-ORG"]  # Entity tags added by the EntityRuler
["O", "O", "O", "B-PERSON", "L-PERSON", "O", "B-ORG", "L-ORG"]  # Final entity tags with predictions

It's the same model with the same weights – but it's able to make better predictions because your rules have defined better constraints for them at runtime.


From this gold-standard data, is it possible to then do something like to-patterns in order to extend our patterns.jsonl file? Likewise, is it possible to mark a term as "definitely not" the entity in question, if I see that the model has incorrectly predicted it?

Right now, I find myself randomly paging through predictions like so in a Jupyter Notebook:

random_index = np.random.randint(1, len(text_json))
doc = nlp(text_json[random_index]['text'])

for ent in doc.ents:
    print(ent.text, ent._.entity_norm, ent.label_)    

# Visualise the entities
colors = {"BRAND": "linear-gradient(90deg, #aa9cfc, #fc9ce7)"}
options = {"ents": ["BRAND"], "colors": colors}
spacy.displacy.render(doc, style='ent', jupyter=True, options=options)

And then when I find a brand name that doesn't exist in my entity normalization dict (using code inspired by this post, except the entity norm outputs N/A if it cannot be normalized, meaning we don't have a record of the entity yet), I then manually add it to the spreadsheet that my patterns.jsonl is created from.

As I've been sitting here doing that, I assume this must be a task I can manage more effectively with the Prodigy methodology somehow.

Is it possible to only annotate examples that were not caught by the EntityRuler and then export those annotations to patterns.jsonl ?

Can't thank you @ines enough for your on-going support, and apologies for hijacking this thread -- although I do think it's still quite relevant to the original topic :sweat_smile:

Yes, check out this user-contributed recipe that implements a terms.manual-to-patterns workflow:

Basically, all the recipe does is read in a dataset annotated with ner.manual, and then create a pattern for each of the annotated spans in it. You might want to adjust the code so it only creates a pattern if it doesn't exist yet (otherwise, you'll get duplicates).
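One way to do that (just a sketch, not part of the recipe itself) is to track the serialized patterns in a set and only keep the first occurrence:

import json

seen = set()
unique_patterns = []
for pattern in patterns:  # patterns: list of {"label": ..., "pattern": [...]} dicts
    key = json.dumps(pattern, sort_keys=True)
    if key not in seen:
        seen.add(key)
        unique_patterns.append(pattern)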

Another idea could be to implement a custom recipe that uses spaCy's Matcher or PhraseMatcher to find and pre-highlight spans and uses the manual NER interface. If any entities are missing, you can highlight them manually. When the examples are sent back to the server, you can extract those and add them to the Matcher. So you'd be using an update callback just like the active learning recipes – only that you're not updating your model, but a matcher in the loop. The new matcher will then be applied to the stream and the longer you annotate, the fewer spans you'd ideally have to highlight manually.

Here's a code example to illustrate the idea – haven't tested it yet, but something like this should work:

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
import srsly


@prodigy.recipe("manual-match")
def manual_match(dataset, source, spacy_model):
    nlp = spacy.load(spacy_model)
    matcher = Matcher(nlp.vocab)

    def add_spans_to_stream(stream):
        # Run matcher over each example in the stream and add a "spans" property
        # to each task that includes the matched spans (so they're pre-highlighted).
        for eg in stream:
            doc = nlp.make_doc(eg["text"])
            matches = matcher(doc)
            matched_spans = [Span(doc, start, end, label=match_id) 
                             for match_id, start, end in matches]
            spans = [
                {
                    "start": span.start_char, 
                    "end": span.end_char, 
                    "label": span.label_, 
                    # Indicate that this was added automatically
                    "by_matcher": True
                }
                for span in matched_spans]
            eg["spans"] = spans
            yield eg

    def update(answers):
        # Update the matcher with patterns based on highlighted spans in the
        # annotations that come back
        for answer in answers:
            text = answer["text"]
            for span in answer.get("spans", []):
                # Only add new manually added spans, not the ones that were set
                # automatically
                if not span.get("by_matcher"): 
                    doc = nlp.make_doc(text[span["start"]:span["end"]])
                    label = span["label"]
                    pattern = [{"lower": token.lower_} for token in doc]
                    matcher.add(label, None, pattern)

    def on_exit(ctrl):
        # When the Prodigy server is stopped, serialize the patterns to a file
        result = []
        for label_id, patterns in matcher._patterns.items():
            label = nlp.vocab.strings[label_id]
            for pattern in patterns:
                result.append({"label": label, "pattern": pattern})
        srsly.write_jsonl("/path/to/patterns.jsonl", result)
    
    stream = JSONL(source)
    stream = add_tokens(nlp, stream)
    stream = add_spans_to_stream(stream)

    return {
        "dataset": dataset,
        "stream": stream,
        "update": update,
        "on_exit": on_exit
        "view_id": "ner_manual",
        "config": {
            "batch_size": 3  # low batch size so we see results faster
        }
    }

Edit: Actually, here's a much simpler version using the EntityRuler. Newly annotated spans are added to the entity ruler as patterns, and in the end, it's serialized out as a JSONL file.

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
import spacy

@prodigy.recipe("manual-match")
def manual_match(dataset, source, spacy_model):
    nlp = spacy.load(spacy_model)  # Let's assume this has an entity ruler
    ruler = nlp.get_pipe("entityruler")

    def add_spans_to_stream(stream):
        # Add a "spans" property to each task that includes the doc.ents 
        # (so they're pre-highlighted).
        for eg in stream:
            doc = nlp.make_doc(eg["text"])
            eg["spans"] = [
                {
                    "start": ent.start_char, 
                    "end": ent.end_char, 
                    "label": ent.label_, 
                    # Indicate that this was added automatically
                    "by_model": True
                }
                for ent in doc.ents]
            yield eg

    def update(answers):
        # Update the matcher with patterns based on highlighted spans in the
        # annotations that come back
        for answer in answers:
            text = answer["text"]
            patterns = []
            for span in answer.get("spans", []):
                # Only add new manually added spans, not the ones that were set
                # automatically
                if not span.get("by_model"):
                    doc = nlp.make_doc(text[span["start"]:span["end"]])
                    label = span["label"]
                    pattern = [{"lower": token.lower_} for token in doc]
                    patterns.append({ "label": label, "pattern": pattern })
            ruler.add_patterns(patterns)

    def on_exit(ctrl):
        # When the Prodigy server is stopped, serialize the patterns to a file
        ruler.to_disk("/path/to/patterns.jsonl")
    
    stream = JSONL(source)
    stream = add_tokens(nlp, stream)
    stream = add_spans_to_stream(stream)

    return {
        "dataset": dataset,
        "stream": stream,
        "update": update,
        "on_exit": on_exit
        "view_id": "ner_manual",
        "config": {
            "batch_size": 3  # low batch size so we see results faster
        }
    }

So I’m picking up this thread again because we’ve altered our approach slightly, but I could use some guidance on the best path forward. First, let me summarize our plan and what we’ve done so far:

  • Our goal is to build an NER that labels PERSON and a new entity COMPANY
  • We bootstrapped a dataset for COMPANY annotations using text that contained company names identified by a patterns file containing the specific company names that should be found (using ner.match)
  • Next, we dumped a different source of news along with a generic patterns file for matching on “corporate” names and annotated that with another round of ner.match

The two rounds of ner.match only included “COMPANY” for the --label argument.

Now I would like to build an evaluation dataset so that I can, eventually, run ner.batch-train. Our annotations thus far have been stored in the company_ner dataset and our evaluation set will be called company_ner_eval. So I ran ner.eval like so:

prodigy ner.eval company_ner_eval en_core_web_md ./business_news_eval.jsonl --label COMPANY,PERSON

When I ran this recipe, the annotation UI only asked me to label PERSON with no COMPANY suggestions. I’m guessing that this is because the model (“en_core_web_md”) doesn’t have the COMPANY label. So do I need to train an interim “company_ner” model with ner.batch-train so that I can create the evaluation set from which I will ultimately build the final model?

Furthermore, with a goal of training for a new entity type and supporting an existing one, are we using the right techniques to make sure our model supports both without the “catastrophic forgetting problem”?

Thanks!


I am having similar issues. I have been able to train my model using patterns and annotations in order to recognize a new entity. I call this entity “OCCUPATION” and it has been working successfully. However, when I run this model, although the “OCCUPATION” tagger works correctly, it has broken the original NER that was included in the en_core_web_lg.

As an experiment I tried training a model (using prodigy ner.batch-train) from the initial set of annotations I had gathered (~3000 annotations). After the model was built, I thought I would see if I could build an evaluation set with a new set of content (using prodigy ner.eval).

Once the annotation UI started, it began prompting me to accept/reject every token in the text as a COMPANY, which seemed like I had done something wrong. To confirm, I ran that same content through ner.print-stream and saw this:

Interestingly, when I page further through the output, there are some articles in which it looks like companies are being tagged correctly:

I can’t help but feel I’ve done something pathologically incorrect. Any ideas?

@alexvollmer The short answer is: yes, you’ll want to train a two-entity model with your COMPANY and PERSON labels.

You would have a lot of trouble starting from one of our pretrained models for your task, because your new entity COMPANY will overlap very strongly with the previous entity type ORG. Almost always when the model is predicting ORG you’ll be telling it no, that’s COMPANY – so the model will be very confused at first.

@jonathankingfc You should be able to have a spaCy pipeline with both NER taggers, running one after another. I’ve just had a look at this though, and there are some usability problems in spaCy around this that I need to take care of. The following snippet should work around the problems until I get this fixed.


import spacy
from spacy.pipeline.pipes import EntityRecognizer
from spacy.language import Language

class NonFinalEntityRecognizer(EntityRecognizer):
    """Predict entities, but mark non-predicted tokens as "uncertain", rather than definitively "O"."""
    name = "non_final_ner"
    @property
    def postprocesses(self):
        """These functions are run after the parsing is complete."""
        return [mark_non_final]

def mark_non_final(doc):
    # Now this is really obscure :(. When we assign the entities, in
    # the parser, we currently mark them as definitively O. But in the 
    # doc.ents setter, non-provided entities are marked as *missing*.
    # This is obviously not a great situation, so the hack here will likely
    # stop working in future versions.
    doc.ents = list(doc.ents)

Language.factories["non_final_ner"] = lambda nlp, **cfg: NonFinalEntityRecognizer(nlp.vocab, **cfg)

# With the factories entry set, we'll be able to do:
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(nlp.create_pipe("non_final_ner"), before="ner")

It’s important that the ‘non_final_ner’ gets added before the normal ner, as we’re setting a post-process to open up the predictions for the subsequent NER model.

If the factory setting is in place, you’ll also be able to have a serialized pipeline that includes a non_final_ner model. You can assemble the required model directory yourself rather easily: in the model directory that Prodigy saves out, rename the ner directory to non_final_ner, and then copy in the ner directory from the default en_core_web_lg model. Next, update the meta.json file so that both pipeline components are listed.
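A rough sketch of that directory surgery in Python (the paths are illustrative - adjust them to wherever your Prodigy model and en_core_web_lg actually live, and check that the pipeline list matches your model’s real components):

import json
import shutil

shutil.move("init_model/ner", "init_model/non_final_ner")
shutil.copytree("/path/to/en_core_web_lg/ner", "init_model/ner")

with open("init_model/meta.json") as f:
    meta = json.load(f)
meta["pipeline"] = ["tagger", "parser", "non_final_ner", "ner"]
with open("init_model/meta.json", "w") as f:
    json.dump(meta, f, indent=2)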

Once you’ve got the directory right, you should be able to use it with spacy.load(), so long as your script has already set the Language.factories["non_final_ner"] entry.

I know this is a lot of work for something that should be easy — we’ll definitely get this fixed in spaCy. But for now the above workflow should work around the problem.