I’m working on a project in which we ingest news article content from a variety of sources on the web. We want to apply NER to the plain text to extract the names of companies mentioned in it, so I have begun using Prodigy + spaCy to train an entity recognizer.
From a previous NER effort, I have about 400 documents marked up in a format that I can convert to something that can be bulk-loaded into Prodigy. I also have plenty of news content that can be pumped into the ner.teach recipe (and a team of people to help annotate). I have a couple of questions about the best way to go about training this NER model.
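For context, here is a rough sketch of the conversion I have in mind for the existing documents. It assumes my old markup can be reduced to character offsets per document, and that the target is JSONL with a "text" field plus "spans" carrying "start", "end" and "label". The helper names and the sample sentence are made up for illustration.

```python
import json


def to_prodigy_example(text, company_spans, label="ORG"):
    """Build one task dict from (start, end) character offsets.

    The "text"/"spans" layout is my understanding of what Prodigy
    expects; the function itself is a hypothetical helper.
    """
    spans = []
    for start, end in company_spans:
        # Reject offsets that do not describe a real slice of the text.
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"bad span ({start}, {end}) for text of length {len(text)}")
        spans.append({"start": start, "end": end, "label": label})
    return {"text": text, "spans": spans}


def to_jsonl(examples):
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)


# Made-up sample: the full company name covers characters 0..34.
text = "Acme Adventures International Ltd. opened a new office."
example = to_prodigy_example(text, [(0, 34)])
jsonl = to_jsonl([example])
```

My plan would be to write that output to a file and import it into a dataset with something like prodigy db-in, assuming the offsets line up with spaCy’s token boundaries.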
First, is it better to bulk-load the 400 tagged documents and then run ner.teach, or vice versa, or does the order not matter?
Second, company names often show up in the news in a form like “Acme Adventures International Ltd.” I’ve noticed that spaCy will often tag just “Acme” or “Acme Adventures” as an organization, but not the full name of the company. In the ner.teach workflow my team can only accept or reject suggestions, and having watched the best practices video Ines put on YouTube, it sounds like the right approach is to reject these partial matches when they occur. Is that correct? If so, what’s the best way to train the entity recognizer to tag the entire company name as an organization?
It seems there are two approaches. One is to bulk-load more news content and use ner.manual to mark up more corporate names. The other is to use patterns, but I’m not sure I can capture the variety of corporate naming styles (Corp., Co., LLC, etc.) in patterns that are flexible enough, or that the pattern list won’t explode into 1000+ entries.
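To make the patterns idea concrete, this is roughly what I was picturing: one token pattern per suffix variant, each matching a run of title-cased tokens followed by the suffix. It assumes the patterns file is JSONL with "label" and "pattern" keys and that the spaCy Matcher attributes IS_TITLE, LOWER and OP behave as I expect. The suffix list is only illustrative, and I have not verified how spaCy tokenizes each suffix.

```python
import json

# Illustrative suffixes only; real news text will contain many more.
SUFFIXES = ["Corp", "Co", "LLC", "Ltd"]


def build_patterns(suffixes, label="ORG"):
    """Return one token pattern per suffix, with and without a trailing period."""
    patterns = []
    for suffix in suffixes:
        for variant in (suffix.lower(), suffix.lower() + "."):
            patterns.append({
                "label": label,
                "pattern": [
                    # One or more title-cased tokens, e.g. "Acme Adventures".
                    {"IS_TITLE": True, "OP": "+"},
                    # The corporate suffix, compared case-insensitively.
                    {"LOWER": variant},
                ],
            })
    return patterns


patterns = build_patterns(SUFFIXES)
patterns_jsonl = "\n".join(json.dumps(p) for p in patterns)
```

That yields two patterns per suffix, so the list grows with the number of suffixes rather than the number of companies, but it would still miss all-caps names and names with lowercase words in them, which is part of why I’m unsure about this route.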
Any suggestions are greatly appreciated!