[NER and Topic Classification] Methodological questions


I would like to train an NER model to classify organizations in the startup ecosystem (labels such as "Startup", "Incubator", "Startup Studio", "Investor", "University", "Public Institution"). I have a database of several million tweets and articles to train the model on.

I want to be able to add new NER labels easily (if a client asks for one, for example "Corporate" or "Accelerator"). Given that, I am wondering whether it makes sense to train one NER classifier per entity type, so that adding a new label does not require retraining everything.

I have a related issue for topic classification. I want a flexible methodology that makes it easy to add new topics over time. Here again, I'm thinking about one-vs-all binary classifiers (one per topic) instead of a single multi-class supervised classifier. They might also be easier to train, as the number of topics is already large (about 30).
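A minimal sketch of the one-classifier-per-topic idea, to make the trade-off concrete. Everything here is an assumption for illustration: the `BinaryNaiveBayes` and `TopicRegistry` names are hypothetical, the classifier is a toy bag-of-words Naive Bayes written with the standard library only, and a real system would use a proper text classifier.

```python
import math
from collections import Counter


class BinaryNaiveBayes:
    """Toy binary bag-of-words Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        # labels are booleans: True means the text belongs to the topic.
        self.counts = {True: Counter(), False: Counter()}
        self.docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(text.lower().split())
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def log_odds(self, text):
        """Estimated log-odds that the text belongs to the topic."""
        score = math.log(self.docs[True] + 1) - math.log(self.docs[False] + 1)
        totals = {
            c: sum(self.counts[c].values()) + len(self.vocab) for c in (True, False)
        }
        for token in text.lower().split():
            if token in self.vocab:  # unseen tokens are ignored
                score += math.log((self.counts[True][token] + 1) / totals[True])
                score -= math.log((self.counts[False][token] + 1) / totals[False])
        return score


class TopicRegistry:
    """Hypothetical container holding one independent binary classifier per topic."""

    def __init__(self):
        self.models = {}

    def add_topic(self, name, texts, labels):
        # Only the new topic's classifier is trained; existing ones are untouched.
        self.models[name] = BinaryNaiveBayes().fit(texts, labels)

    def predict(self, text, threshold=0.0):
        # A text may receive zero, one, or several topics.
        return sorted(
            name
            for name, model in self.models.items()
            if model.log_odds(text) > threshold
        )
```

With this layout, adding a topic means calling `add_topic` with positive and negative examples for that topic only. The cost is that each classifier is scored independently, so a text can receive several topics or none, and the scores of different classifiers are not directly comparable.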

What do you think of the advantages and drawbacks of those two methods? I would like a methodology that allows me to easily add new labels (or new topics) over time, whether because a client asks or because a new topic or type of institution emerges.

Thank you very much,

Hi Thomas,

I think the best approach for you will be to use only one label for the actual NER part, and then to do a separate linking step where you resolve the entity to a knowledge base and decide for each company in the knowledge base whether it's a startup, incubator, startup studio etc.

To explain this, let's use the term "entity" to refer to the actual thing in the real world (the organization or whatever), and the term "mention" to refer to the piece of text that refers to the entity in an article.

The classification into "Startup", "Incubator", "Startup Studio" etc. is a property of the entity, not a property of the mention. You can't have an entity that should be labelled a "Startup" in some discourse contexts but a "Startup Studio" in other discourse contexts, right? So you'll be able to control and reason about the behaviour of your system much better if you structure it to reflect this fact.

If you re-predict the classification on each mention, you'll have to deal with situations where the model is flipping its prediction for a single entity, even within a single document. This will be really difficult to work with, and you'll probably end up having to resolve the discrepancy with a post-process anyway.

In practice, the system for doing the "linkage" and classification will probably be quite simple. You can use a substring match system to associate entities within a single article. Especially within news texts, you'll usually get one full-name mention of the entity, often with a hyperlink to fully disambiguate. Then you'll have various shortened forms. You then just have to decide once whether all mentions of "DoorDash" refer to a startup. You probably only have a few thousand entities in your domain, so you can do the common ones manually, and then have pretty simple heuristics to handle the other ones.
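To make that concrete, here's a rough sketch of the substring linkage plus a one-time lookup per entity. It assumes the NER step has already produced a list of mention strings for one article. The function names, the "UNKNOWN" fallback, and the sample knowledge base (including the made-up "Example Ventures") are all hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so matching ignores case and spacing."""
    return " ".join(text.lower().split())


def link_within_article(mentions):
    """Map each mention to the longest mention in the same article that contains it."""
    # Longest first, so the first containing candidate is the fullest form.
    unique = sorted(set(mentions), key=lambda m: (-len(m), m))
    links = {}
    for mention in unique:
        needle = normalize(mention)
        # A mention always contains itself, so a match is guaranteed.
        links[mention] = next(c for c in unique if needle in normalize(c))
    return links


def classify_mentions(mentions, knowledge_base):
    """Assign every mention the type stored once for its linked entity."""
    kb = {normalize(name): label for name, label in knowledge_base.items()}
    links = link_within_article(mentions)
    return {
        mention: kb.get(normalize(full_name), "UNKNOWN")
        for mention, full_name in links.items()
    }


# Illustrative knowledge base: the type is decided once per entity, not per mention.
KNOWLEDGE_BASE = {
    "DoorDash Inc.": "Startup",
    "Example Ventures": "Investor",
}

article_mentions = ["DoorDash Inc.", "DoorDash", "doordash", "Example Ventures", "Acme"]
labels = classify_mentions(article_mentions, KNOWLEDGE_BASE)
```

Plain substring matching will over-match short names that happen to appear inside longer unrelated ones, so in a real system you'd want token-level matching or a minimum length on top of this. Mentions that come back as "UNKNOWN" are the ones to review manually or pass to your heuristics.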