Workflow for training NER on multiple entities

This tool has been awesome to get started quickly, experiment, and iterate on a fun information extraction project - thanks for your hard work!

First, I’m curious what to expect as we add more and more entities. For example, we started with PERSON, and trained that up to really good results by itself. Adding additional entities (pre-trained or not) hasn’t worked as well yet, but I have some ideas that I’m addressing separately, with some very helpful advice in this thread. I’m also wondering whether NER should be expected to get better or worse as more entities get added. I’m sure a lot depends on the quality of the annotations and how well the entities train independently, but I’m curious what to expect here at a high level.

And secondly, I think I saw some advice (sorry, can’t find where) to keep all of a project’s annotations together in one dataset, which has been my approach so far. But in my ideal scenario (from a data collection perspective), I’d like to build up a separate annotation dataset for each entity type, and then merge them (via some workflow like db-out -> concatenate jsonl files -> db-in) to create a composite dataset for training. This way, if (well, when) we change our mind about how to annotate entity XYZ, that decision (and the resulting re-annotation work) would only affect a portion of the annotations rather than all of them. And I think what I’m reading here is that this incremental approach makes sense?

Thanks for the kind words! Glad to hear you’ve had good results on the PERSON entity.

In general I think tasks do get harder as more entity types are added, especially if the types get fine-grained and some of the entities are rare. The model can also struggle when new entity types are introduced to a pre-trained model, if little training data is available. A new entity type may force the model to update a lot of its weights, which can lead it to forget the previous categories unless you mix in text that the model itself has previously annotated, so that it keeps seeing examples of the old categories.
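As a rough illustration of that mixing step, here is a minimal sketch in plain Python. It assumes you already have two lists of examples: your new annotations, and "rehearsal" examples produced by running the existing model over raw text and keeping its predictions. The function name `mix_with_rehearsal` and the `per_new` ratio are made up for this example and are not part of any library.

```python
import random


def mix_with_rehearsal(new_examples, rehearsal_examples, per_new=1.0, seed=0):
    """Return new examples shuffled together with a sample of rehearsal examples.

    per_new is the number of rehearsal examples to add per new example
    (an assumed knob for this sketch; tune it on your own data).
    """
    rng = random.Random(seed)
    # Never ask for more rehearsal examples than are available.
    n_rehearsal = min(len(rehearsal_examples), int(len(new_examples) * per_new))
    mixed = list(new_examples) + rng.sample(list(rehearsal_examples), n_rehearsal)
    rng.shuffle(mixed)
    return mixed


# Hypothetical data: dicts with a "text" key and character-offset "spans".
new = [{"text": f"new {i}", "spans": []} for i in range(4)]
old = [{"text": f"old {i}", "spans": []} for i in range(10)]
training_data = mix_with_rehearsal(new, old, per_new=1.0)
```

The right ratio is something to find empirically; the point is only that the old categories stay represented in every training run.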

To answer your second question: It’s definitely fine to keep your annotations separate, although of course you’ll need to join them up when you go to train a combined model. I would suggest creating a dedicated evaluation set, though, so that you can directly compare results even when you’re running experiments under different configurations.
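Here is a minimal sketch of the join-and-hold-out step using only the Python standard library. It assumes each exported JSONL line is a JSON object with a "text" key and a "spans" list of dicts holding "start", "end" and "label"; adjust the keys if your export looks different. The helper names and the 20% evaluation fraction are placeholders for illustration.

```python
import hashlib
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def merge_by_text(*datasets):
    """Combine records that share the same text, pooling their spans."""
    merged = {}
    for records in datasets:
        for rec in records:
            entry = merged.setdefault(rec["text"], {"text": rec["text"], "spans": []})
            for span in rec.get("spans", []):
                if span not in entry["spans"]:  # skip exact duplicates
                    entry["spans"].append(span)
    for entry in merged.values():
        entry["spans"].sort(key=lambda s: (s["start"], s["end"]))
    return list(merged.values())


def is_eval(text, eval_fraction=0.2):
    """Decide from a stable hash of the text whether it is held out."""
    digest = hashlib.md5(text.encode("utf8")).hexdigest()
    return int(digest, 16) % 100 < eval_fraction * 100


def split_train_eval(records, eval_fraction=0.2):
    """Split records so the same text always lands on the same side."""
    train = [r for r in records if not is_eval(r["text"], eval_fraction)]
    evaluation = [r for r in records if is_eval(r["text"], eval_fraction)]
    return train, evaluation


# Hypothetical per-entity exports, shown inline instead of read from disk.
person = [{"text": "Ada met Bob.", "spans": [{"start": 0, "end": 3, "label": "PERSON"}]}]
org = [{"text": "Ada met Bob.", "spans": []},
       {"text": "Acme hired Ada.", "spans": [{"start": 0, "end": 4, "label": "ORG"}]}]
train, evaluation = split_train_eval(merge_by_text(person, org))
```

Because the split depends only on the text, re-exporting and re-merging after you re-annotate one entity type keeps the same texts in the evaluation set. One thing to watch: a text that was only annotated for one entity type will look like it contains none of the others after merging, so it's worth making sure each text has been seen for every label you train on.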