Manual NER with a huge number of entities


We are going to use a Named Entities interface with manual entity selection.

Our entity count is huge (about 3000 entities), and it is not possible to place them all above the text. But they can be grouped into a few classes.

Is it possible to create a list of sub-entities for each entity, so every user can select a class of entity first, and then pick an exact entity?

In cases like this, we always recommend starting with a smaller label set and gradually expanding it by making more passes over the data. You don’t want to overwhelm your annotators with hundreds or thousands of possible options. You also don’t want them to spend a lot of time navigating down a tree to select a very specific sub-label. At least, this is what Prodigy’s philosophy tries to avoid by letting you write semi-automated workflows.

If your categories are hierarchical, you can start off by annotating the top-level categories. You can then make another pass over each top-level category and go more specific.

Maybe you can even find a smarter semi-automated way to prevent human error and take care of more categories at once. For instance, if you know that a PERSON entity “Obama” is always also a PERSONPOLITICIAN, there’s no point in asking the annotator about every single instance of that entity. If anything, you probably introduce more errors that way. Instead, you could frame this task as a multiple-choice question using the "choice" interface, and only ask about every entity once. You might still end up with some ambiguous examples you want to ignore, but you can always deal with those afterwards.
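
To make the "ask about every entity once" idea a bit more concrete, here is a minimal sketch in plain Python. It assumes your first pass produced examples with a "text" and a list of "spans" (each with "start", "end" and "label"), and that tasks for the "choice" interface are dicts with a "text" and a list of "options". The `SUBLABELS` mapping, the `make_choice_tasks` helper and the "meta" keys are made up for illustration, so adapt them to your own scheme.

```python
from collections import Counter

# Hypothetical sub-labels per top-level label; replace with your own hierarchy.
SUBLABELS = {
    "PERSON": ["PERSONPOLITICIAN", "PERSONATHLETE", "PERSONOTHER"],
    "ORG": ["ORGCOMPANY", "ORGGOVERNMENT", "ORGOTHER"],
}


def make_choice_tasks(examples):
    """Create one multiple-choice task per unique (entity text, top-level label)."""
    counts = Counter()
    for eg in examples:
        for span in eg.get("spans", []):
            entity_text = eg["text"][span["start"]:span["end"]]
            counts[(entity_text, span["label"])] += 1
    tasks = []
    # Most frequent entities first, so the early answers cover the most spans.
    for (entity_text, label), freq in counts.most_common():
        options = [{"id": sub, "text": sub} for sub in SUBLABELS.get(label, [])]
        if options:
            tasks.append({
                "text": f"{entity_text} ({label}, seen {freq}x)",
                "options": options,
                # Assumed bookkeeping fields, used later to map answers back to spans.
                "meta": {"entity": entity_text, "top_label": label},
            })
    return tasks


examples = [
    {"text": "Obama met Merkel.",
     "spans": [{"start": 0, "end": 5, "label": "PERSON"},
               {"start": 10, "end": 16, "label": "PERSON"}]},
    {"text": "Obama spoke today.",
     "spans": [{"start": 0, "end": 5, "label": "PERSON"}]},
]
tasks = make_choice_tasks(examples)
```

You could write `tasks` out as JSONL and load that file as the source for a multiple-choice session. The annotator then sees "Obama" once instead of once per mention.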

Once you’re done, you can export the annotations and have a script that combines them all into one large dataset.
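
A combining script could look roughly like the sketch below. It assumes two exported sets of records: span annotations with top-level labels, and multiple-choice answers where the selected option IDs are stored under "accept" and the decision under "answer". It also assumes each choice answer carries the entity text and top-level label under a "meta" field, which is a convention invented here for illustration, as is the `merge_annotations` helper itself.

```python
def merge_annotations(span_examples, choice_answers):
    """Replace top-level span labels with the sub-label chosen for that entity."""
    # Build (entity text, top-level label) -> sub-label from unambiguous answers only.
    mapping = {}
    for ans in choice_answers:
        selected = ans.get("accept", [])
        if ans.get("answer") == "accept" and len(selected) == 1:
            key = (ans["meta"]["entity"], ans["meta"]["top_label"])
            mapping[key] = selected[0]

    merged = []
    for eg in span_examples:
        new_spans = []
        for span in eg.get("spans", []):
            key = (eg["text"][span["start"]:span["end"]], span["label"])
            # Fall back to the top-level label if no sub-label was decided.
            new_spans.append({**span, "label": mapping.get(key, span["label"])})
        merged.append({**eg, "spans": new_spans})
    return merged


span_examples = [
    {"text": "Obama met Merkel.",
     "spans": [{"start": 0, "end": 5, "label": "PERSON"},
               {"start": 10, "end": 16, "label": "PERSON"}]},
]
choice_answers = [
    {"meta": {"entity": "Obama", "top_label": "PERSON"},
     "accept": ["PERSONPOLITICIAN"], "answer": "accept"},
    {"meta": {"entity": "Merkel", "top_label": "PERSON"},
     "accept": [], "answer": "ignore"},
]
merged = merge_annotations(span_examples, choice_answers)
```

Ignored or ambiguous entities simply keep their top-level label here, so you can pick them out later and handle them in a separate pass.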

(Btw, just a quick note. You’re probably aware of this since you’ve been working with this label scheme, but just in case others come across this thread later: 3000 distinct, separate categories is most likely not suitable for a standalone Named Entity Recognition task. There’s not really a point in actually training a model on all of these categories. Instead, this type of problem is usually solved as an entity linking task, where top-level entities are predicted, linked to a knowledge base and later populated with more fine-grained categories.)