NER with dozens of entities


We plan to use Prodigy for NER, but we have around 40 different entity types to extract from large documents (20+ pages). We split the documents into pages, but the UI does not seem to handle this many entities well: the 40 labels hide the text to annotate, so the UI is not usable.

Is there a workaround, like an option to put the labels on the side instead of above the text?

I had the same situation, but with image annotation. @ines gave me info that helped me out: Regroup and change positions of labels in image_manual tasks?
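Building on that: Prodigy lets you override the app's appearance via `custom_theme` and `global_css` in your `prodigy.json`. A minimal sketch of the kind of override that helped in my case is below. Note that the `cardMaxWidth` value and the CSS selector are placeholders, not guaranteed class names: inspect the DOM of your running Prodigy app to find the actual element that wraps the label buttons before copying this.

```json
{
  "custom_theme": {
    "cardMaxWidth": 1200
  },
  "global_css": ".prodigy-labels { max-height: 120px; overflow-y: auto; }"
}
```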


As a more general comment/question: Is your goal to actually train an NER model? If so, are you sure you want to structure your annotation task and model this way and have that many labels? You might be making your life a lot harder this way, because you'll need significantly more data and need to make sure you have enough examples per label so the model can distinguish between them. If the labels are hierarchical, you'd also be encoding multiple aspects within each annotation that the model has to learn: every layer of the hierarchy, plus the token boundaries. For each token, the model has to make all of these decisions, which can make it harder to achieve accurate results.

If your label scheme is hierarchical, you could experiment with starting off with the top level, which is also typically the most crucial one (e.g. is it a product or a plant?). Once you have a model that can predict these with sufficient accuracy, you can add a second step to predict/determine the lower level categories. For example, given an entity span with the label PLANT, which type of plant is it?
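To make the two-step idea concrete, here's a minimal, library-independent sketch. It assumes step one (a trained NER model or ruleset) has already produced spans with coarse labels; step two refines them. The `SUBTYPE_RULES` table, the label names, and the `refine` function are all made up for illustration; in practice step two could be a classifier rather than a lookup.

```python
# Step two of a two-step scheme: refine coarse entity labels into subtypes.
# In a real setup this could be a trained classifier; here it's a lookup
# table, purely to illustrate the structure. All names are illustrative.
SUBTYPE_RULES = {
    "PLANT": {"oak": "TREE", "rose": "FLOWER"},
    "PRODUCT": {"laptop": "ELECTRONICS"},
}

def refine(entities):
    """entities: list of (text, coarse_label) pairs from step one."""
    refined = []
    for text, label in entities:
        # Look up a subtype for this coarse label; None if unknown
        subtype = SUBTYPE_RULES.get(label, {}).get(text.lower())
        refined.append((text, label, subtype))
    return refined

coarse = [("Oak", "PLANT"), ("laptop", "PRODUCT"), ("ivy", "PLANT")]
print(refine(coarse))
# [('Oak', 'PLANT', 'TREE'), ('laptop', 'PRODUCT', 'ELECTRONICS'), ('ivy', 'PLANT', None)]
```

The benefit is that each step is a smaller, easier problem: the NER model only has to learn the top-level distinction and the token boundaries, and the subtype decision happens on an already-identified span.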

For some labels, you might find that a simple rule-based approach (e.g. dictionary lookup) performs much better. For other labels, you can make a prediction given what you already know, which can be a much easier prediction problem, since there's only a small number of labels to pick from. Depending on the problem, using a text classifier to predict categories over the whole text and context can often be useful as well to add more concrete information to the entity predictions.
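For the dictionary-lookup case, something as simple as the following sketch can go a long way (in spaCy you'd typically use the `PhraseMatcher` or `EntityRuler` for this; the terms and labels below are made-up examples):

```python
import re

# Minimal dictionary-lookup extractor: for labels with a closed,
# known vocabulary, a lookup table can beat a statistical model.
LOOKUP = {
    "steel": "MATERIAL",
    "aluminium": "MATERIAL",
    "iso 9001": "STANDARD",
}

# One case-insensitive regex matching any known term on word boundaries
pattern = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in LOOKUP) + r")\b",
    re.IGNORECASE,
)

def extract(text):
    """Return (matched_text, label, start, end) for every dictionary hit."""
    return [
        (m.group(0), LOOKUP[m.group(0).lower()], m.start(), m.end())
        for m in pattern.finditer(text)
    ]

print(extract("The frame is made of Steel and certified to ISO 9001."))
# [('Steel', 'MATERIAL', 21, 26), ('ISO 9001', 'STANDARD', 44, 52)]
```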


Hi @ines,

I want to thank you for this extended answer, this is very valuable.

Our labels are not hierarchical and the problem is quite hard: extracting a large number (~40) of technical fields from long documents that come in various templates (the entities stay the same but the structure of the document can change).

We already partly implemented a rule-based approach to extract some entities. However, it requires building specific rules for each template, which is very time-consuming (a linear amount of work w.r.t. the number of templates). The idea of using NER was to build a solution that can generalize across the various templates.

With your answer, I understand that the NER state of the art may not provide a standalone solution and that we need to combine it with either template-specific rules or a first classification layer. I very much welcome any other advice/resources for approaching this problem in a good way.

Hi @Rolodex , if I may add my two cents :slight_smile:

My first suggestion would have also been to check whether some kind of hierarchical scheme to classify the entities would make sense, as that would benefit both the annotation step as well as the actual training. But I understand that this may not be an option in your case.

With respect to rule-based vs. NER: taking a step back and ignoring technical/time constraints for a second, I think the main motivation behind choosing rules or ML is the amount of ambiguity and variability in your texts and entities. If there are words or patterns that are highly likely to lead to correct entities, it certainly makes sense to draft rules for those. Even though that may be time-consuming, rules have the advantage of being more understandable to end users, as well as (usually) providing high precision.

On the other hand, if you have words/patterns that are ambiguous, and it depends on the context whether they should be annotated as an entity or not, then I would definitely rely on ML algorithms. The same is true if you have entities with a large lexical variation, say company names. It's almost impossible to write rules to find all company names in a text, because again this highly depends on the context. Likewise, as a human you would be perfectly able to deduce a company name from the surrounding sentence, even if you had never heard of the company. That to me is a good sign it should be an ML (NER) model.
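In practice the two approaches combine well: run the high-precision rules first, and only keep model predictions that don't overlap a rule match. A rough sketch of that merging logic, with the rule matcher and the NER model both stubbed out (all function names, terms, and labels here are made up for illustration):

```python
# Hybrid pipeline sketch: rules take precedence, a (stubbed) ML model
# fills in everything the rules don't cover.

def rule_entities(text):
    """Stand-in for high-precision rules: exact string matches."""
    ents = []
    for term, label in [("ISO 9001", "STANDARD")]:
        i = text.find(term)
        if i != -1:
            ents.append((i, i + len(term), label, "rule"))
    return ents

def model_entities(text):
    """Stand-in for an NER model's predictions."""
    preds = []
    for term, label in [("Acme Corp", "ORG"), ("ISO 9001", "PRODUCT")]:
        i = text.find(term)
        if i != -1:
            preds.append((i, i + len(term), label, "model"))
    return preds

def combine(text):
    ents = rule_entities(text)
    taken = [(s, e) for s, e, *_ in ents]
    for s, e, label, src in model_entities(text):
        # Keep a model prediction only if it doesn't overlap a rule match
        if all(e <= ts or s >= te for ts, te in taken):
            ents.append((s, e, label, src))
    return sorted(ents)

print(combine("Acme Corp is certified to ISO 9001."))
# [(0, 9, 'ORG', 'model'), (26, 34, 'STANDARD', 'rule')]
```

Note how the model's (wrong) `PRODUCT` guess for "ISO 9001" is discarded because the rule already covers that span; this is the "rules for high precision, ML for ambiguity" split from the discussion above.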

Hope that helps :wink: