Use one big NER model or a lot of smaller ones?

I have a conceptual question about NER models:

Our use case:

We are working on categorizing and extracting data from bank turnover purposes. Currently, we can predict one or several categories for a bank turnover using a multilabel textcat model.

Additionally, we want to extract specific data from the bank turnovers, such as amount breakdowns (if present) or other details like customer and contract numbers. We are using two NER models for this purpose, each specialized for different tasks and applied based on the recognized text categories.

For example, when the categorization model identifies a financing text category for a given bank turnover, we apply an NER model to extract INTEREST_AMOUNT and AMORTISATION_AMOUNT, which are often included in this context.
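To illustrate, the routing logic currently looks roughly like this (simplified sketch; model paths, the category name and the threshold are placeholders):

import spacy

# Simplified routing sketch; paths, category name and threshold are placeholders
nlp_textcat = spacy.load("models/textcat_multilabel/model-best")
nlp_ner_financing = spacy.load("models/ner_financing/model-best")

def extract(text):
    doc = nlp_textcat(text)
    entities = []
    # Only run the financing NER model if the financing category was predicted
    if doc.cats.get("FINANCING", 0.0) >= 0.5:
        ner_doc = nlp_ner_financing(text)
        entities = [(ent.label_, ent.text) for ent in ner_doc.ents]
    return doc.cats, entities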

We now face the need to extract more information, which could involve adding more NER models. For instance, we might add a model for tax turnovers to extract labels such as TAX_TYPE or similar.

The extracted labels typically belong to specific problem domains but are all derived from the same bank-turnover corpus.

Question:

Which approach is likely to yield the best results?

  1. Maintain the current base text-categorization model that predicts categories, and based on these predictions, apply specific NER models. This would mean extending our current setup from two to five or six NER models.

  2. Unify the two existing NER models into one, while keeping the base text-categorization model. To handle additional label extraction, we would add examples for the new extractions. If new labels overlap with existing ones, we would back-label the existing examples to avoid conflicting training data.

Reasoning behind the question:

• If we stick with Approach 1, we are concerned about conflicting predicted labels, which could complicate handling within our rule-based applications.

• If we switch to Approach 2, it simplifies ML operations (fewer models to maintain) and reduces the risk of conflicts. However, we’re concerned about potential degradation in prediction quality since the two specialized NER models currently perform well. BTW one of the NER models uses a custom tokenizer, so we'd have to unify those too.

This is a good question, and both approaches you describe are generally valid and common. I think a lot of it comes down to the data and the potential overlaps you mention. It can be an advantage to have a single component to prevent false positives, since only one label can apply at a time – but if that's not a problem you have, it's also less necessary. Also, if you expect that you might want to update and retrain your NER components separately from each other, it's potentially better to keep them separate: if things change and it becomes more difficult to improve one category while maintaining the accuracy of the other, you may end up with more operational complexity later on.

Have you tried it out and trained a single component with both labels, evaluated on your data? This should be relatively easy to do in Prodigy by providing both datasets to train. If the results are promising, you can pursue the idea further to get the benefit of the single, easier-to-maintain model.
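For example, assuming a recent Prodigy version, something along the lines of prodigy train ./output_merged --ner ner_financing,ner_contracts (the dataset names here are placeholders) would train a single NER component on both annotation sets, and you can then compare its scores against your current specialised models.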

Thanks for the reply @ines, I actually haven't tried it out yet, since the NER models don't share the same tokenizer, which means I'd have to relabel the first NER's examples based on the second NER's tokenizer.

If you say it's worth a shot, I could try. There aren't that many examples for the first NER model, roughly 1,000.
But I first wanted to ask which approach is valid.

Hi @toadle,

Let me take over from Ines here, if you don't mind.
One thing to keep in mind is that re-labelling is only necessary if applying the new tokenizer results in misaligned span labels, i.e. if after re-tokenization some NER span boundaries no longer line up with token boundaries.
You can quickly check whether that's the case for your dataset by trying to create spaCy spans from your annotations. Any span annotation that comes back as None needs relabelling, otherwise that example will be ignored during training. Here's an example function you can use for such a check:

import spacy

def count_misaligned_spans(examples):
    counter = 0
    nlp = spacy.load("path/to/custom_tokenizer/pipeline")
    for example in examples:
        doc = nlp(example["text"])
        for span in example.get("spans", []):
            # char_span returns None if the character offsets don't line up with token boundaries
            char_span = doc.char_span(span["start"], span["end"], span["label"])
            if char_span is None:
                counter += 1
                print(f"Misaligned span --> {example['text']}, {span}")
    print(f"Found {counter} misaligned spans")

If relabelling isn't too much effort, it's definitely worth checking whether a model trained on the combined NER dataset is as good as the individual NER models (or good enough).
This is a tradeoff between the operational cost of maintaining separate NER models, prediction performance, and the cost of back-labelling in the case of category overlap. As @ines mentioned, it depends on your particular data. If you expect a lot of overlap, a lot of back-labelling effort and more custom tokenization needs, then it's probably better to keep the models separate. If not, then it's definitely worth trying to have just one model in production. The good thing is that such an experiment should be fairly cheap to run.
Finally, one more option could be a hybrid approach: when you need to add a new domain, you can evaluate how the merged model performs across all domains. If the results are satisfactory, go ahead with the merge; if not, keep that particular domain separate. In other words: merge the domains where it's practical and keep independent those that don't mix well. For that I really recommend having an eval project (e.g. a spaCy project) in place so that you can run these reproducible evals without too much overhead.
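For example, a minimal sketch of such an eval (model paths and the .spacy dev file are placeholders) could compare the merged pipeline against one of the specialised pipelines on the same held-out data:

import spacy
from spacy.training import Corpus

def evaluate_pipeline(model_path, dev_path):
    nlp = spacy.load(model_path)
    corpus = Corpus(dev_path)
    # Returns a dict of scores, e.g. ents_p, ents_r, ents_f and ents_per_type
    return nlp.evaluate(list(corpus(nlp)))

merged = evaluate_pipeline("models/ner_merged/model-best", "corpus/financing_dev.spacy")
single = evaluate_pipeline("models/ner_financing/model-best", "corpus/financing_dev.spacy")
print("merged ents_f:", merged["ents_f"], "vs. financing-only ents_f:", single["ents_f"])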