I have a conceptual question about NER models:
Our use case:
We are categorizing bank-turnover purpose texts (payment references) and extracting data from them. Currently, we can predict one or several categories for a bank turnover using a multilabel textcat model.
Additionally, we want to extract specific data from the bank turnovers, such as amount breakdowns (if present) or other details like customer and contract numbers. For this we use two NER models, each specialized for a different task and applied based on the recognized text categories.
For example, when the categorization model identifies a financing text category for a given bank turnover, we apply an NER model to extract INTEREST_AMOUNT and AMORTISATION_AMOUNT, which are often included in this context.
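The routing described above can be sketched as plain Python. The model names and the `route_extraction` helper are hypothetical; in a real setup each predictor would wrap a loaded spaCy pipeline, but here they are plain callables so the routing logic stands on its own:

```python
from typing import Callable, Dict, List, Tuple

# A predictor takes the turnover text and returns (label, value) pairs.
Predictor = Callable[[str], List[Tuple[str, str]]]

def route_extraction(
    categories: Dict[str, float],      # textcat output, e.g. doc.cats in spaCy
    ner_models: Dict[str, Predictor],  # one specialized NER model per category
    text: str,
    threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """Apply each category-specific NER model whose category fired."""
    entities: List[Tuple[str, str]] = []
    for category, score in categories.items():
        if score >= threshold and category in ner_models:
            entities.extend(ner_models[category](text))
    return entities

# Dummy predictors standing in for the specialized NER models.
financing_ner: Predictor = lambda text: [
    ("INTEREST_AMOUNT", "120.50"),
    ("AMORTISATION_AMOUNT", "879.50"),
]
tax_ner: Predictor = lambda text: [("TAX_TYPE", "VAT")]

models = {"financing": financing_ner, "tax": tax_ner}
result = route_extraction({"financing": 0.91, "tax": 0.12}, models, "...")
# Only the financing model fires; "tax" scored below the threshold.
```

This is essentially Approach 1 in code form: the dispatch table is where conflicting labels from separate models would have to be reconciled.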
We now face the need to extract more information, which could involve adding more NER models. For instance, we might add a model for tax turnovers to extract labels such as TAX_TYPE or similar.
The extracted labels typically belong to specific problem domains but are all derived from the same bank-turnover corpus.
Question:
Which approach is likely to yield the best results?
1. Maintain the current base text-categorization model that predicts categories, and based on these predictions, apply specific NER models. This would mean extending our current setup from two to five or six NER models.
2. Unify the two existing NER models into one, while keeping the base text-categorization model. To handle additional label extraction, we would add examples for the new extractions. If new labels overlap with existing ones, we would back-label the existing examples to avoid conflicting training data.
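The merge step for Approach 2 can be made concrete. The sketch below assumes character-offset span annotations of the form `(start, end, label)`; the `merge_spans` helper is hypothetical. Non-overlapping spans from the two annotation sets are combined, while overlapping pairs are returned separately so those examples can be queued for manual back-labelling rather than silently producing conflicting training data:

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)

def merge_spans(
    spans_a: List[Span], spans_b: List[Span]
) -> Tuple[List[Span], List[Span]]:
    """Combine two annotation sets for the same text.

    Returns (merged, conflicts): spans from B that overlap a span
    already in the merged set are treated as conflicts that need a
    human decision before the example enters the unified training set.
    """
    merged = list(spans_a)
    conflicts: List[Span] = []
    for start, end, label in spans_b:
        if all(end <= s or start >= e for s, e, _ in merged):
            merged.append((start, end, label))
        else:
            conflicts.append((start, end, label))
    return sorted(merged), conflicts

merged, conflicts = merge_spans(
    [(0, 6, "INTEREST_AMOUNT")],
    [(0, 6, "TAX_TYPE"), (10, 16, "CONTRACT_NO")],
)
# merged keeps the non-overlapping CONTRACT_NO span;
# the TAX_TYPE span collides with INTEREST_AMOUNT and is flagged.
```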
Reasoning behind the question:
• If we stick with Approach 1, we are concerned about conflicting labels predicted by the separate models, which would complicate handling in our rule-based downstream applications.
• If we switch to Approach 2, it simplifies ML operations (fewer models to maintain) and reduces the risk of conflicts. However, we're concerned about potential degradation in prediction quality, since the two specialized NER models currently perform well. Note also that one of the NER models uses a custom tokenizer, so we'd have to unify the tokenizers as well.