Location of rule-based component in the pipeline

I trained a model to detect named entities like DRUG, DISEASES and HOSPITALS using Prodigy. I also have a component that uses the rule-based Matcher and PhraseMatcher to detect the same entities for simple, known cases for which I have a list of patterns.

As the model was trained without any knowledge of the rule-based component, does the order in which the rule-based component is added to the pipeline matter? If I place the rule-based component before the ner component in the pipeline, will it throw off the ner component's ability to make predictions?

Yes, if your component runs before the named entity recognizer and sets annotations on doc.ents, those will be available when the entity recognizer runs, so it will essentially "predict around them" and respect the spans that are already set.
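For example, here's a quick sketch of that setup, assuming spaCy v2.x and a hypothetical "ACME Cancer Center" phrase (your own Matcher/PhraseMatcher component will obviously differ):

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")  # placeholder: load your Prodigy-trained model here

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("HOSPITAL", None, nlp("ACME Cancer Center"))  # hypothetical "golden" phrase

def rule_based_ner(doc):
    # Turn matches into entity spans and set them on the doc. Because this runs
    # before the statistical NER, the model will predict "around" these spans.
    spans = [Span(doc, start, end, label=match_id)
             for match_id, start, end in matcher(doc)]
    doc.ents = filter_spans(spans)  # drop overlapping matches
    return doc

nlp.add_pipe(rule_based_ner, before="ner")
print(nlp.pipe_names)  # [..., 'rule_based_ner', 'ner', ...]
```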

Whether this is good or bad really depends on your use case: sometimes, it can be very helpful to pre-label certain entities you know are always going to be correct. This can prevent the model from making certain mistakes, because the most likely interpretation of a given text will change if one or more entities are already set, and it also won't predict anything that conflicts with your existing matches.

However, there can also be cases where this leads to problems. For instance, imagine you have a phrase "ACME Cancer Center", which you want to label as HOSPITAL, but your patterns contain an entry for [{"lower": "cancer"}]. In that case, "Cancer" would be pre-labeled and the model would never get to consider "ACME Cancer Center" as an entity, even if this would otherwise be the most confident analysis. So if you're setting entities before the entity recognizer, you typically want those to be very specific and as unambiguous as possible (e.g. matching "ACME Cancer Center" would be a lot better here).
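To make that concrete, here's roughly what the two kinds of token patterns look like for the Matcher (using the v2-style add signature; the hospital name is of course just an example):

```python
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

# Too general: every standalone "cancer" token gets pre-labeled, so the model
# can never predict the longer span "ACME Cancer Center" around it.
too_general = [{"lower": "cancer"}]

# Specific and unambiguous: only the full hospital name is pre-labeled.
specific = [{"lower": "acme"}, {"lower": "cancer"}, {"lower": "center"}]

matcher.add("HOSPITAL", None, specific)
```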

The best way to find out what works best is to run separate evaluations with your different pipeline configurations (NER only, NER + matcher, matcher + NER and so on). You can use spacy evaluate or just write a simple Python script that runs the model over your evaluation data and compares the entity spans. It should also be very helpful to look at the mistakes in each case, to find out if there are any common error patterns that you can easily fix by adjusting your matcher rules.
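For the script approach, a minimal sketch along these lines could work (the model paths and the evaluation example are hypothetical, and the gold spans are character offsets):

```python
import spacy

# Hypothetical evaluation data: (text, set of gold (start_char, end_char, label)) pairs
eval_data = [
    ("Patient was admitted to ACME Cancer Center.", {(24, 42, "HOSPITAL")}),
]

def entity_spans(nlp, text):
    # Predicted entities as comparable (start_char, end_char, label) tuples
    return {(ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents}

# Assumes you've saved the different pipeline configurations to disk
for name, path in [("ner_only", "models/ner_only"), ("matcher_then_ner", "models/matcher_ner")]:
    nlp = spacy.load(path)
    correct = missed = spurious = 0
    for text, gold in eval_data:
        pred = entity_spans(nlp, text)
        correct += len(pred & gold)
        missed += len(gold - pred)
        spurious += len(pred - gold)
    print(name, "correct:", correct, "missed:", missed, "spurious:", spurious)
```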


Thank you for the detailed response @ines. Much appreciated!

Whether this is good or bad really depends on your use case: sometimes, it can be very helpful to pre-label certain entities you know are always going to be correct. This can prevent the model from making certain mistakes, because the most likely interpretation of a given text will change if one or more entities are already set, and it also won't predict anything that conflicts with your existing matches.

In my case, the component preceding the ner component only matches "golden" phrases, which I expect to always be correct and unambiguous.

However, there can also be cases where this leads to problems. For instance, imagine you have a phrase "ACME Cancer Center", which you want to label as HOSPITAL, but your patterns contain an entry for [{"lower": "cancer"}]. In that case, "Cancer" would be pre-labeled and the model would never get to consider "ACME Cancer Center" as an entity, even if this would otherwise be the most confident analysis. So if you're setting entities before the entity recognizer, you typically want those to be very specific and as unambiguous as possible (e.g. matching "ACME Cancer Center" would be a lot better here).

I also have other components placed after the ner component in the pipeline. For example, there is an EntityRuler component that loads patterns from disk with overwrite_ents=False. I use the components after the ner as a fallback in case the model misses some entities.
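Roughly like this (a sketch with a placeholder patterns file; in spaCy v2 the relevant argument is overwrite_ents):

```python
from spacy.pipeline import EntityRuler

# Fallback ruler: with overwrite_ents=False it only adds matches that don't
# overlap entities already predicted by the statistical NER.
fallback_ruler = EntityRuler(nlp, overwrite_ents=False).from_disk("patterns.jsonl")
nlp.add_pipe(fallback_ruler, after="ner")
```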

The best way to find out what works best is to run separate evaluations with your different pipeline configurations (NER only, NER + matcher, matcher + NER and so on). You can use spacy evaluate or just write a simple Python script that runs the model over your evaluation data and compares the entity spans.

I had not thought about this. Will get on it.
