I trained a model to detect named entities like DRUG, DISEASES and HOSPITALS using Prodigy. I also have a component that uses the rule-based Matcher and PhraseMatcher to detect the same entities for simple known cases for which I have a list of patterns.
As the model was trained without any knowledge of the rule-based component, does the order in which the rule-based component is added to the pipeline matter? If I place the rule-based component before the ner component in the pipeline, will it throw off the ner component's ability to make predictions?
Yes, if your component runs before the named entity recognizer and sets annotations on doc.ents, those will be available when the entity recognizer runs, so it will essentially "predict around them" and respect the spans that are already set.
Whether this is good or bad really depends on your use case: sometimes, it can be very helpful to pre-label certain entities you know are always going to be correct. This can prevent the model from making certain mistakes, because the most likely interpretation of a given text will change if one or more entities are already set, and it also won't predict anything that conflicts with your existing matches.
However, there can also be cases where this leads to problems. For instance, imagine you have a phrase "ACME Cancer Center", which you want to label as HOSPITAL, but your patterns contain an entry for [{"lower": "cancer"}]. In that case, "Cancer" would be pre-labeled and the model would never get to consider "ACME Cancer Center" as an entity, even if this would otherwise be the most confident analysis. So if you're setting entities before the entity recognizer, you typically want those to be very specific and as unambiguous as possible (e.g. matching "ACME Cancer Center" would be a lot better here).
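For illustration, here's a minimal sketch of what such a pre-labelling component could look like, using the spaCy v3 API. The model name, the component name and the single pattern are placeholders for your own pipeline and patterns:

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")  # stand-in for your trained pipeline

# Only very specific, unambiguous phrases should go in here
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("HOSPITAL", [nlp.make_doc("ACME Cancer Center")])

@Language.component("golden_matcher")
def golden_matcher(doc):
    spans = [Span(doc, start, end, label="HOSPITAL")
             for match_id, start, end in matcher(doc)]
    # Entities set here are respected by the statistical NER downstream
    doc.ents = filter_spans(spans)
    return doc

# Run the matcher before the statistical named entity recognizer
nlp.add_pipe("golden_matcher", before="ner")
```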
The best way to find out what works best is to run separate evaluations with your different pipeline configurations (NER only, NER + matcher, matcher + NER and so on). You can use spacy evaluate or just write a simple Python script that runs the model over your evaluation data and compares the entity spans. It should also be very helpful to look at the mistakes in each case, to find out if there are any common error patterns that you can easily fix by adjusting your matcher rules.
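If you'd rather script it than use spacy evaluate, a rough sketch along these lines would give you the same precision/recall/F-score breakdown (spaCy v3 API, with a tiny made-up evaluation set in the (text, annotations) format):

```python
import spacy
from spacy.training import Example

# Made-up evaluation data: (text, {"entities": [(start, end, label)]})
eval_data = [
    ("Aspirin was administered at ACME Cancer Center.",
     {"entities": [(0, 7, "DRUG"), (28, 46, "HOSPITAL")]}),
]

nlp = spacy.load("en_core_web_sm")  # load each pipeline configuration in turn
examples = [Example.from_dict(nlp.make_doc(text), annots)
            for text, annots in eval_data]
scores = nlp.evaluate(examples)
print(scores["ents_p"], scores["ents_r"], scores["ents_f"])
print(scores["ents_per_type"])  # per-label breakdown, useful for error analysis
```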
Thank you for the detailed response @ines. Much appreciated
> Whether this is good or bad really depends on your use case: sometimes, it can be very helpful to pre-label certain entities you know are always going to be correct. This can prevent the model from making certain mistakes, because the most likely interpretation of a given text will change if one or more entities are already set, and it also won't predict anything that conflicts with your existing matches.
In my case, the component preceding the ner component matches only "golden" phrases, which I expect to always be correct and unambiguous.
> However, there can also be cases where this leads to problems. For instance, imagine you have a phrase "ACME Cancer Center", which you want to label as HOSPITAL, but your patterns contain an entry for [{"lower": "cancer"}]. In that case, "Cancer" would be pre-labeled and the model would never get to consider "ACME Cancer Center" as an entity, even if this would otherwise be the most confident analysis. So if you're setting entities before the entity recognizer, you typically want those to be very specific and as unambiguous as possible (e.g. matching "ACME Cancer Center" would be a lot better here).
I also have other components placed after the ner component in the pipeline. For example, there is an EntityRuler component that loads patterns from disk with overwrite_ents=False. I use the components after the ner as a fallback in case the model misses some entities.
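Concretely, the fallback setup looks roughly like this (spaCy v3 API; the model and patterns file names are placeholders):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in for the trained pipeline

# overwrite_ents=False: the ruler only fills in entities the model missed
# and never replaces spans the statistical NER has already predicted
ruler = nlp.add_pipe("entity_ruler", after="ner",
                     config={"overwrite_ents": False})
ruler.from_disk("patterns.jsonl")  # hypothetical patterns file
```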