I have a small data set: 10 data points on "Lipitor 5mg" (the same drug as Atorvastatin, but branded), 5 data points on "Atorvastatin 10mg", and 5 on "Atorvastatin 20mg". When I run a regular term.teach, they are all identified as a drug, but not as a statin, since I have too few data points.

{"label": "DRUG", "pattern": [{"lower": "lipitor"}, {"lower": "5mg"}]}
{"label": "DRUG", "pattern": [{"lower": "atorvastatin"}, {"lower": "10mg"}]}
{"label": "DRUG", "pattern": [{"lower": "atorvastatin"}, {"lower": "20mg"}]}

I would like to map all of these drugs to the single entry Atorvastatin in the vocabulary that is used later in the analysis. I tried

{"name": "Atorvastatin", "label": "DRUG", "patterns": [
[{"lower": "lipitor"}, {"lower": "5mg"}],
[{"lower": "atorvastatin"}, {"lower": "10mg"}],
[{"lower": "atorvastatin"}, {"lower": "20mg"}]
]}

but I don't know whether this gives me what I need, and I could not find a smart way to verify it. Ideally, the result for a sentence would contain only the drug Atorvastatin and nothing else.




I hope I understand the question correctly. But I think you probably want to focus on training your model to recognise DRUG (any drug) until it's reasonably good at it. You can then add a rule-based component on top later that normalises them and groups them into subtypes. I actually outlined a very similar approach here:
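As a rough illustration of that second, rule-based step (this is not the linked write-up, just a minimal sketch): assuming the model already returns DRUG entities as text, a small lookup table can map brand and generic names to one canonical entry and a subtype. The table `CANONICAL`, the `DOSE` regex, and the function `normalise` are all hypothetical names made up for this example, and the (text, label) pairs stand in for whatever your trained model predicts.

```python
import re

# Hypothetical lookup table: lowercased drug name -> (canonical entry, subtype).
CANONICAL = {
    "lipitor": ("Atorvastatin", "STATIN"),
    "atorvastatin": ("Atorvastatin", "STATIN"),
}

# Matches a dose such as "5mg" or "10 mg" so it can be dropped before the lookup.
DOSE = re.compile(r"\s*\d+(?:\.\d+)?\s*mg\b", re.IGNORECASE)


def normalise(entity_text, label):
    """Map a DRUG entity to its canonical vocabulary entry, or None if unknown."""
    if label != "DRUG":
        return None
    name = DOSE.sub("", entity_text).strip().lower()
    return CANONICAL.get(name)


# (text, label) pairs standing in for the entities a trained model might predict.
entities = [("Lipitor 5mg", "DRUG"), ("Atorvastatin 20mg", "DRUG"), ("Aspirin", "DRUG")]
for text, label in entities:
    print(text, "->", normalise(text, label))
```

Because the mapping lives outside the model, you can verify it directly by running it over your 20 examples and checking that every one resolves to Atorvastatin. Any drug missing from the table simply comes back as None.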