Hi! There are a few things here. First, there's a small problem with your patterns:

`{"label":producto","pattern":[{"lower":"panecillo sin gluten (producto)"}]}`

There's a `"` missing before `producto`, which makes the line invalid JSON. Also, each dict in the pattern is supposed to describe one token. So `"panecillo sin gluten (producto)"` would try to match a single token whose exact value is that whole string, which would likely never be true, because that phrase would be split into several tokens by the tokenizer. You can find more examples and background on token-based patterns here.
Another thing to consider: The phrases you have in your patterns are incredibly specific. I'm not sure how helpful they'd be if you're using them directly to find examples and candidates in your data. "furosemida, 10 mg/ml, solucion inyectable, ampolla de 5 ml (producto)" will find you this exact phrase. I'm no domain expert, so I don't know – but how common is this exact phrase really going to be in your data?
The idea of patterns is to help you find relevant examples for annotation. Something like `[{"is_digit": True}, {"lower": "ml"}]`, for instance, could be a useful pattern to find quantities like "10 ml" or "5ML". `"Schistosoma"` could be a pattern, too, if you're looking for medical terminology.
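To show what "one dict per token" means in practice, here's a toy sketch of token-based matching. This is a deliberately simplified illustration, not spaCy's actual implementation, and it only supports the two attributes used above:

```python
def token_attrs(token):
    """Compute the attributes a pattern dict can test against."""
    return {"lower": token.lower(), "is_digit": token.isdigit()}

def match(pattern, tokens):
    """Return (start, end) spans where a run of consecutive tokens
    satisfies the pattern, one dict per token."""
    spans = []
    for start in range(len(tokens) - len(pattern) + 1):
        if all(
            all(token_attrs(tokens[start + i]).get(key) == value
                for key, value in spec.items())
            for i, spec in enumerate(pattern)
        ):
            spans.append((start, start + len(pattern)))
    return spans

pattern = [{"is_digit": True}, {"lower": "ml"}]
tokens = "ampolla de 5 ml , solucion de 10 ML".split()
print(match(pattern, tokens))  # → [(2, 4), (7, 9)], i.e. "5 ml" and "10 ML"
```

Note how the same two-token pattern finds both "5 ml" and "10 ML", regardless of casing, which is exactly what a single hard-coded phrase can't do.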
None of the strings in your patterns are things that would typically be considered "named entities", so if you tried to train a model on those types of phrases in context, you probably wouldn't see very good results.
It might make sense to take a step back here and ask yourself: What exactly are you trying to achieve? What do you want your system to produce? Do you want to extract medical terminology? Do you want to map incoming texts to unique Snomed identifiers?
Training a statistical model can be useful if you want to be able to generalise based on examples of mentions in context. For example, if you wanted to train a system to recognise drug names, you could show it lots of examples of drug names mentioned in different contexts. Highlighting those drug names in your data is tedious, so if you already have a dictionary of drug names, you can convert it to patterns and use them to pre-select candidate mentions, so you don't have to do it all by hand. At the end of it, you'll have a large dataset of texts and the entity spans they contain.
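Converting a dictionary to patterns can be as simple as writing one JSONL line per term. A minimal sketch, where the `"DRUG"` label and the term list are made up for illustration (multi-word terms would need one dict per token):

```python
import json

# Made-up dictionary of single-word drug names.
drug_names = ["furosemida", "ibuprofeno", "paracetamol"]

# One patterns entry per term, serialised as JSONL lines.
pattern_lines = [
    json.dumps({"label": "DRUG", "pattern": [{"lower": name.lower()}]})
    for name in drug_names
]
for line in pattern_lines:
    print(line)
```

Each printed line is one entry of a patterns file you could feed to your annotation workflow.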
If you have large dictionaries of terms and you want to extract them from text, you might find that a rule-based approach actually works much better. It's more predictable and can achieve very comparable accuracy. You might want to check out spaCy's new EntityRuler for this: Rule-based matching · spaCy Usage Documentation
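A minimal EntityRuler sketch, assuming the current `add_pipe` API. The `"DRUG"` label and the example sentence are made up, and note that spaCy's matcher patterns use uppercase keys like `"LOWER"`, unlike the lowercase keys in the patterns file format above. A blank Spanish pipeline is enough here, since rule-based matching needs no trained model:

```python
import spacy

# Blank Spanish pipeline: tokenizer only, no trained components needed.
nlp = spacy.blank("es")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "DRUG", "pattern": [{"LOWER": "furosemida"}]},
])

doc = nlp("Administrar furosemida 10 mg por vía intravenosa.")
print([(ent.text, ent.label_) for ent in doc.ents])
```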
Especially if you're new to NLP, starting with a rule-based approach might really be a good idea. It'll give you the quickest results and you'll be able to get a good feeling for your data and for what's easy vs. what's difficult. Once you have a set of rules in place that works, you can always experiment with a statistical model later on, and use your existing rules to bootstrap it.