Nice, glad to hear it works now!
This link points to a very old thread, so you probably want to look at more recent discussion, or the docs instead.
You always want to annotate drug names in context: the model needs to see the full text, not just single words. This thread explains some more of the reasoning behind this, plus possible strategies. For example, you could create a patterns.jsonl file that looks like this:
{"label": "DRUG", "pattern": [{"lower": "aspirin"}]}
{"label": "DRUG", "pattern": [{"lower": "aspirin"}, {"lower": "c"}]}
When you run ner.teach, you can then stream in all of your data and set --patterns patterns.jsonl to tell Prodigy to select examples in your data that match the patterns (so you can say yes or no to them).
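If you have a longer list of drug names, you don't have to write the patterns file by hand. Here's a minimal sketch that builds token patterns in the same format as above. The helper names (make_patterns, to_jsonl) and the list of drug names are made up for illustration, and splitting on whitespace is only an assumption about how the names get tokenized, so check the result against your model's tokenizer.

```python
import json


def make_patterns(names, label="DRUG"):
    """Build one token-based pattern per drug name (hypothetical helper)."""
    patterns = []
    for name in names:
        # Assumption: whitespace splitting approximates the tokenizer.
        tokens = [{"lower": token} for token in name.lower().split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Illustrative names only; replace them with your own list.
drug_names = ["aspirin", "aspirin c", "aspirin c forte"]
patterns = make_patterns(drug_names)
print(to_jsonl(patterns))
```

You can save the printed output as patterns.jsonl. Including the longer variants like "aspirin c forte" as their own patterns means the full phrase can be matched, not just the first token.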
Another suggestion: If possible, try to make sure that your data includes a lot of other non-Cyrillic spans that are not DRUG entities. You don't want your model to learn that "every span consisting of Latin characters is a drug".
Where do these examples come from? Did you create them manually? I'm asking because entity spans are usually annotated as character offsets ("start" and "end"), so the first example here labels only the character "э" instead of the full token "эднит".
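To make the offsets point concrete, here's a small sketch of how "start" and "end" map onto the text. The sentence and the find_span helper are made up for illustration, and str.find only locates the first occurrence without checking token boundaries, so treat it as a sanity check and not as a full annotation tool.

```python
def find_span(text, phrase, label="DRUG"):
    """Return a span dict for the first occurrence of phrase, or None."""
    start = text.find(phrase)
    if start == -1:
        return None
    # "end" is exclusive, so text[start:end] gives back the phrase.
    return {"start": start, "end": start + len(phrase), "label": label}


# Made-up example sentence containing the drug name.
text = "Пациент принимает эднит ежедневно"

# Span covering the full token.
span = find_span(text, "эднит")

# Span that is one character long, like the problem described above.
bad_span = {"start": span["start"], "end": span["start"] + 1, "label": "DRUG"}

print(text[span["start"]:span["end"]])          # эднит
print(text[bad_span["start"]:bad_span["end"]])  # э
```

Slicing the text with each span's offsets is a quick way to check that your examples cover the whole entity before you train on them.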
If you're running ner.teach and the model suggests only partial spans, you should hit reject. This way, you're telling the model "nope, try again!". If you want your model to learn that the correct entity is "aspirin c forte", this is pretty important. Here's some more background on this: