Trying to train NER from a blank model for the Russian language


I’m trying to build an entity recognition model from a ‘cold start’ for the Russian language. Something is going wrong in my workflow; I studied the forum for several hours but failed to find a solution.

My config:

    spaCy version      2.0.11
    Location           /home/di/anaconda3/lib/python3.6/site-packages/spacy
    Platform           Linux-4.15.0-29-generic-x86_64-with-debian-buster-sid
    Python version     3.6.6
    Models             en, xx, xx_ent_wiki_sm, en_core_web_sm
    Prodigy 1.5.1

My workflow:

  1. Make a blank model and a JSONL file with the texts
import json
from pathlib import Path

import spacy
from spacy.matcher import PhraseMatcher

# make a blank model for the Russian language
model_nlp = spacy.blank('ru')

# convert the texts to JSONL format
nlp = spacy.load('model-ner-ru-drugs')  # load the blank Russian model
matcher = PhraseMatcher(nlp.vocab)  # no phrases added yet, so it matches nothing

texts = list(df_scu_new.name_temp_st2.unique())
# texts looks like this:
# ['викс актив амбромед 155мл 120мл сироп',
#  'иммуноглобулин против клещевого инцифалита 1:80 1мл №10 ампулы',
#  'натрия хлорид 0.9% 500мл №10 раствор инфузий пэт хемофарм',
#  'силкарен 0.1% 15г крем алтайвитамины', ...]

examples = []  # store annotation examples here

for text in texts:
    doc = nlp(text)
    matches = matcher(doc)
    examples.append({'text': text})

# write the examples to a JSONL file
# do I need to use json.dumps(line, ensure_ascii=False) to save it correctly?
Path('spacy_drug_names.jsonl').open('w', encoding='utf-8').write('\n'.join([json.dumps(line) for line in examples]))

JSONL file looks like this:

{"text": "\u0432\u0438\u043a\u0441 \u0430\u043a\u0442\u0438\u0432 \u0430\u043c\u0431\u0440\u043e\u043c\u0435\u0434 155\u043c\u043b 120\u043c\u043b \u0441\u0438\u0440\u043e\u043f"}
{"text": "\u0438\u043c\u043c\u0443\u043d\u043e\u0433\u043b\u043e\u0431\u0443\u043b\u0438\u043d \u043f\u0440\u043e\u0442\u0438\u0432 \u043a\u043b\u0435\u0449\u0435\u0432\u043e\u0433\u043e \u0438\u043d\u0446\u0438\u0444\u0430\u043b\u0438\u0442\u0430 1:80 1\u043c\u043b \u211610 \u0430\u043c\u043f\u0443\u043b\u044b"}
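On the ensure_ascii question in the code comment above: both forms are valid JSON and decode to the same text, so the \u-escaped file shown here is not corrupted, only harder to read. A minimal sketch using only the standard library and one of the sample texts; the helper name to_jsonl is made up for illustration:

```python
import json


def to_jsonl(records, ensure_ascii=True):
    """Serialize an iterable of dicts to a JSONL string, one object per line."""
    return '\n'.join(json.dumps(rec, ensure_ascii=ensure_ascii) for rec in records)


examples = [{'text': 'викс актив амбромед 155мл 120мл сироп'}]

escaped = to_jsonl(examples)                       # \uXXXX escapes, pure ASCII
readable = to_jsonl(examples, ensure_ascii=False)  # Cyrillic characters kept as-is

# Both variants decode to exactly the same Python objects.
assert [json.loads(line) for line in escaped.split('\n')] == examples
assert [json.loads(line) for line in readable.split('\n')] == examples
```

If the readable variant is written to disk, the file should be opened with encoding='utf-8', as the original code already does.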
  1. Annotate using ner.manual
prodigy ner.manual data-ner-ru-drugs model-ner-ru-drugs spacy_drug_names.jsonl --label "DRUG"

Everything works fine.

  1. Train the model with ner.batch-train on 1000 records
prodigy ner.batch-train data-ner-ru-drugs model-ner-ru-drugs

Correct    195
Incorrect  4
Baseline   0.181     
Accuracy   0.980 
  1. Export patterns
prodigy terms.to-patterns data-ner-ru-drugs prodigy_drug_patterns.jsonl

✨  Exported 1001 patterns

The label has the value null, which I don’t understand:

{"label":null,"pattern":[{"lower":"\u0432\u0438\u043a\u0441 \u0430\u043a\u0442\u0438\u0432 \u0430\u043c\u0431\u0440\u043e\u043c\u0435\u0434 155\u043c\u043b 120\u043c\u043b \u0441\u0438\u0440\u043e\u043f"}]}
  1. Trying ner.teach, but I got an error
prodigy ner.teach data-ner-ru-drugs model-ner-ru-drugs data-ner-ru-drugs.jsonl --label DRUG

ERROR: Can't find label 'DRUG' in model model-ner-ru-drugs
  1. Trying to export the database. Everything seems OK here
prodigy db-out ner_drugs_russian ner_drugs_russian.jsonl
{"text":"\u0432\u0438\u043a\u0441 \u0430\u043a\u0442\u0438\u0432 \u0430\u043c\u0431\u0440\u043e\u043c\u0435\u0434 155\u043c\u043b 120\u043c\u043b \u0441\u0438\u0440\u043e\u043f","_input_hash":1397428549,"_task_hash":698327807,"tokens":[{"text":"\u0432\u0438\u043a\u0441","start":0,"end":4,"id":0},{"text":"\u0430\u043a\u0442\u0438\u0432","start":5,"end":10,"id":1},{"text":"\u0430\u043c\u0431\u0440\u043e\u043c\u0435\u0434","start":11,"end":19,"id":2},{"text":"155\u043c\u043b","start":20,"end":25,"id":3},{"text":"120\u043c\u043b","start":26,"end":31,"id":4},{"text":"\u0441\u0438\u0440\u043e\u043f","start":32,"end":37,"id":5}],"spans":[{"start":0,"end":19,"token_start":0,"token_end":2,"label":"DRUG"}],"answer":"accept"}

Hi! Your workflow mostly looks good. I think there are just a few small misunderstandings, and a few places where your workflow is more complicated than it needs to be.

What exactly is in your spacy_drug_names.jsonl? Russian texts containing drug names?

Also, about the step where you export patterns: the terms.to-patterns recipe expects a dataset created with terms.teach, which consists of single words and a label. The idea is that you can use word vectors to find similar terms, say yes or no to them and then let Prodigy automatically create patterns for you. For example:

{"text": "fentanyl", "answer": "accept"}

… would become the following pattern if you run with --label DRUG:

{"label": "DRUG", "pattern": [{"lower": "fentanyl"}]}

However, in your case, you’re using the recipe on a dataset with manual NER annotations, so Prodigy will think that the whole text describes a pattern. You also didn’t pass in a label, so it ends up as null (None).
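To make the conversion described above concrete, here is a rough sketch of turning accepted terms into patterns. It only mirrors the behaviour described in this post and is not Prodigy's actual implementation; the function name terms_to_patterns is hypothetical.

```python
import json


def terms_to_patterns(records, label=None):
    """Turn accepted term records into match patterns (illustrative only)."""
    patterns = []
    for rec in records:
        if rec.get('answer') != 'accept':
            continue  # rejected or skipped terms produce no pattern
        # The whole "text" value becomes a single lowercase token pattern.
        patterns.append({'label': label, 'pattern': [{'lower': rec['text'].lower()}]})
    return patterns


# A terms.teach-style record gives a useful single-word pattern.
terms = [{'text': 'fentanyl', 'answer': 'accept'}]
print(json.dumps(terms_to_patterns(terms, label='DRUG')))

# A manual NER record carries the full description in "text". With no label
# passed in, the label is None, which JSON serializes as null.
ner_records = [{'text': 'викс актив амбромед 155мл 120мл сироп', 'answer': 'accept'}]
print(json.dumps(terms_to_patterns(ner_records), ensure_ascii=False))
```

The second call shows why the exported file contained whole product descriptions as single-token patterns with a null label.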

I’m not 100% sure what you were trying to do with the patterns step here. Patterns are useful if you want to train a new category from scratch, and don’t have any labelled data yet (only examples of the entities). The patterns then help Prodigy pre-select more examples, to make sure the model learns enough in the beginning. If you’ve already annotated examples by hand and you have enough data to pre-train the model, you can leave out this step.

As for the ner.teach error, I think the problem is that you’re loading in the blank Russian model, not the one you’ve already pre-trained in the step before. So Prodigy complains, because the model doesn’t yet know the label DRUG. If you set an output path when you train the model from your first dataset, Prodigy will save it out to a directory:

prodigy ner.batch-train data-ner-ru-drugs model-ner-ru-drugs --output /path/to/output_model

You can then use that pre-trained model in ner.teach to improve it, by passing in the path to the exported model directory:

prodigy ner.teach data-ner-ru-drugs /path/to/output_model data-ner-ru-drugs.jsonl --label DRUG

Hi Ines,

Thank you for the advice! ner.batch-train works now! You are correct, I used the blank model. I didn’t understand this from the user manual; my suggestion is to highlight that the first run must use the --output setting with a model path.

Regarding the JSONL file:
The "text" value is a product description string from a drug store. It includes the DRUG name, volume, percentage of the active component, product form, manufacturer, etc. It looks like this:

{"text":"эднит 20мг №28 таблетки гедеон рихтер"}
{"text":"зитазониум 20мг №30 таблетки"}
{"text":"aspirin №30 таблетки"}
{"text":"aspirin c 20мг №10 таблетки"}
{"text":"aspirin c forte 500мг №1 таблетки"}
{"text":"aspirin c double effect 2mg №15 таблетки"}

I also have a list of DRUG names, ['эднит', 'зитазониум', 'aspirin', 'aspirin c forte', 'aspirin c forte', etc.]. I can put them into the JSONL as labels, according to the link, but what is the correct next step?

{"text":"эднит 20мг №28 таблетки гедеон рихтер","spans":[{"start":0,"end":1,"label":"DRUG"}]}
{"text":"зитазониум 20мг №30 таблетки","spans":[{"start":0,"end":1,"label":"DRUG"}]}

Also, during ner.batch-train after manual annotation, I am unsure how to proceed. Some DRUG names are a single token and some are multiple tokens:

aspirin c 20мг №10 таблетки - predicted [aspirin c] - my answer: [YES]
aspirin c forte 500мг №1 таблетки - predicted [aspirin c] - my answer: [is it better to use NO or SKIP?]

Nice, glad to hear it works now!

The link you mention points to a very old thread, so you probably want to look at more recent discussions, or the docs, instead.

You always want to be annotating drug names in context – the model needs to see the full text, not just single words. This thread explains some more of the reasoning behind this, plus possible strategies. For example, you could create a patterns.jsonl file that looks like this:

{"label": "DRUG", "pattern": [{"lower": "aspirin"}]}
{"label": "DRUG", "pattern": [{"lower": "aspirin"}, {"lower": "c"}]}

When you run ner.teach, you can then stream in all of your data and set --patterns patterns.jsonl, to tell Prodigy to select examples in your data that match the patterns (so you can say yes or no to them).
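If you already have a list of drug names, a patterns file like the one above could be generated with a few lines of Python. This is only a sketch: it splits names on whitespace, which is an assumption and may not match how spaCy's tokenizer splits every name, and the helper name names_to_patterns is made up.

```python
import json


def names_to_patterns(names, label='DRUG'):
    """Build one token-based pattern per unique name (whitespace split assumed)."""
    patterns = []
    seen = set()
    for name in names:
        tokens = tuple(name.lower().split())
        if not tokens or tokens in seen:
            continue  # skip empty names and duplicates
        seen.add(tokens)
        patterns.append({'label': label, 'pattern': [{'lower': tok} for tok in tokens]})
    return patterns


drug_names = ['aspirin', 'aspirin c', 'aspirin c forte', 'aspirin c forte']
patterns = names_to_patterns(drug_names)

# One JSON object per line; ensure_ascii=False keeps Cyrillic names readable.
patterns_jsonl = '\n'.join(json.dumps(p, ensure_ascii=False) for p in patterns)
print(patterns_jsonl)
```

The resulting string can be saved as patterns.jsonl (opened with encoding='utf-8') and passed to --patterns.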

Another suggestion: If possible, try to make sure that your data includes a lot of other non-Cyrillic spans that are not DRUG entities. You don't want your model to learn that "every span consisting of Latin characters is a drug".

Where do the examples with "spans" come from? Did you create them manually? Entity spans are usually annotated as character offsets ("start" and "end"), so the first example labels only the character "э" instead of the full token "эднит".
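As a sketch of how such character offsets could be computed from a list of known names: the helper find_span is hypothetical, and the longest-name-first rule is an assumption chosen so that "aspirin c forte" wins over "aspirin c". It also assumes names and texts use the same casing, as in the lowercase samples in this thread.

```python
def find_span(text, names, label='DRUG'):
    """Return a span dict with character offsets for the longest matching name."""
    for name in sorted(names, key=len, reverse=True):  # try longest names first
        start = text.find(name)  # first occurrence only
        if start == -1:
            continue
        end = start + len(name)
        # Only accept matches that start and end on a whitespace boundary.
        starts_ok = start == 0 or text[start - 1].isspace()
        ends_ok = end == len(text) or text[end].isspace()
        if starts_ok and ends_ok:
            return {'start': start, 'end': end, 'label': label}
    return None


names = ['эднит', 'зитазониум', 'aspirin', 'aspirin c', 'aspirin c forte']

text = 'эднит 20мг №28 таблетки гедеон рихтер'
span = find_span(text, names)
example = {'text': text, 'spans': [span] if span else []}

# The slice defined by the offsets is the full drug name, not one character.
assert text[span['start']:span['end']] == 'эднит'
```

Note that "end" is exclusive, so "start": 0, "end": 5 covers all five characters of "эднит".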

If you're running ner.teach and the model suggests only partial spans, you should hit reject. This way, you're telling the model "nope, try again!". If you want your model to learn that the correct entity is "aspirin c forte", this is pretty important. Here's some more background on this: