Hi,
I’m trying to build an entity recognition model for Russian from a ‘cold start’. Something goes wrong in my workflow, though, and after several hours of searching the forum I couldn’t find a solution.
My config:
spaCy version 2.0.11
Location /home/di/anaconda3/lib/python3.6/site-packages/spacy
Platform Linux-4.15.0-29-generic-x86_64-with-debian-buster-sid
Python version 3.6.6
Models en, xx, xx_ent_wiki_sm, en_core_web_sm
Prodigy 1.5.1
My workflow:
- Make a blank model and a JSONL file with the texts
import spacy
from spacy.lang import ru

# make a blank model for the Russian language
model_nlp = spacy.blank('ru')
model_nlp.add_pipe(model_nlp.create_pipe('ner'))
model_nlp.begin_training()
model_nlp.to_disk('model-ner-ru-drugs')

# convert the texts to JSONL format
from spacy.matcher import PhraseMatcher
nlp = spacy.load('model-ner-ru-drugs')  # load the blank Russian model saved above
texts = list(df_scu_new.name_temp_st2.unique())  # df_scu_new is my DataFrame of product names
'''
['викс актив амбромед 155мл 120мл сироп',
'иммуноглобулин против клещевого инцифалита 1:80 1мл №10 ампулы',
'натрия хлорид 0.9% 500мл №10 раствор инфузий пэт хемофарм',
'силкарен 0.1% 15г крем алтайвитамины',...]
'''
import json
from pathlib import Path

examples = []  # store annotation examples here
for text in texts:
    examples.append({'text': text})

# write the examples to a JSONL file
# do I need to use json.dumps(line, ensure_ascii=False) to save it correctly?
with Path('spacy_drug_names.jsonl').open('w', encoding='utf-8') as f:
    f.write('\n'.join(json.dumps(line) for line in examples))
The JSONL file looks like this:
{"text": "\u0432\u0438\u043a\u0441 \u0430\u043a\u0442\u0438\u0432 \u0430\u043c\u0431\u0440\u043e\u043c\u0435\u0434 155\u043c\u043b 120\u043c\u043b \u0441\u0438\u0440\u043e\u043f"}
{"text": "\u0438\u043c\u043c\u0443\u043d\u043e\u0433\u043b\u043e\u0431\u0443\u043b\u0438\u043d \u043f\u0440\u043e\u0442\u0438\u0432 \u043a\u043b\u0435\u0449\u0435\u0432\u043e\u0433\u043e \u0438\u043d\u0446\u0438\u0444\u0430\u043b\u0438\u0442\u0430 1:80 1\u043c\u043b \u211610 \u0430\u043c\u043f\u0443\u043b\u044b"}
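On the ensure_ascii question above: as far as I understand, the \uXXXX escapes are just how json.dumps writes non-ASCII characters by default, and json.loads turns them back into the same string. A minimal sketch using only the standard library (the sample text is one of the product names above; nothing here is Prodigy-specific):

```python
import json

text = "викс актив амбромед 155мл 120мл сироп"

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps({"text": text})
# With ensure_ascii=False the Cyrillic characters are written as-is.
readable = json.dumps({"text": text}, ensure_ascii=False)

# The two lines look different on disk but decode to the same data.
same = json.loads(escaped) == json.loads(readable) == {"text": text}

print(escaped)
print(readable)
print(same)
```

So ensure_ascii=False should only change how readable the file is in an editor, not what gets loaded from it, provided the file is written and read as UTF-8.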
- Annotate using ner.manual
prodigy ner.manual data-ner-ru-drugs model-ner-ru-drugs spacy_drug_names.jsonl --label "DRUG"
Everything works fine here.
- Train the model with ner.batch-train on 1000 records
prodigy ner.batch-train data-ner-ru-drugs model-ner-ru-drugs
Correct 195
Incorrect 4
Baseline 0.181
Accuracy 0.980
- Export patterns
prodigy terms.to-patterns data-ner-ru-drugs prodigy_drug_patterns.jsonl
✨ Exported 1001 patterns
prodigy_drug_patterns.jsonl
The label has the value null, which I don’t understand:
{"label":null,"pattern":[{"lower":"\u0432\u0438\u043a\u0441 \u0430\u043a\u0442\u0438\u0432 \u0430\u043c\u0431\u0440\u043e\u043c\u0435\u0434 155\u043c\u043b 120\u043c\u043b \u0441\u0438\u0440\u043e\u043f"}]}
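For comparison, spaCy's token-based match patterns use one dict per token, so a whole phrase inside a single "lower" value would have to match one token. Below is a hedged sketch of building pattern lines with a non-null label and one dict per token. Splitting on whitespace is only an assumption standing in for the real tokenizer, and make_pattern is a hypothetical helper, not part of Prodigy or spaCy.

```python
import json


def make_pattern(phrase, label):
    """Build one pattern entry: a label plus one dict per whitespace-separated token."""
    tokens = [{"lower": tok} for tok in phrase.lower().split()]
    return {"label": label, "pattern": tokens}


# Hypothetical seed phrases taken from the product names above
phrases = ["викс актив", "натрия хлорид"]
lines = [json.dumps(make_pattern(p, "DRUG"), ensure_ascii=False) for p in phrases]

print("\n".join(lines))
```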
- Try ner.teach, but get an error
prodigy ner.teach data-ner-ru-drugs model-ner-ru-drugs data-ner-ru-drugs.jsonl --label DRUG
ERROR: Can't find label 'DRUG' in model model-ner-ru-drugs
- Try exporting the dataset; everything seems to be OK here
prodigy db-out ner_drugs_russian ner_drugs_russian.jsonl
{"text":"\u0432\u0438\u043a\u0441 \u0430\u043a\u0442\u0438\u0432 \u0430\u043c\u0431\u0440\u043e\u043c\u0435\u0434 155\u043c\u043b 120\u043c\u043b \u0441\u0438\u0440\u043e\u043f","_input_hash":1397428549,"_task_hash":698327807,"tokens":[{"text":"\u0432\u0438\u043a\u0441","start":0,"end":4,"id":0},{"text":"\u0430\u043a\u0442\u0438\u0432","start":5,"end":10,"id":1},{"text":"\u0430\u043c\u0431\u0440\u043e\u043c\u0435\u0434","start":11,"end":19,"id":2},{"text":"155\u043c\u043b","start":20,"end":25,"id":3},{"text":"120\u043c\u043b","start":26,"end":31,"id":4},{"text":"\u0441\u0438\u0440\u043e\u043f","start":32,"end":37,"id":5}],"spans":[{"start":0,"end":19,"token_start":0,"token_end":2,"label":"DRUG"}],"answer":"accept"}
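To double-check that the annotations really carry the DRUG label, the exported lines can be inspected with plain Python. This is a minimal sketch, assuming each line has the "text", "spans" and "answer" fields shown above; the record is shortened to those fields, and in practice the lines would be read from the exported file instead of the inline string.

```python
import json
from collections import Counter

# One db-out line, shortened to the fields used here
line = (
    '{"text": "викс актив амбромед 155мл 120мл сироп", '
    '"spans": [{"start": 0, "end": 19, "label": "DRUG"}], '
    '"answer": "accept"}'
)

label_counts = Counter()
highlighted = []
for raw in [line]:
    record = json.loads(raw)
    # Only accepted answers count as positive annotations
    if record.get("answer") != "accept":
        continue
    for span in record.get("spans", []):
        label_counts[span["label"]] += 1
        # Character offsets index directly into the original text
        highlighted.append(record["text"][span["start"]:span["end"]])

print(label_counts)
print(highlighted)
```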