@honnibal, I did not try using the disk yet. I will do that after this message.
I have read a lot about NLP and similar tasks, but I have not practised much yet, so to make sure I am doing things more or less right, let me summarise what I did. I used the w2v models from http://bio.nlplab.org/, which can be downloaded here: http://evexdb.org/pmresources/vec-space-models/
I also tried a smaller PubMed w2v bin. I transformed both into spaCy models as follows:
```python
from gensim.models import KeyedVectors
import spacy

# load_word2vec_format() already returns a KeyedVectors instance,
# so the vocabulary lives on w2v directly (no .wv attribute needed)
w2v = KeyedVectors.load_word2vec_format('PubMed-w2v.bin', binary=True)
# w2v = KeyedVectors.load_word2vec_format('wikipedia-pubmed-and-PMC-w2v.bin', binary=True)

nlp = spacy.load("en_core_web_sm", vectors=False)
for word in w2v.vocab:
    nlp.vocab.set_vector(word, w2v.word_vec(word))

nlp.to_disk('pubmed_w2v')
# nlp.to_disk('wp_pubmed_pmc_w2v')
```
The first folder that was generated is 2.3 GB, the other one 5 GB.
I have prepared a list of patterns containing abbreviations as well as single- and multi-word disease names:
```
{"label":"DISEASE","pattern":[{"lower":"asthma"}]}
{"label":"DISEASE","pattern":[{"lower":"acute"},{"lower":"bronchitis"}]}
{"label":"DISEASE","pattern":[{"lower":"acute"},{"lower":"respiratory"},{"lower":"distress"},{"lower":"syndrome"}]}
{"label":"DISEASE","pattern":[{"lower":"ards"}]}
```
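For what it's worth, a patterns file in this format can be generated from a plain term list with the standard library alone; the term list below is just an illustration:

```python
import json

# Illustrative list of disease terms; in practice this would come
# from your own terminology file.
terms = [
    "asthma",
    "acute bronchitis",
    "acute respiratory distress syndrome",
    "ARDS",
]

def term_to_pattern(term, label="DISEASE"):
    """Build a match pattern: one {"lower": ...} dict per whitespace token."""
    return {"label": label,
            "pattern": [{"lower": tok.lower()} for tok in term.split()]}

with open("diseases_terms.jsonl", "w", encoding="utf-8") as f:
    for term in terms:
        f.write(json.dumps(term_to_pattern(term)) + "\n")
```

One caveat: splitting on whitespace only approximates spaCy's tokenisation, so terms containing hyphens or punctuation may need to be tokenised with the `nlp` object instead.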
I have not yet tried using shapes for the abbreviations, as per @ines' suggestion.
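To get a feel for what those shapes would look like before building patterns out of them, here is a rough approximation of spaCy's `token.shape_` attribute in plain Python (a simplification for illustration, not spaCy's actual implementation):

```python
def word_shape(text):
    """Approximate spaCy's token.shape_: uppercase -> 'X', lowercase -> 'x',
    digit -> 'd', other characters kept as-is; runs of the same shape
    character are truncated after four repeats."""
    out = []
    for ch in text:
        if ch.isdigit():
            mapped = "d"
        elif ch.isupper():
            mapped = "X"
        elif ch.islower():
            mapped = "x"
        else:
            mapped = ch
        # cap repeated shape characters at four, like spaCy does
        if len(out) >= 4 and all(c == mapped for c in out[-4:]):
            continue
        out.append(mapped)
    return "".join(out)

print(word_shape("ARDS"))    # -> XXXX
print(word_shape("asthma"))  # -> xxxx
```

A pattern like `{"label": "DISEASE", "pattern": [{"shape": "XXXX"}]}` would then match ARDS and COPD, but also every other four-letter all-caps token, so it over-matches by design; the point of `ner.teach` is precisely to accept or reject such candidates.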
I then tried to annotate a corpus of 15k medical abstracts:
```
{"text": "Severe chronic obstructive pulmonary disease (COPD) is a progressive and debilitating illness characterised by relentless loss of function, intensifying dyspnoea and frequent exacerbations. COPD patients are evidently at increased risk of depression, frailty and death [1, 2]. Predicting individual short-term prognosis and course of events is difficult if not impossible.\n\nAdvance care planning should be part of our clinical routine in severe COPD <http://ow.ly/Cshs30i8FS9>"}
{"text": "The management of idiopathic pulmonary fibrosis (IPF) is complex, as is the process of implementing and assessing a set of quality indicators representing best care practices in IPF by an interstitial lung disease (ILD) programme [1, 2]. To date, there is limited literature documenting the importance of IPF interventions to improve coordination of care, patient engagement in health literacy and education, and understanding what is important to patients [3\u20138]. In 2015, National Jewish Health (NJH) engaged our ILD division healthcare professionals (10 physicians, 4 nurses, 2 medical assistants, 1 physician assistant) and our professional education and biostatistics teams to design and implement a project aimed at measuring key quality indicators and how they may impact clinical practice and IPF patient perception of care.\n\nA successful initiative to improve best care practice in IPF supported by electronic medical record changes <http://ow.ly/ORxi30hBEmy>\n\nThe authors are grateful for the support provided by the interstitial lung disease team at National Jewish Health."}
```
I ran the following command:
```
prodigy ner.teach diseases_ner pubmed_w2v journal_abstract_training_data.jsonl --label DISEASE --patterns diseases_terms.jsonl
```
With either of the two models (the smaller or the bigger one), I get the buffer exception. (By the way, the `ner.teach` recipe does not make direct use of the `to_bytes()` method; it happens at line 86, in `EntityRecognizer`, and I do not know how to override that one, as I cannot read the source. Or can I?)
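In case it is readable after all: recipe scripts usually ship as plain `.py` files inside the installed package, so the install directory can be located and browsed. A minimal sketch using only the standard library (`json` stands in here for the real package name):

```python
import importlib.util
from pathlib import Path

# Locate the directory a package was installed to, so its source files
# can be opened and read. "json" is a stand-in; substitute the package
# you actually want to inspect.
spec = importlib.util.find_spec("json")
package_dir = Path(spec.origin).parent
print(package_dir)
```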
Then I tried the same command with the `en`, `en_core_web_sm` and `en_core_web_lg` models.
This seems to work a little: my diseases and abbreviations are matched really well. The problem here is that, in the best case, I could only do around 80 examples, and I got as far as 43% on the progress bar before Prodigy told me there were no more examples. If I restart, I get the same examples (I tried many times; by now I sort of recognise the articles Prodigy shows me). Anyway, I tried to move forward and ran a batch train:
```
prodigy ner.batch-train diseases_ner_test3 pubmed_w2v --output diseases --label DISEASE --eval-split 0.2 --n-iter 8 --batch-size 6
```
As you suggest in the video, I also increased the batch size when I saw that I could train a little more, but I have very few examples. I get output like this:
```
Loaded model en_core_web_lg
Using 20% of accept/reject examples (7) for evaluation
Using 100% of remaining examples (29) for training
Dropout: 0.2  Batch size: 8  Iterations: 8

BEFORE     0.000
Correct    0
Incorrect  7
Entities   14
Unknown    0

#   LOSS    RIGHT  WRONG  ENTS  SKIP  ACCURACY
01  11.629  0      7      13    0     0.000
02  8.879   1      6      18    0     0.143
03  10.857  3      4      20    0     0.429
04  7.192   3      4      18    0     0.429
05  8.207   4      3      17    0     0.571
06  6.249   5      2      26    0     0.714
07  5.691   6      1      21    0     0.857
08  4.769   6      1      27    0     0.857

Correct    6
Incorrect  1
Baseline   0.000
Accuracy   0.857
```
The accuracy is indeed not bad, and when I give some text to the spaCy NER it does match my diseases, but the model is now quite broken: tokens like "and", "the" and others get labelled as WORK_OF_ART, etc.
I have noticed that the en_core models use 300-dimensional vectors, while the ones I downloaded are 200-dimensional. Would that make a difference? Did I do something wrong? Thank you for your help!
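Regarding the dimensions, for reference: the vector size of a word2vec `.bin` can be checked without loading the whole file, because the binary format starts with an ASCII header line `<vocab_size> <vector_size>`. A minimal sketch (the in-memory file and its numbers are made up for the example):

```python
import io

def w2v_bin_dims(fileobj):
    """Read the header of a word2vec binary file: an ASCII line
    '<vocab_size> <vector_size>\n' precedes the binary vector data."""
    vocab_size, vector_size = (int(x) for x in fileobj.readline().split())
    return vocab_size, vector_size

# Fake in-memory stand-in for e.g. open('PubMed-w2v.bin', 'rb');
# the numbers here are invented for the example.
fake = io.BytesIO(b"1000000 200\n" + b"\x00" * 16)
print(w2v_bin_dims(fake))  # -> (1000000, 200)
```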
Sam