Hi there,
I’m new to the ML world and am trying to understand some of the “basics” of spaCy and Prodigy.
I’m developing an application for HR and need to train a model to identify entities in documents (resumes only: because of their specific format, I think a more “general” model would struggle to reach good accuracy).
But I’m not sure about the workflow. I’ve tried several things and had some failures during training (low F-score, or the model forgetting previous entities...).
Can you tell me if this workflow is OK?
1 Download the fastText FR vectors
2 Use python -m spacy init-model fr ./models/kmj_vectors_fr --vectors-loc ../../resources/cc.fr.300.vec.gz
to create a blank model with word vectors
3 Use spaCy to add the missing parser/tagger/ner components from fr_core_news_md (without this step, it seems that I can’t train NER with complex patterns in Prodigy)
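import spacy

# Load the pretrained French pipeline and grab its components
base_nlp = spacy.load("fr_core_news_md")
tagger = base_nlp.get_pipe("tagger")
parser = base_nlp.get_pipe("parser")
ner = base_nlp.get_pipe("ner")
# Load the vectors-only model from step 2 and attach the components
nlp = spacy.load('./models/kmj_vectors_fr')
nlp.add_pipe(tagger)
nlp.add_pipe(parser)
nlp.add_pipe(ner)
# Save the combined pipeline for use in Prodigy
nlp.to_disk('./models/kmj_model_fr')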
4 Use prodigy ner.manual with a patterns file to annotate the DEGREE entity from a JSONL file (5k resumes as raw text), with degree_data as the dataset
prodigy ner.manual degree_data ./models/kmj_model_fr ./tools/resumes_to_datasource/datasource.jsonl --label DEGREE --patterns degree_patterns.jsonl
5 Use prodigy train to train the model on this entity
prodigy train ner degree_data ./models/kmj_model_fr --eval-split 0.2
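For reference, here is a minimal sketch of the two input files that step 4 relies on: a JSONL source with one "text" entry per line, and a patterns file with token-based patterns for the DEGREE label. The resume snippets, the pattern tokens, and the helper names write_jsonl and read_jsonl are illustrative assumptions, not the real data or part of any library.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line (JSONL layout)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical resume snippets; the real source would be the 5k raw-text resumes.
resumes = [
    {"text": "2015 - 2017 : Master en informatique, Université de Lyon"},
    {"text": "Licence professionnelle en gestion des ressources humaines"},
]

# Hypothetical token-based match patterns for the DEGREE label.
degree_patterns = [
    {"label": "DEGREE", "pattern": [{"lower": "master"}]},
    {"label": "DEGREE", "pattern": [{"lower": "licence"}, {"lower": "professionnelle"}]},
]

# Write both files to a temporary directory so the sketch has no side effects.
out_dir = Path(tempfile.mkdtemp())
source_path = out_dir / "datasource.jsonl"
patterns_path = out_dir / "degree_patterns.jsonl"
write_jsonl(source_path, resumes)
write_jsonl(patterns_path, degree_patterns)

loaded_patterns = read_jsonl(patterns_path)
print(len(read_jsonl(source_path)), "texts,", len(loaded_patterns), "patterns")
```

Reading the files back before starting an annotation session is a cheap way to catch malformed lines early.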
First question: is this workflow correct?
Second question:
If I need to add a new entity, let’s say “DATE”, do I have to use the same input JSONL, or can I use any text I want (e.g. Reddit, TSV...)?
If I try to use ner.train to start on the DATE entity, prodigy train learns the new entity but erases the others. I’ve heard that this is the “catastrophic forgetting” problem, but I’m not sure I understand what the problem is. Can you help me understand what I’m doing wrong?
Thanks for your help
EDIT 1:
As I understand it, init-model is used to add vectors to a blank model. Then, if I need to adapt this to my specific context, I can run pretrain on my text with python -m spacy pretrain ./tools/resumes_to_datasource/datasource.jsonl ./models/fr_vectors_lg ./models/fr_resumes_pretrained --use-vectors
, and then use prodigy train ner...
to train this pretrained model on my first NER entity.
To learn another entity, I'm going to test a few things (I'm currently pretraining my model), such as training ner with both datasets and --label DEGREE,DATE, but I'm not sure this is the right approach.
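To make the "both datasets" idea concrete, a minimal sketch of combining the annotations could look like the following. It assumes each annotated example is a dict with "text", "answer" and character-offset "spans", and that the same text appears in both datasets. The helper merge_annotations and the sample data are hypothetical, and this is not a description of what prodigy train does internally.

```python
from collections import defaultdict


def merge_annotations(*datasets):
    """Group accepted spans from several datasets by their source text."""
    spans_by_text = defaultdict(list)
    for dataset in datasets:
        for example in dataset:
            # Skip rejected or ignored answers; treat a missing answer as "accept".
            if example.get("answer", "accept") != "accept":
                continue
            spans_by_text[example["text"]].extend(example.get("spans", []))
    return [
        {"text": text, "spans": sorted(spans, key=lambda s: s["start"])}
        for text, spans in spans_by_text.items()
    ]


# Hypothetical annotations of the same text in two separate datasets.
text = "Master en informatique, 2017"
degree_data = [
    {"text": text, "answer": "accept",
     "spans": [{"start": 0, "end": 22, "label": "DEGREE"}]},
]
date_data = [
    {"text": text, "answer": "accept",
     "spans": [{"start": 24, "end": 28, "label": "DATE"}]},
]

merged = merge_annotations(degree_data, date_data)
for span in merged[0]["spans"]:
    print(span["label"], repr(text[span["start"]:span["end"]]))
```

The point of the sketch is that each training example carries the spans for every label, so the model sees DEGREE and DATE together instead of only the newest entity.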