Train multiple NER entities from a blank FR model using fastText vectors

Hi there,

I’m new to the ML world and am trying to understand some of the “basics” of spaCy and Prodigy.

I’m developing an HR application and need to train a model that can identify entities in documents. The documents are only resumes: because of their specific format, I think it would be hard to get good accuracy from a more “general” model.

But I’m not sure about the workflow. I’ve tried several things and had some failures during training (low F-score, or the model forgetting previous entities...).

Can you tell me if this workflow is OK?

1 Download the fastText FR vectors

2 Use python -m spacy init-model fr ./models/kmj_vectors_fr --vectors-loc ../../resources/cc.fr.300.vec.gz to create a blank model with word vectors

3 Use spaCy to add the missing parser/tagger/ner from fr_core_news_md (without this step, it seems that I can’t train NER with complex patterns in Prodigy)

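import spacy

# Load the pretrained French pipeline and grab its trained components
base_nlp = spacy.load("fr_core_news_md")
tagger = base_nlp.get_pipe("tagger")
parser = base_nlp.get_pipe("parser")
ner = base_nlp.get_pipe("ner")

# Load the vectors-only model from step 2 and attach the components (spaCy v2 API)
nlp = spacy.load('./models/kmj_vectors_fr')
nlp.add_pipe(tagger)
nlp.add_pipe(parser)
nlp.add_pipe(ner)
# Save the combined pipeline so Prodigy can load it from disk
nlp.to_disk('./models/kmj_model_fr')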

4 Use prodigy ner.manual with a patterns file to annotate the DEGREE entity from a JSONL file (5k resumes as raw text), saving the annotations to the degree_data dataset
prodigy ner.manual degree_data ./models/kmj_model_fr ./tools/resumes_to_datasource/datasource.jsonl --label DEGREE --patterns degree_patterns.jsonl

5 Use prodigy train to train the model on this entity
prodigy train ner degree_data ./models/kmj_model_fr --eval-split 0.2
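
For reference, here is a minimal sketch of how a token-based patterns file for step 4 could be generated. The degree names are hypothetical and only for illustration; they are not taken from the actual degree_patterns.jsonl. The sketch also assumes that splitting a term on whitespace matches how the tokenizer would split it, which will not hold for every term.

```python
import json

# Hypothetical degree names, used only for illustration.
DEGREE_TERMS = ["master", "licence", "doctorat", "BTS"]


def build_patterns(terms, label="DEGREE"):
    """Return one case-insensitive token pattern per term."""
    patterns = []
    for term in terms:
        # One dict per token; "lower" matches the lowercased token text.
        tokens = [{"lower": word.lower()} for word in term.split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


patterns = build_patterns(DEGREE_TERMS)
patterns_jsonl = to_jsonl(patterns)
print(patterns_jsonl)
```

Each printed line is one pattern, and the output can be saved as the file passed to --patterns.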

First question: is this workflow correct?

Second question:

If I need to add a new entity, let’s say “DATE”, do I have to use the same input JSONL, or can I use any text I want (e.g. Reddit, TSV...)?

If I start annotating the DATE entity and train on it, prodigy train learns the new entity but erases the others. I’ve heard this is the “catastrophic forgetting” problem, but I’m not sure I understand what the problem is. Can you help me understand what I’m doing wrong?

Thanks for your help

EDIT 1:
As I understand it, init-model is used to add vectors to a blank model. Then, if I need to adapt this to my specific context, I can use pretrain on my text with python -m spacy pretrain ./tools/resumes_to_datasource/datasource.jsonl ./models/fr_vectors_lg ./models/fr_resumes_pretrained --use-vectors, and then use prodigy train ner... to train this pretrained model on my first entity type.

To learn another entity type, I'm going to test a few things (I'm currently pretraining my model), like running train ner with both datasets and --label DEGREE,DATE, but I'm not sure this is the right way.
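
As a side note on the input file: both the annotation step and pretrain read raw text from a JSONL file with one object per line and the text under the "text" key. Below is a minimal sketch of producing such a file from a list of resume strings. The sample texts and the texts_to_jsonl helper are made up for illustration.

```python
import json


def texts_to_jsonl(texts):
    """Turn raw texts into JSONL lines of the form {"text": ...}, skipping empty ones."""
    lines = []
    for text in texts:
        if text.strip():
            # json.dumps escapes newlines, so each record stays on a single line.
            lines.append(json.dumps({"text": text}, ensure_ascii=False))
    return "\n".join(lines)


# Made-up sample resumes, for illustration only.
raw_resumes = ["Master Informatique\n2015", "   ", "Développeur Python, Paris"]
jsonl = texts_to_jsonl(raw_resumes)
print(jsonl)
```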

Hi! Your workflow looks good, yes :smiley:

Do you actually need the existing entity recognizer of the French base model, though? Adding entity types to an existing model can make it more difficult to reason about the results and evaluate your model. So if you don't need the existing entity types, I'd start off without an entity recognizer in the base model. If you want to keep one or two, you could use a workflow like ner.correct and have the pretrained model help label those entities for you. This is "cleaner" and gives you data to train on from scratch.

As for using different texts for the new entity: in theory, that's possible, but it's important that you specify that all unannotated tokens are missing values when you train your model (e.g. by setting the --ner-missing flag). If you're not doing that, your DATE annotations will be interpreted as "annotated tokens are DATE, all other tokens are outside an entity". This isn't necessarily true, because your data may also include lots of tokens that belong to your other entity types. So if you tell the model that those aren't entities, it will try to stop predicting them.
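
To make the difference concrete, here is a small illustrative sketch (plain Python, not Prodigy's or spaCy's internal code) that turns token-level spans into BILUO tags. It assumes the usual convention where "O" means "outside any entity" and "-" means "missing / unknown". The example sentence and the spans_to_biluo helper are hypothetical.

```python
def spans_to_biluo(tokens, spans, missing=False):
    """Convert token-index spans into BILUO tags.

    tokens:  list of token strings
    spans:   list of (start_token, end_token_exclusive, label)
    missing: if True, unannotated tokens get "-" (unknown) instead of "O" (outside)
    """
    default = "-" if missing else "O"
    tags = [default] * len(tokens)
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"  # single-token entity
        else:
            tags[start] = f"B-{label}"  # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"  # inner tokens
            tags[end - 1] = f"L-{label}"  # last token
    return tags


tokens = ["Master", "Informatique", "obtenu", "en", "2015"]
# Only DATE was annotated here; the DEGREE on tokens 0-1 was never labelled.
date_only = [(4, 5, "DATE")]

print(spans_to_biluo(tokens, date_only))
# ['O', 'O', 'O', 'O', 'U-DATE']  -> tells the model "Master Informatique" is NOT an entity
print(spans_to_biluo(tokens, date_only, missing=True))
# ['-', '-', '-', '-', 'U-DATE']  -> says nothing about the other tokens
```

In the first case, the model is actively pushed away from predicting DEGREE on the first two tokens; in the second, it only learns from the DATE annotation.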

Catastrophic forgetting could still occur if you're updating your model with new examples but are not including enough examples of what the model previously got right. So that's another reason why it can be beneficial to train from scratch and use a pretrained model to help you label the data instead of updating it. So if you can do that, I think that'd be a good solution for your use case, and it'd give you a standalone gold-standard corpus to train on that doesn't depend on other resources :slightly_smiling_face:
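
One way to picture a standalone corpus is a single set of examples in which every text carries all of its entity types at once. The sketch below is a simplified illustration of that idea, not how Prodigy merges data internally. It assumes annotations are dicts with "text" and "spans" keys and that separate annotation passes were made over the same texts; the sample data and the merge_annotations helper are made up.

```python
def merge_annotations(*datasets):
    """Combine span annotations made on identical texts into one example per text."""
    merged = {}
    for dataset in datasets:
        for example in dataset:
            spans = merged.setdefault(example["text"], [])
            for span in example.get("spans", []):
                if span not in spans:  # skip exact duplicates
                    spans.append(span)
    return [
        {"text": text, "spans": sorted(spans, key=lambda s: s["start"])}
        for text, spans in merged.items()
    ]


# Made-up annotations of the same text from two separate labelling passes.
degree_data = [
    {"text": "Master en 2015", "spans": [{"start": 0, "end": 6, "label": "DEGREE"}]}
]
date_data = [
    {"text": "Master en 2015", "spans": [{"start": 10, "end": 14, "label": "DATE"}]}
]

corpus = merge_annotations(degree_data, date_data)
print(corpus)
```

With every label present on every text, training from scratch sees DEGREE and DATE together, so there is nothing to forget.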

Hi, thanks for the help!

In fact, the French model uses a MISC entity label which is, in my opinion, useless, because it doesn’t mean anything and is unusable as is. That’s one reason why I want to train my own model.

But there is so much to do. If I create my own model, I lose all POS and DEP (not sure about DEP), which is not a problem for just parsing a resume, but the next step I want to work on is a TextCategorizer, and I think I will need them.
For the moment, I’m trying a new NER training on fr_core_news_md with ner.manual to completely override the model’s default entities and set mine (basically my own labels plus GPE, DATE and ORG, which I have to retrain, and no training on MISC).

I was only suggesting that you train the entity recognizer from scratch – you can keep the parser and tagger components, I definitely wouldn't recommend training those from scratch.

So if you want to add your own categories, but keep GPE, DATE and ORG, you could use ner.correct with --label GPE,DATE,ORG,DEGREE and have the pretrained model highlight the entities it already knows for you. And then all you have to do is add your DEGREE entities and correct the model's mistakes.

Oh great, thanks for these tips!!!

Just to check: the difference between ner.manual and ner.correct is that ner.correct uses model predictions instead of patterns, right?

Yes, exactly :slightly_smiling_face:

Perfect, thanks for all the help.

I have a lot of work to do now (my little finger hurts from SHIFT + 1-2-3 for NER :p)!!!

You can remap them to other keys if you want to and there's a combination you find more comfortable :sweat_smile: See here: https://prodi.gy/docs/api-web-app#actions-custom-labels


Oh yeah, can we remap by entity order? In fact, I want to be able to select labels by just pressing 1, 2, 3 instead of SHIFT+1, SHIFT+2.

It seems like this should be possible, but in Prodigy it just adds the labels 1, 2, 3... to my text.

My config:

{
"keymap_by_label": {
    "1": "&",
    "2": "é",
    "3": "\"",
    "4": "'",
    "5": "(",
    "6": "§",
    "7": "è",
    "8": "!",
    "9": "ç",
    "0": "à"
  }
}

(I'm using a French keyboard)

So the default should be just the number keys – like 1 for the first label, and so on.

If you want to assign custom keys for your NER labels, you should use the regular labels when you start the server (like DATE) and then have the config map those labels to keys. For instance, "DATE": "d" to map that label to the d key.

We chose this approach because more often, people want to map specific labels to specific keys. If you want to just do it in order, you could use a slightly modified version of the ner.manual recipe and add the keymap to the config returned by your recipe. And then do something like this to map them automatically based on the order:

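# Keys in label order (unshifted AZERTY number row, as in the config above)
keys_in_order = ["&", "é", "\"", "'"]  # etc.
# "label" is assumed to be the list of label names passed to the recipe
keymap_by_label = {label_name: keys_in_order[i] for i, label_name in enumerate(label)}

# At the end of the recipe function:
return {
    # etc.
    "config": {"labels": label, "keymap_by_label": keymap_by_label}
}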
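
A self-contained version of the same idea, as a rough sketch: the key list is the unshifted AZERTY number row from the config posted above, while the label list and the build_keymap helper are hypothetical. It also raises an error if there are more labels than keys, instead of failing with an IndexError.

```python
def build_keymap(labels, keys_in_order):
    """Map each label to the key at the same position in keys_in_order."""
    if len(labels) > len(keys_in_order):
        raise ValueError("Not enough keys for all labels")
    return dict(zip(labels, keys_in_order))


# Unshifted number row on a French AZERTY keyboard (keys 1 to 0).
AZERTY_NUMBER_ROW = ["&", "é", '"', "'", "(", "§", "è", "!", "ç", "à"]

# Example labels, in the order they would be passed via --label.
labels = ["DEGREE", "DATE", "ORG", "GPE"]

keymap_by_label = build_keymap(labels, AZERTY_NUMBER_ROW)
config = {"labels": labels, "keymap_by_label": keymap_by_label}
print(config)
```

The resulting config dict is what a modified recipe would return under its "config" key.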

Thanks for this little recipe. I completely understand the choice of mapping only by entity label.

I’m going to look into the recipe system. For the moment, I’ve used direct label mapping and the keypad on my gaming mouse to tag entities; it’s a lot faster and doesn’t need a second hand ^^

Ah cool, that's next level! I've always liked the idea of Prodigy being "played" :sweat_smile:

(Slightly OT, but speaking of gaming: If you want the UI to match the gaming experience, here's a custom Prodigy skin that I built for fun, inspired by a user's comment: https://twitter.com/_inesmontani/status/1207354911341662210 :joy:)


LoL, nice custom skin, but if I use it, my employees are just going to think I'm actually playing :smiley:
