Advice on training NER models with new entities


(Egzon Syka) #1

I want to train an NER model that I will use to recognize 3 entities (company (or ORG), time period (possibly DATE), and LOCATION) on lines of text extracted from people’s CVs, mainly from the experience section. I would like to hear your opinion on this: is it a good idea to just start with the ner.teach recipe and one of the pretrained models (e.g. en_core_web_lg), or do I first need to train some terms for the company type, e.g. feed it with IT company names? Should I introduce a new label COMPANY or go with the standard ORG?
And finally, for the period of time, I need something like e.g. 12.2012 - X.2018 to be recognized as a period of time, and ‘1 year and 6 months’ to be recognized as well. Should we go with the DATE entity here or train a new one?

This was Matthew’s answer in a direct email exchange, before I knew about this forum:

To answer your question, problems do differ, so it’s hard to tell whether ner.teach will be best. I would say using the ner.make-gold recipe to get an evaluation set will be a good first step. Then you can check the quality of the current model on your data, and as you try different ways of improving the accuracy, you’ll have a repeatable experiment.

I think training a new class for the periods of time will be useful, as otherwise you’ll conflict with the DATE definition in subtle ways. Note that ranges of time are actually very complex! You might end up needing to recognise the start and end point separately.

Since I have additional questions, and the community might also benefit from this discussion, I decided to post them here.

  1. I used the ner.make-gold recipe with all the entities (ORG, DATE, GPE, LOC, POSITION) and got a dataset with 2,000 annotations in return. Then, using ner.batch-train with the gold dataset, I got an accuracy of 67%, but I did not understand how to use it as an evaluation set and proceed further. During the annotation I got the impression that the model was doing well with ranges of dates and got them right many times, but not so much with POSITION, where it often failed even in the same situation (perhaps because at this point I hadn’t added the new entity type that I talk about in the second question below).

  2. I wanted to add another entity type like ROLE (or POSITION) and train it using the terms.teach recipe with word vectors (e.g. a large spaCy model) to create a terminology list of examples of the new entity type, starting off with seed terms like “project manager”, “systems analyst”, “software engineer”, “data engineer”, etc. I followed the tutorial “Training a new entity type on Reddit comments”, where you trained the DRUG entity, but in my case I keep getting only one-word suggestions. They are relevant, but I would expect two-word suggestions as well.
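
For reference, a terms.teach call of roughly this shape (the dataset name is just a placeholder):

prodigy terms.teach role_terms en_core_web_lg --seeds "project manager, systems analyst, software engineer, data engineer"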

Here are some examples of the data that we use:
{"text": "Master of Business (Strategic management) with multidisciplinary skills and over 15 years experience"}
{"text": "in Financial Services. Strong background in project management (PRINCE2-certified) and"}
{"text": "requirements engineering."}
{"text": "Currently working toward certification as Data Protection Officer (DPO) EU GDPR."}
{"text": "Januar 2017 - Present"}
{"text": "Requirements Engineer AEI /Tax Reporting at Credit Suisse"}

Thanks


(Matthew Honnibal) #2

Hi Egzon,

Thanks for posting this here, it’s definitely useful if the discussions are publicly viewable.

You can pass an evaluation dataset to the ner.batch-train command using the --eval-id argument. You’ll need to have a separate data set that you use for training, as well.
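
As a rough sketch, that workflow could look something like this (dataset names and label sets here are illustrative, not taken from your setup):

prodigy ner.make-gold cv_eval en_core_web_lg cv_lines.jsonl --label ORG,DATE,GPE,LOC
prodigy ner.teach cv_train en_core_web_lg cv_lines.jsonl --label ORG,DATE
prodigy ner.batch-train cv_train en_core_web_lg --eval-id cv_eval

The important part is that cv_train and cv_eval are two separate datasets: the gold one is only used for evaluation, via --eval-id.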

The default vectors in en_vectors_web_lg only have single-word entries, so you need some other way to add multi-word entities to the text. If you train a model with terms.train-vectors, you can use the --merge-ents and --merge-nps arguments, which will give you multi-word vectors. Alternatively, you can also edit the patterns.jsonl file that terms.to-patterns produces and add your own patterns to it. For instance, you might suggest as a pattern any sequence of capitalised words, or any two-word phrase that starts with a particular word, etc. You can read more about the rule-based matching syntax here: https://spacy.io/usage/linguistic-features#section-rule-based-matching
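
For illustration, hand-written entries in patterns.jsonl could look roughly like this (the ROLE label and the specific tokens are just examples): the first pattern matches any two capitalised words in a row, the second matches the word “data” followed by any other word.

{"label": "ROLE", "pattern": [{"is_title": true}, {"is_title": true}]}
{"label": "ROLE", "pattern": [{"lower": "data"}, {"is_alpha": true}]}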


(Egzon Syka) #3

Thank you so much for the speedy reply.

Regarding terms.train-vectors: I saw in the recipes that it requires something like 10 million words. When I tried your suggestion to use the --merge-ents and --merge-nps arguments (as below), it didn’t show me any two-word suggestions, maybe because I don’t have anywhere near the required amount of words (I have about 10k words).

prodigy terms.train-vectors prodigy_models/exp_model work_experience1.jsonl --spacy-model en_core_web_sm --merge-nps --merge-ents

Do you think the problem is the small input set? If yes, could it work reliably with e.g. 100k words? For now, I don’t think I will manage to collect more than that amount.


(Matthew Honnibal) #4

What type of text are you working with? Just find another source of text that’s at least a little bit related. 10,000 words is definitely too little to train vectors; you’ll be better off even just using text from Wikipedia, if nothing else.


(Egzon Syka) #5

I’m working with CVs from the IT field. I found another source of text, but now I have another question.
The text is extracted from PDF documents (using the pdfminer library), which means it comes out as separate lines. What would be the best input format for the terms.train-vectors recipe: joining the lines into a single long string, or passing them as the lines extracted from the PDF, sometimes without context?


(Matthew Honnibal) #6

I don’t think the word2vec process is really sensitive to sentences, so it shouldn’t matter so much if you merge whole documents into one line. You might want to check that there are no sub-document logical elements to separate the text into, though. For instance, paragraphs, sections etc might be preserved as a sentence-like unit. This would be useful for further processing, as it’ll save you from displaying a confusing wall of text when you go to do the annotation.
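
As a minimal sketch of that idea (not something discussed in this thread, and assuming blank lines in the pdfminer output mark the boundaries between logical blocks), the extracted lines could be grouped into paragraph-sized JSONL records before being passed to terms.train-vectors:

import json

def lines_to_jsonl(lines, out_path):
    # Group consecutive non-empty lines into one paragraph-level record.
    paragraph = []
    with open(out_path, "w", encoding="utf8") as f:
        for line in lines:
            line = line.strip()
            if line:
                paragraph.append(line)
            elif paragraph:
                f.write(json.dumps({"text": " ".join(paragraph)}) + "\n")
                paragraph = []
        if paragraph:
            f.write(json.dumps({"text": " ".join(paragraph)}) + "\n")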


(Egzon Syka) #7
  1. Am I right in thinking that the purpose of training your own vectors/model using the terms.train-vectors recipe is to create terminology lists and patterns from the vectors (via terms.teach and terms.to-patterns) that you can later use in the ner.teach recipe?
    - If you already have a list of terms and patterns to use, would that step be unnecessary?

  2. Based on the Cookbook section “Creating gold-standard annotations”: to annotate from scratch entity types that are not present in the model (which is my case), do we need to use the ner.manual recipe instead of ner.make-gold?

  3. When creating gold-standard annotations (using ner.make-gold or ner.manual), should we do it on the same data source that we use for ner.teach, and with the same number of annotations, in order to use it as an evaluation set?


(Ines Montani) #8

Yes, that’s correct!

(One quick note on vectors: If you do end up with good vectors for your domain, using them in the base model can sometimes improve accuracy. If you’re training a spaCy model and vectors are available in the model, they’ll be used during training.)

Yes, that’s correct. ner.make-gold can only pre-highlight entities that are predicted by the model, so this only works if the model already knows the entity type. If some of your entity types are already present in the model and others aren’t, you could also combine the two recipes: start by annotating the existing labels with ner.make-gold, export the data, load it into ner.manual and add the new labels on top. How you do that depends on what’s most efficient for your use case.
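
As a rough sketch (dataset and file names are illustrative), that combination could look like:

prodigy ner.make-gold cv_gold en_core_web_lg cv_lines.jsonl --label ORG,DATE,GPE,LOC
prodigy db-out cv_gold > cv_gold.jsonl
prodigy ner.manual cv_gold_full en_core_web_lg cv_gold.jsonl --label ORG,DATE,GPE,LOC,POSITION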

Training and evaluation examples should ideally be drawn from the same data source, yes. The examples should also be representative of what your model will see at runtime – for example, if you’re processing short paragraphs at runtime, you also want to evaluate the model on short paragraphs (and not, say, short sentences only). Also double-check that there’s no overlap between the training and evaluation examples – even single examples can often lead to pretty distorted results. 20-50% of the number of training examples is usually a good amount – if you have under a thousand examples for evaluation, you might have to take the evaluation results with a grain of salt.