Advice on training NER models with new entities

I want to train an NER model that I will use to recognize three entities (company (or ORG), time period (possibly DATE), and LOCATION) on lines of text extracted from people's CVs, mainly from the experience section. So I would like to hear your opinion on this: is it a good idea to just start with the ner.teach recipe and one of the pre-trained models (e.g. en_core_web_lg), or do I first need to train some terms for companies and feed it a list of IT companies? Should I introduce a new label COMPANY or go with the standard ORG?
And finally, for the period of time: I need something like e.g. 12.2012 - X.2018 to be recognized as a period of time, and also '1 year and 6 months'. Should we go with the DATE entity here or train a new one?

This was Matthew's answer from a direct email exchange, before I knew about this forum:

To answer your question, problems do differ, so it's hard to tell whether ner.teach will be best. I would say using the ner.make-gold recipe to get an evaluation set will be a good first step. Then you can check the quality of the current model on your data, and as you try different ways of improving the accuracy, you'll have a repeatable experiment.

I think training a new class for the periods of time will be useful, as otherwise you'll conflict with the DATE definition in subtle ways. Note that ranges of time are actually very complex! You might end up needing to recognise the start and end point separately.

Since I have additional questions, and the community might also benefit from this discussion, I decided to post them here.

  1. I used the ner.make-gold recipe with all the entities (ORG, DATE, GPE, LOC, POSITION) and got back a dataset with 2,000 annotations. Then, using ner.batch-train with that gold dataset, I got an accuracy of 67%, but I did not understand how to use it as an evaluation set and how to proceed further. During the annotation I got the impression that the model was doing well with ranges of dates and often got them right, but with POSITION not so much, and it often failed even in the same situation (perhaps because at that point I hadn't yet added the new entity that I talk about in the second question below).

  2. I wanted to add another entity like ROLE (or POSITION) and train it using the terms.teach recipe with word vectors (e.g. a large spaCy model) to create a terminology list of examples of the new entity type, starting off with seed terms like "project manager", "systems analyst", "software engineer", "data engineer", etc. I followed the lecture "Training a new entity type on Reddit comments" where you trained the DRUG entity, but in my case I keep getting only one-word suggestions. They are relevant, but I would also expect two-word suggestions (the command I'm running is sketched below).
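
Roughly the command I'm running (the dataset name is just a placeholder):

prodigy terms.teach role_terms en_vectors_web_lg --seeds "project manager, systems analyst, software engineer, data engineer"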

Here are some examples of the data we use:
{"text": "Master of Business (Strategic management) with multidisciplinary skills and over 15 years experience"}
{"text": "in Financial Services. Strong background in project management (PRINCE2-certified) and"}
{"text": "requirements engineering."}
{"text": "Currently working toward certification as Data Protection Officer (DPO) EU GDPR."}
{"text": "Januar 2017 - Present"}
{"text": "Requirements Engineer AEI /Tax Reporting at Credit Suisse"}

Thanks

Hi Egzon,

Thanks for posting this here, it's definitely useful if the discussions are publicly viewable.

You can pass an evaluation dataset to the ner.batch-train command using the --eval-id argument. You'll need to have a separate data set that you use for training, as well.
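
For example (cv_train and cv_eval are just illustrative dataset names):

prodigy ner.batch-train cv_train en_core_web_lg --output cv_model --n-iter 10 --eval-id cv_eval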

The default vectors in en_vectors_web_lg only have single-word entries, so you need some other way to get multi-word terms. If you train a model with terms.train-vectors, you can use the --merge-ents and --merge-nps arguments, which will give you multi-word vectors. Alternatively, you can also edit the patterns.jsonl file that terms.to-patterns produces and add your own patterns to it. For instance, you might suggest as a pattern any sequence of capitalised words, or any two-word phrase that starts with a particular word, etc. You can read more about the rule-based matching syntax here: Linguistic Features · spaCy Usage Documentation
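
For instance, entries in patterns.jsonl could look like this (a sketch; the ROLE label and the token attributes are just examples):

{"label": "ROLE", "pattern": [{"lower": "project"}, {"lower": "manager"}]}
{"label": "ROLE", "pattern": [{"lower": "data"}, {"is_alpha": true}]}

The second pattern would match any two-token phrase starting with "data" followed by an alphabetic token, e.g. "data engineer" or "data scientist".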

Thank you so much for the speedy reply.

Regarding terms.train-vectors: I saw in the recipes that it requires something like 10 million words, and when I tried your suggestion to use the --merge-ents and --merge-nps arguments (as below), it didn't show me any two-word suggestions, maybe because I don't have anywhere near the required amount of words (I have about 10k words).

prodigy terms.train-vectors prodigy_models/exp_model work_experience1.jsonl --spacy-model en_core_web_sm --merge-nps --merge-ents

Do you think the problem is the small input set? If yes, could it work reliably with e.g. 100k words? For now, I don't think I will manage to get more than that amount.

What type of text are you working with? Just find another source of text that’s at least a little bit related. 10,000 words is definitely too little to train vectors, you’ll be better off even just using text from Wikipedia, if nothing else.

I'm working with CVs from the IT field. I found another source of text, but now I have another question.
The text is extracted from PDF documents (using the pdfminer library), meaning it comes out as lines. So what would be the best input format for the terms.train-vectors recipe: joining the lines into a single long string, or passing them as lines extracted from the PDF, sometimes without context?

I don’t think the word2vec process is really sensitive to sentences, so it shouldn’t matter so much if you merge whole documents into one line. You might want to check that there are no sub-document logical elements to separate the text into, though. For instance, paragraphs, sections etc might be preserved as a sentence-like unit. This would be useful for further processing, as it’ll save you from displaying a confusing wall of text when you go to do the annotation.
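
For example, merging the split lines from the sample above back into one paragraph-level JSONL entry would look like this:

{"text": "Master of Business (Strategic management) with multidisciplinary skills and over 15 years experience in Financial Services. Strong background in project management (PRINCE2-certified) and requirements engineering."}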

  1. Am I wrong that the purpose of training your own vectors/model using the terms.train-vectors recipe is to create terminology lists and patterns from the vectors (terms.teach and terms.to-patterns) that you can later use in the ner.teach recipe?
    - If you already have a list of terms and patterns to use, would such a step be unnecessary?

  2. Based on the Cookbook section "Creating gold-standard annotations": to annotate from scratch entities that are not present in the model (which is my case), do we need to use the ner.manual recipe instead of ner.make-gold?

  3. When creating gold-standard annotations (using ner.make-gold or ner.manual), should we do it on the same data source that we use for ner.teach, and with the same number of annotations, in order to use it as an evaluation set?

Yes, that's correct!

(One quick note on vectors: If you do end up with good vectors for your domain, using them in the base model can sometimes improve accuracy. If you're training a spaCy model and vectors are available in the model, they'll be used during training.)

Yes, that's correct. ner.make-gold can only pre-highlight entities that are predicted by the model, so this only works if the model already knows the entity type. If some of your entity types are already present in the model and others aren't, you could also combine the two recipes: start by annotating the existing labels with ner.make-gold, export the data, load it into ner.manual and add the new labels on top. How you do that depends on what's most efficient for your use case.
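
A rough sketch of that combined workflow (dataset, model and file names are just illustrative):

prodigy ner.make-gold cv_gold en_core_web_lg cv_lines.jsonl --label "ORG,DATE,GPE,LOC"
prodigy db-out cv_gold > cv_gold.jsonl
prodigy ner.manual cv_gold_full en_core_web_lg cv_gold.jsonl --label "ORG,DATE,GPE,LOC,POSITION"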

Training and evaluation examples should ideally be drawn from the same data source, yes. The examples should also be representative of what your model will see at runtime – for example, if you're processing short paragraphs at runtime, you also want to evaluate the model on short paragraphs (and not, say, short sentences only). Also double-check that there's no overlap between the training and evaluation examples – even single examples can often lead to pretty distorted results. 20-50% of the number of training examples is usually a good amount – if you have under a thousand examples for evaluation, you might have to take the evaluation results with a grain of salt.


Thank you for the detailed answers.

One quick note on vectors: If you do end up with good vectors for your domain, using them in the base model can sometimes improve accuracy.

Can you give me an example of how I could use them? Right now it's unclear to me how I could use un-annotated word vectors in a model later. (Maybe point me to a post where that has been discussed.)

Maybe this should be a different topic: I have used the StanfordNLP NER with their pretrained vectors, and I saw that they have many more entity types in their pretrained NER models that are very useful (at least to me): TITLE, DURATION (e.g. 2018-2019), CITY, COUNTRY, etc. I am wondering: could we use theirs and build on them in Prodigy?

From your experience, if we have to build an NER model that has to detect around 5 new entities that aren't in the pretrained models, how many annotations would be the minimum to guarantee solid performance? (I did 3k using Prodigy, but it wasn't even close.) Even though, during training, I got the impression that it was actually learning, based on the scores and suggestions...

Yes, see here for an example. spaCy comes with an init-model command that initializes a new model with existing vectors:
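
A minimal sketch, assuming your vectors are in a format init-model accepts (e.g. word2vec text format) and using illustrative file, dataset and directory names: initialise a base model from the vectors, then pass that model to the Prodigy recipes and training commands.

python -m spacy init-model en ./base_model --vectors-loc cv_vectors.txt.gz
prodigy ner.batch-train cv_train ./base_model --output cv_model --eval-id cv_eval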

In general, yes. The easiest way would be to use a manual recipe like ner.manual to collect the annotations using the same label scheme as the pre-trained model. When you're done, you can then export your data and use it to update the model. Prodigy's format is pretty straightforward JSONL using character offsets for NER, so that should be pretty easy to convert to whichever other format you need.
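
For reference, an exported example in Prodigy's NER format looks roughly like this (simplified, with metadata fields omitted):

{"text": "Requirements Engineer AEI /Tax Reporting at Credit Suisse", "spans": [{"start": 0, "end": 21, "label": "TITLE"}]}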

If you want to annotate with a model in the loop, this might be a bit trickier. There are Python wrappers for CoreNLP that'd let you stream in predictions, but updating in the loop might not be easily possible. In order for this to work well, the model needs to be quite sensitive to small updates (but not too sensitive either), which is not usually how models like this are implemented. spaCy's models and Prodigy's wrappers for them are more optimised for that. And if you're implementing your own model in, say, PyTorch, that's also easier, because you can develop it with that requirement in mind.

When you ran your experiments, did you use a blank model or did you start off with an existing pre-trained model? If you use a pre-trained model and add a lot of new categories, you might be fighting a lot of side-effects of the existing weights and trying to reconcile the new labels with what the model already knows. This is especially true if the new categories potentially overlap with the old ones – for example, if you're annotating CITY and the model already predicts those entities as GPE, or if you're labelling DURATION but the model already predicts "2018" and "2019" separately as DATE.

Trying to teach an existing model a completely new definition of an entity isn't always very predictable. So if you're just labelling and training one of the categories separately, it might look good and the model might converge. But once you put them all together with a pre-trained base model, the existing weights can interfere and you'll end up with much worse results overall.

So you might actually be better off starting from scratch, with your ideal label scheme. For this, I'd suggest maybe doing a few thousand gold-standard (sentence) annotations with all labels (e.g. using ner.manual – instead of annotating all labels at once, you can make several passes over the data and keep adding more labels as you go). Then, start off with a blank model, pre-train it with the data and see how you go. Once you have your custom pre-trained model, you can start ner.teach and see if you're able to improve the model with binary feedback. If it's not working well yet, you can go back, add more gold-standard annotations to your initial set, pre-train again and so on.


Yes, I've used en_core_web_lg:
prodigy ner.teach personal_info_ner en_core_web_lg data_complete.jsonl --label "EMAIL, ADDRESS, NAME, BIRTH_DATE, PHONE_NUMBER, SOCIAL_MEDIA" --patterns ~/[dir]/personal_info_patterns.jsonl
personal_info_patterns.jsonl (5.4 KB)

As you can see in the command above, I start immediately with all the categories together... is this a problem?

Oh, I wasn't familiar with this way of using ner.manual to start annotation, other than for creating the gold dataset.
So the workflow you are suggesting is:

  • Start with the ner.manual recipe using only one label, e.g. prodigy ner.manual [dataset] [spacy_model] [source] --label 'ADDRESS'.
    For the [spacy_model], should I use one of the pre-trained models like en_core_web_sm/md/lg, or the model that I get from the terms.train-vectors recipe?

  • Train using: ner.batch-train [dataset] [spacy_model or generated from terms.train-vectors] --output modelX

  • ner.teach [dataset] [spacy_model or terms.train-vectors output or modelX] [source] --label ALL_LABELS [WITH OR WITHOUT PATTERNS]

  • If it's still not working, go back to step one.

Is this right?

P.S. I can't remember the reason why I introduced a new entity NAME instead of using PERSON; maybe I thought it wasn't working well with full names...

Sorry, I must have missed your reply!

Yes, but you probably want to do this several times for all the labels you need, so you can actually train with all labels at the same time. You can do this in one dataset, or in separate ones that you merge later on. Separate datasets would allow you to run separate experiments (e.g. "Does the model get better if I do not include label X? Or is label Y the problem?"), but it might be overkill for your situation.

In ner.manual, the model is only used for tokenization, so it doesn't matter that much (assuming you're not changing the tokenization rules). But you might as well use your base model with vectors, just for consistency, because that's also the model you'll be training and updating later on. If you've been using that the whole time, it's easier to remember that this should be the base model.

Here, you want to be using the model you've pre-trained with ner.batch-train. After all, the goal is to see if your pre-trained model is solid enough to be improved in the loop.

If you were trying to add NAME on top of a model that already predicted PERSON for most of those tokens, that would definitely explain some of the issues you came across.


Hi @ines, thanks again for the detailed response...

Yes, I will definitely try to go with separate datasets.

I have a weird problem when I try to use my vectors in ner.batch-train. After I train my vectors:

prodigy terms.train-vectors model1 source.jsonl --spacy-model en_core_web_sm --merge-nps --merge-ents

then I try to use the resulting model in ner.batch-train:

prodigy ner.batch-train personal_info_all_merged model1 -o personal_info_model4 --n-iter 10 --eval-split 0.2 --dropout 0.2

Loaded model1
Using 20% of accept/reject examples (878) for evaluation
Traceback (most recent call last):
  File "/home/.../anaconda2/envs/work_env3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/.../anaconda2/envs/work_env3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/.../anaconda2/envs/work_env3/lib/python3.7/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 253, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/.../anaconda2/envs/work_env3/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/.../anaconda2/envs/work_env3/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/.../anaconda2/envs/work_env3/lib/python3.7/site-packages/prodigy/recipes/ner.py", line 426, in batch_train
    examples = list(split_sentences(model.orig_nlp, examples))
  File "cython_src/prodigy/components/preprocess.pyx", line 38, in split_sentences
  File "cython_src/prodigy/components/preprocess.pyx", line 143, in prodigy.components.preprocess._add_tokens
KeyError: 85

  • Is this related to the fact that, when building the dataset used here in ner.batch-train, I used different vectors (en_core_web_sm) instead of the generated vectors?

Ah, sorry, I think this is related to the following issue:

We haven't shipped the fix for this yet, but setting --unsegmented on ner.batch-train should do the trick.
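
So, something like this (your command from above, plus the extra flag):

prodigy ner.batch-train personal_info_all_merged model1 -o personal_info_model4 --n-iter 10 --eval-split 0.2 --dropout 0.2 --unsegmented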
