I want to train an NER model to recognize three entities (company or ORG, time period (possibly DATE), and LOCATION) in lines of text extracted from people's CVs, mainly from the experience section. I would like to hear your opinion on this: is it a good idea to just start with the ner.teach recipe and one of the pretrained models (e.g. en_core_web_lg), or do I first need to train on some terms, for example by seeding the company entity with names of IT companies? Should I introduce a new COMPANY label or go with the standard ORG?
And finally, for the period of time, I need something like 12.2012 - X.2018 to be recognized as a period of time, and '1 year and 6 months' as well. Should we use the DATE entity here or train a new one?
This was Matthew's answer in a direct email exchange, before I knew about this forum:
To answer your question, problems do differ, so it's hard to tell whether ner.teach will be best. I would say using the ner.make-gold recipe to get an evaluation set will be a good first step. Then you can check quality of the current model on your data, and as you try different ways of improving the accuracy, you'll have a repeatable experiment.

I think training a new class for the periods of time will be useful, as otherwise you'll conflict with the DATE definition in subtle ways. Note that ranges of time are actually very complex! You might end up needing to recognise the start and end point separately.
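As an illustration of the evaluation-set idea (this sketch is not part of Matthew's reply), one way to get a repeatable experiment is to keep the gold annotations aside and score each model's predictions against them at span level. The snippet below is a minimal sketch in plain Python, assuming each example is a list of (start_char, end_char, label) tuples; the function name `span_prf`, the `PERIOD` label, and the sample spans are hypothetical, and this is not Prodigy's own scoring code.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, predicted):
        gold_set, pred_set = set(gold_spans), set(pred_spans)
        tp += len(gold_set & pred_set)   # spans the model got exactly right
        fp += len(pred_set - gold_set)   # predicted spans not in the gold set
        fn += len(gold_set - pred_set)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted spans for two CV lines:
#   "Januar 2017 - Present"
#   "Requirements Engineer AEI /Tax Reporting at Credit Suisse"
gold = [[(0, 21, "PERIOD")], [(44, 57, "ORG")]]
predicted = [[(0, 21, "PERIOD")], [(44, 50, "ORG")]]  # second span is cut short

print(span_prf(gold, predicted))
```

Running the same function on the same held-out examples after every change (more annotations, new labels, different base model) keeps the numbers comparable between runs.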
Since I have additional questions, and the community might also benefit from this discussion, I decided to post them here.
- I used the ner.make-gold recipe with all the entities (ORG, DATE, GPE, LOC, POSITION) and ended up with a dataset of 2,000 annotations. Running ner.batch-train on that gold dataset gave an accuracy of 67%, but I did not understand how to use it as an evaluation set and proceed from there. During annotation I got the impression that the model did well on date ranges and often got them right, but not so well on POSITION, where it often failed even on the same kind of example (perhaps because at that point I had not yet added the new entity that I describe in the second question below).
- I wanted to add another entity such as ROLE (or POSITION) and train it using the terms.teach recipe with word vectors (e.g. a large spaCy model) to create a terminology list of examples of the new entity type, starting with seed terms such as "project manager", "systems analyst", "software engineer", "data engineer", etc. I followed the lecture "Training a new entity type on Reddit comments", where you trained the DRUG entity, but in my case I keep getting only one-word suggestions. They are relevant, but I would also expect two-word suggestions.
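One possible workaround for the multi-word seed terms, sketched here under assumptions rather than as a confirmed fix, is to turn the seeds directly into token-based match patterns instead of relying on vector similarity. The snippet assumes a JSONL patterns file with a `label` and a list of token descriptions per line, in the style of spaCy's `Matcher`; the helper name `seed_terms_to_patterns` is hypothetical, and splitting on whitespace is only an approximation of real tokenization.

```python
import json


def seed_terms_to_patterns(terms, label):
    """Build one case-insensitive token pattern per seed term."""
    patterns = []
    for term in terms:
        # Whitespace split: an approximation of how a tokenizer would split the term.
        tokens = term.lower().split()
        patterns.append({
            "label": label,
            "pattern": [{"lower": token} for token in tokens],
        })
    return patterns


seeds = ["project manager", "systems analyst", "software engineer", "data engineer"]
patterns = seed_terms_to_patterns(seeds, "POSITION")

# One JSON object per line, ready to be written to a .jsonl file.
jsonl = "\n".join(json.dumps(pattern) for pattern in patterns)
print(jsonl)
```

Each two-word seed becomes a two-token pattern, so a phrase like "Project Manager" can be matched as a whole span regardless of casing.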
Here are some examples of the data that we use:
{"text": "Master of Business (Strategic management) with multidisciplinary skills and over 15 years experience"}
{"text": "in Financial Services. Strong background in project management (PRINCE2-certified) and"}
{"text": "requirements engineering."}
{"text": "Currently working toward certification as Data Protection Officer (DPO) EU GDPR."}
{"text": "Januar 2017 - Present"}
{"text": "Requirements Engineer AEI /Tax Reporting at Credit Suisse"}
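To make the "start and end point separately" idea concrete on lines like these, here is a rough rule-based sketch using only Python's `re` module. It is an assumption-laden illustration, not a replacement for a trained model: the labels `PERIOD_START`, `PERIOD_END` and `DURATION` are hypothetical, and the expressions only cover numeric month.year dates, month-name plus year, the literal "Present", and durations such as "1 year and 6 months".

```python
import re

# One end of a range: "12.2012", "Januar 2017", or the open end "Present".
POINT = r"(?:\d{1,2}\.\d{4}|[^\W\d_]+ \d{4}|Present)"
# Two points separated by a hyphen or an en dash.
RANGE = re.compile(rf"(?P<start>{POINT})\s*[-–]\s*(?P<end>{POINT})")
# A duration in years, optionally followed by "and N months".
DURATION = re.compile(r"\d+\s+years?(?:\s+and\s+\d+\s+months?)?")


def find_periods(text):
    """Return (start_char, end_char, label) spans, with range ends split out."""
    spans = []
    for match in RANGE.finditer(text):
        spans.append((match.start("start"), match.end("start"), "PERIOD_START"))
        spans.append((match.start("end"), match.end("end"), "PERIOD_END"))
    for match in DURATION.finditer(text):
        spans.append((match.start(), match.end(), "DURATION"))
    return spans


for line in ["Januar 2017 - Present", "12.2012 - 05.2018", "1 year and 6 months"]:
    print(line, find_periods(line))
```

Note that the duration expression would also fire on "15 years" in "over 15 years experience", which is one of the subtle conflicts a separate label would have to deal with.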
Thanks