Custom NER model

(Jishnu Nair) #1

Hello guys!
I'm trying to extract entities from text, for example names in resumes, but because many names occur in each document the model gets confused and the results are not up to the mark. I also tried annotating with some context, but the results are still not satisfactory. Any idea how to differentiate between the values obtained for the same entity type?
I have built a custom spaCy model from scratch using annotation.

  1. Also, to solve the above problem, I tried to find the sentence/context associated with each extracted value so that I could tell the values apart. But as most of the cases are unique, I am not able to generalize it. Any suggestions?

(Matthew Honnibal) #2

I just want to confirm quickly the type of task you’re doing. Which of the following sounds closest to what you need the model to output:

  1. Mark text that is somebody’s name.
  2. Find names with a particular relationship to document, for instance “submitter’s name”, “reference name”, etc
  3. Link names to database entries, e.g. “This Mark Smith is Mark_Smith_195667”, or even just “Mark and Mr. Smith both refer to the same Mark Smith, while this reference to M. Smith refers to someone else”.

(Jishnu Nair) #3

Okay, here is an example. I have three different types of dates that need to be extracted. Let's say for a contract these are the start date, the end date and the date of agreement. The first two were annotated with some context associated with them. But for the date of agreement there wasn't any context, so it was annotated as a normal date. As a result, the date of agreement gets confused with the start date and end date.

And to answer your question, the second option that you have mentioned seems most appropriate.

Thank You


(Jishnu Nair) #4

Also, in spaCy I tried using the prohibit_action() function to remove a pretrained entity, but apparently it's not working. Is it still active? If not, can you suggest a method to remove a pretrained entity from the model?


(Matthew Honnibal) #5

prohibit_action() is experimental, but it should work. What happened when you tried it?

I think NER is probably not a great fit for the type of task you’re doing. The NER model reads the sentence left-to-right, so if the context that disambiguates whether the entity is a start or end date occurs after the entity, the model will have no chance to get it right.

I think you should separate the task into two processes: one to recognise the dates, and another to recognise whether it’s a start date or an end date. If you have a lot of dates which are neither, you probably want to have a sentence classification model that tells you whether the sentence is relevant or not.
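As a rough illustration of the two-step split described above, here is a minimal sketch in plain Python. It is not code from spaCy or Prodigy: the regular expression, the cue words in ROLE_CUES and the helper names find_dates and classify_role are all hypothetical and would need adapting to real contracts. The point it shows is that the second step can look at context on both sides of the date, which a left-to-right NER model cannot do.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Step 1: recognise date strings. Illustrative only: covers forms such as
# "1 April 2021" and "01/04/2021".
DATE_RE = re.compile(
    r"\b(?:\d{1,2} (?:" + MONTHS + r") \d{4}|\d{1,2}/\d{1,2}/\d{4})\b"
)

# Step 2: hypothetical cue patterns for each role (assumptions, not a real lexicon).
ROLE_CUES = {
    "START_DATE": r"\b(?:commenc\w*|start\w*)\b",
    "END_DATE": r"\b(?:terminat\w*|expir\w*|end\w*)\b",
    "AGREEMENT_DATE": r"\b(?:dated|signed|entered into)\b",
}


def find_dates(text):
    """Return (start, end, matched_text) for every date-like string."""
    return [(m.start(), m.end(), m.group()) for m in DATE_RE.finditer(text)]


def classify_role(text, start, end, window=40):
    """Pick the role whose cue is closest to the date, looking left AND right."""
    left = text[max(0, start - window):start]
    right = text[end:end + window]
    best_role, best_dist = "OTHER_DATE", None
    for role, pattern in ROLE_CUES.items():
        cue = re.compile(pattern, re.IGNORECASE)
        # Distance in characters from each cue to the date, on both sides.
        dists = [len(left) - m.end() for m in cue.finditer(left)]
        dists += [m.start() for m in cue.finditer(right)]
        if dists and (best_dist is None or min(dists) < best_dist):
            best_role, best_dist = role, min(dists)
    return best_role


text = ("This agreement is dated 3 March 2021. "
        "The term shall commence on 1 April 2021. "
        "31 March 2022 is the expiry date.")
roles = [(date, classify_role(text, s, e)) for s, e, date in find_dates(text)]
for date, role in roles:
    print(date, role)
```

In the last sentence of the example the disambiguating word comes after the date, which is the situation where a separate role classifier helps most.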

The design of how the date recognition and role classification components should work is an open question as well. You should probably start by just doing ner.manual annotation for a small evaluation set. This way you can get some experience with what’s common in the text, and at the end of it you’ll have a pilot evaluation set. Probably the most efficient next step would be to write matcher patterns, and evaluate them on date recognition against your manual annotations. You might also find it useful to develop rules for whether it’s a start or end date as well.
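To make the "evaluate them against your manual annotations" step concrete, here is a small sketch of an exact-match span evaluation. It assumes both the manual annotations and the rule output have already been converted to sets of (start, end, label) character offsets keyed by an example id; that data layout and the function name evaluate_spans are assumptions for illustration, not a Prodigy or spaCy API.

```python
def evaluate_spans(gold, predicted):
    """Exact-match precision, recall and F-score over labelled spans.

    gold, predicted: dict mapping an example id to a set of
    (start, end, label) tuples.
    """
    tp = fp = fn = 0
    for key in gold.keys() | predicted.keys():
        g = set(gold.get(key, ()))
        p = set(predicted.get(key, ()))
        tp += len(g & p)   # spans the rules got exactly right
        fp += len(p - g)   # spans the rules proposed but the annotator did not
        fn += len(g - p)   # annotated spans the rules missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f_score": f_score}


# Toy data: offsets are made up for the example.
gold = {0: {(24, 36, "DATE")}, 1: {(5, 15, "DATE"), (30, 40, "DATE")}}
predicted = {0: {(24, 36, "DATE")}, 1: {(5, 15, "DATE"), (50, 60, "DATE")}}
scores = evaluate_spans(gold, predicted)
print(scores)
```

Re-running this after each change to the rules gives a quick signal of whether the change helped on the pilot evaluation set.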

When considering whether to create a rule-based solution or a machine learning-based one, it’s worth imagining the different effort vs efficacy curves the two approaches might have on your problem. It’s usually the case that eventually a machine learnt solution will overtake a rule-based one, and on some problems, the machine learning solution pretty much dominates. For instance, if you’re classifying news articles by topic and spend five minutes annotating, you’ll almost certainly do better than spending five minutes making up rules. But on many problems, this isn’t the case. For date recognition within a specific domain, you’ll probably do better with rules until you have quite a lot of annotations. If the texts are well edited, the rules might actually be perfect, as you can exactly reverse engineer the rules which went into generating the text.

If rule-based systems are good early on in your problem, it’s still worth creating them as a bootstrapping process. You can use the rule-based system to help you annotate, making it much quicker to get to the point where the machine learning solution can take over. The key to doing this effectively is to switch between making evaluation data and doing the system work (on either the training data, rules or hyper-parameter selection). This way you can know whether you’re moving in the right direction, and you can figure out what to do next, and when to switch tactics.
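One way to picture the bootstrapping idea is to let the rules pre-fill span suggestions that an annotator then corrects. The sketch below assumes the annotation tool reads JSONL records with a "text" field and a "spans" list of character offsets with a "label"; the regular expression and the helper names pre_annotate and to_jsonl are hypothetical.

```python
import json
import re

# Illustrative pattern only: numeric dates such as 01/04/2021.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def pre_annotate(texts, label="DATE"):
    """Yield one task per text, with rule-based span suggestions attached."""
    for text in texts:
        spans = [
            {"start": m.start(), "end": m.end(), "label": label}
            for m in DATE_RE.finditer(text)
        ]
        yield {"text": text, "spans": spans}


def to_jsonl(tasks):
    """Serialise tasks as newline-delimited JSON, one record per line."""
    return "\n".join(json.dumps(task) for task in tasks)


texts = [
    "The lease starts on 01/04/2021 and ends on 31/03/2022.",
    "No dates here.",
]
tasks = list(pre_annotate(texts))
print(to_jsonl(tasks))
```

Texts where the rules find nothing are still emitted, so the annotator can add spans the rules missed and those misses end up in the data.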


(Jishnu Nair) #6

Thank you so much for the reply.

For the

I used your sample code to simply check if it’s working:

import spacy
nlp = spacy.load('en_core_web_sm')

'spacy.syntax.ner.BiluoPushDown' object has no attribute 'prohibit_action'


(Matthew Honnibal) #7

Hm, I guess I did remove that. The following should work in v2.1, which will be supported by v1.8 of Prodigy, which we expect to release soon. This is still internals though — please mark in your code that this API can change, so your code may need to be updated.

If you do ner = nlp.get_pipe("ner") you’ll get an instance that’s a subclass of the spacy.syntax.nn_parser.Parser class. Once the model is loaded, the ner.model attribute gives you an instance of spacy.syntax._parser_model.ParserModel. This class has an attribute unseen_classes that is a set of class IDs. If you add the class ID to this, you should prevent the class from being predicted:

ner = nlp.get_pipe("ner")
class_ids = {name: i for i, name in enumerate(ner.move_names)}