Intro:
- I want to extract the Last Name and First Name as one entity from a resume.
- I am working with resume (curriculum vitae) documents.
- I am working on improving the existing “PERSON” category.
Problem formulation:
The en_core_web_sm model does not recognize Last Name and First Name as a single entity at the start of training. At that stage it also labels many irrelevant tokens as PERSON entities. Fortunately, the model sometimes recognizes the First Name alone as a PERSON (see the screenshot).
Question:
Which option is better?
Option 1:
Accept partially correct entities at the beginning of training, that is, press the green button for the case shown in the screenshot. This should let the model pay more attention to relevant tokens (real Last Names and First Names) and less attention to irrelevant tokens such as “Java”, “Visio”, “Jira” and so on.
As soon as the model starts paying more attention to tokens that belong to real Last Names and First Names, I would start rejecting partially correct predictions, to teach the model that it should learn two-token entities.
Option 2:
Reject partially correct entities from the start. The model will then begin learning two-token entities, but in the meantime I will also need to reject many irrelevant suggestions such as “Java”, “Visio”, “Jira” and so on, because the model will still be trying to work out what I want it to learn.
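To make the difference between the two options concrete, here is a minimal sketch in plain Python of how I would answer each suggestion. It does not call spaCy or the annotation tool. The function names (`classify`, `answer`), the example tokens, and the token-offset span format are assumptions made up for illustration.

```python
from typing import List, Tuple

# A span is (start_token, end_token) with the end exclusive.
# This format is an assumption for the sketch, not the tool's own format.
Span = Tuple[int, int]


def classify(pred: Span, gold: Span) -> str:
    """Label a suggested span as 'exact', 'partial' or 'irrelevant'."""
    if pred == gold:
        return "exact"
    # Two half-open intervals overlap when each starts before the other ends.
    overlaps = pred[0] < gold[1] and gold[0] < pred[1]
    return "partial" if overlaps else "irrelevant"


def answer(pred: Span, gold: Span, accept_partial: bool) -> str:
    """Decide which button to press for one suggestion.

    accept_partial=True corresponds to Option 1 (early phase),
    accept_partial=False corresponds to Option 2.
    """
    kind = classify(pred, gold)
    if kind == "exact" or (kind == "partial" and accept_partial):
        return "accept"
    return "reject"


# Made-up example line from a resume.
tokens: List[str] = ["John", "Smith", "Java", "developer"]
gold: Span = (0, 2)  # the full name, "John Smith"

# Hypothetical suggestions: first name only, full name, and "Java".
suggestions: List[Span] = [(0, 1), (0, 2), (2, 3)]

option1 = [answer(s, gold, accept_partial=True) for s in suggestions]
option2 = [answer(s, gold, accept_partial=False) for s in suggestions]

print(option1)  # ['accept', 'accept', 'reject']
print(option2)  # ['reject', 'accept', 'reject']
```

The only difference between the two options is the answer given to the first-name-only suggestion. The irrelevant suggestion is rejected in both cases.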
Thank you in advance for choosing the best option and explaining your choice.