Sharing this approach in response to @mv3's question in a previous post:
My NLP project uses a structured, three-stage Named Entity Recognition (NER) approach, with Prodigy for annotating the job postings.
The first stage involves training an NER model to categorize broad sections of the job postings. This is achieved by manually annotating large spans of text, such as entire sentences or paragraphs, within Prodigy, using high-level labels that capture the general content of these sections: for instance, distinct segments like 'Requirements', 'Responsibilities', or 'Qualifications'. This sets the stage for a more detailed analysis.
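To make that concrete, here's roughly what one of these annotations looks like once saved. The structure is Prodigy's standard NER task format (text plus character-offset spans); the example text and label name are just illustrative, not my exact scheme:

```python
# A minimal sketch of a stage-one Prodigy task: one long span covering a
# whole section, annotated with a high-level section label.
# (Example text and label name are illustrative only.)
text = "Requirements: 5+ years of Python experience and a BS in CS."
stage1_task = {
    "text": text,
    "spans": [{"start": 0, "end": len(text), "label": "REQUIREMENTS"}],
}
```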
The second stage involves applying intermediate-level labels to the text. These labels are 'Behavior', 'Ability', 'Credential', 'Knowledge', 'Experience', 'Softskill', and 'Techskill'. At this level, I'm still working within a relatively broad scope but starting to zoom in on the specifics of the job description. This is where the model learns to distinguish between the types of qualifications and traits sought by employers.
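Concretely, a stage-two annotation sits inside the text of a stage-one section. A rough sketch of what that looks like (the uppercase label spelling and the helper function are mine, purely for illustration):

```python
# Stage-2 spans live inside the text of a stage-1 section.
section_text = "5+ years of Python experience and a BS in CS."

def span_for(text, phrase, label):
    """Build a Prodigy-style span dict by locating `phrase` in `text`."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}

stage2_task = {
    "text": section_text,
    "spans": [
        span_for(section_text, "5+ years of Python experience", "EXPERIENCE"),
        span_for(section_text, "BS in CS", "CREDENTIAL"),
    ],
}
```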
The third and final stage is where the model gets even more specific. Here, I apply granular labels such as 'hardware', 'software', 'degree', 'certification', and 'years of experience'. These labels delve into the particulars within the previously identified intermediate sections. By training the model to understand and identify these detailed entities, I can extract very specific information from the job postings.
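And the same pattern one level down, zooming into a single intermediate span (again, the offsets and label spellings here are illustrative):

```python
# Stage-3 zooms into an intermediate-level span, here the EXPERIENCE span
# from the previous sketch, and tags its granular pieces.
text = "5+ years of Python experience"
stage3_task = {
    "text": text,
    "spans": [
        # "5+ years" starts at offset 0
        {"start": 0, "end": len("5+ years"), "label": "YEARS_OF_EXPERIENCE"},
        # "Python" sits further into the span
        {"start": text.index("Python"),
         "end": text.index("Python") + len("Python"),
         "label": "SOFTWARE"},
    ],
}
```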
This multi-tiered approach allows the NER model to process complex job postings in a layered fashion, improving its accuracy and utility. The initial broad classification helps in managing the complexity by segmenting the data, and the subsequent, more detailed labeling captures the fine-grained information necessary for a comprehensive analysis.
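If it's useful, this is roughly how the three stages chain together at inference time. A minimal sketch, assuming each stage has been trained and packaged as its own spaCy pipeline (the model paths are placeholders):

```python
import spacy

# Placeholder paths: one packaged pipeline per stage, e.g. exported with
# `prodigy data-to-spacy` and trained with `spacy train`.
stage1 = spacy.load("./models/stage1_sections")
stage2 = spacy.load("./models/stage2_categories")
stage3 = spacy.load("./models/stage3_details")

def analyze(posting: str) -> list[dict]:
    """Run the three stages in sequence, narrowing the text at each step."""
    results = []
    for section in stage1(posting).ents:            # e.g. REQUIREMENTS
        for category in stage2(section.text).ents:  # e.g. EXPERIENCE
            for detail in stage3(category.text).ents:  # e.g. YEARS_OF_EXPERIENCE
                results.append({
                    "section": section.label_,
                    "category": category.label_,
                    "detail": detail.label_,
                    "text": detail.text,
                })
    return results
```

Each stage only ever sees the text its parent stage carved out, which is what keeps each individual model's job simple.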
[Screenshot: Stage 1 annotation]
[Screenshot: Stage 2 annotation, now within Candidate-Qualifications]
[Screenshot: Stage 3 (currently in development)]