NER Annotation Strategy

Sharing this approach in response to @mv3's question in a previous post:

My NLP project uses a structured, three-stage Named Entity Recognition (NER) approach, with Prodigy as the annotation tool for job postings.

The first stage involves training an NER model to categorize broad sections of the job postings. This is done by manually annotating large spans of text, such as entire sentences or paragraphs, in Prodigy, using high-level labels that capture the general content of these sections. For instance, I identify distinct segments like 'Requirements', 'Responsibilities', or 'Qualifications'. This sets the stage for a more detailed analysis.
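To make that concrete, a Stage 1 annotation session looks roughly like this (a sketch, not my exact setup: the dataset name, source file, and label casing here are placeholders):

```
prodigy ner.manual stage1_sections blank:en ./job_postings.jsonl \
  --label REQUIREMENTS,RESPONSIBILITIES,QUALIFICATIONS
```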

The second stage involves applying intermediate-level labels to the text. These labels are 'Behavior', 'Ability', 'Credential', 'Knowledge', 'Experience', 'Softskill', and 'Techskill'. At this level, I'm still working within a relatively broad scope but starting to zoom in on the specifics of the job description. This is where the model learns to distinguish between the types of qualifications and traits sought by employers.
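In practice, once an initial Stage 2 model exists, these annotations can be made by correcting its predictions (again a sketch; the dataset name and model path are placeholders, and Prodigy treats labels as arbitrary strings, so the casing is just convention):

```
prodigy ner.correct stage2_quals ./models/stage2 ./job_postings.jsonl \
  --label BEHAVIOR,ABILITY,CREDENTIAL,KNOWLEDGE,EXPERIENCE,SOFTSKILL,TECHSKILL
```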

The third and final stage is where the model gets even more specific. Here, I apply granular labels such as 'hardware', 'software', 'degree', 'certification', and 'years of experience'. These labels delve into the particulars within the previously identified intermediate sections. By training the model to understand and identify these detailed entities, I can extract very specific information from the job postings.

This multi-tiered approach allows the NER model to process complex job postings in a layered fashion, improving its accuracy and utility. The initial broad classification helps in managing the complexity by segmenting the data, and the subsequent, more detailed labeling captures the fine-grained information necessary for a comprehensive analysis.
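To make the tiers easier to see at a glance, here is the label scheme written out as nested Python data. This is only a sketch: the exact Stage 3 to Stage 2 nesting shown is illustrative rather than a fixed mapping, and the empty lists just mean no granular labels yet.

```python
# Sketch of the three-tier label scheme. The nesting is illustrative;
# empty lists mean "no Stage 3 labels defined for this parent yet".
LABEL_SCHEME = {
    "CANDIDATE_QUALIFICATIONS": {                # Stage 1: broad section
        "TECHSKILL":  ["hardware", "software"],  # Stage 2 -> Stage 3
        "CREDENTIAL": ["degree", "certification"],
        "EXPERIENCE": ["years of experience"],
        "BEHAVIOR":   [],
        "ABILITY":    [],
        "KNOWLEDGE":  [],
        "SOFTSKILL":  [],
    },
    "REQUIREMENTS": {},
    "RESPONSIBILITIES": {},
}
```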

Stage 1 [screenshot]

Stage 2 (now within Candidate-Qualifications) [screenshot]

Stage 3 (currently in development)


Hi @kylebigelow, thank you for providing an in-depth strategy. If I'm reading this right, the approach leverages the spans' position within the text, then their generic categorization, followed by fine-grained labels. The spans, however, appear to be arbitrarily long. Am I correct that the highlighted annotations are what constitute the entities? For my use case, I've started experimenting with training a spancat model (with multiple sub-spans) instead of NER, which is brittle here because it isn't designed for sub-entities within entities. I'd have liked a way to use my entities within spancat training, but I presently don't see a way of connecting the two except through the pattern files.
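The closest partial workaround I've found is copying existing entities into the span group that spancat trains from, though that only gives the top-level spans, not the sub-spans I need. A sketch, assuming spaCy's default "sc" spans key and placeholder file names:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
docs_in = DocBin().from_disk("./ner_annotations.spacy")
docs_out = DocBin()

for doc in docs_in.get_docs(nlp.vocab):
    # Copy gold entities into the span group the spancat component reads
    # from by default ("sc"). Sub-spans still have to be annotated
    # separately, since doc.ents cannot overlap.
    doc.spans["sc"] = list(doc.ents)
    doc.ents = []  # keep only span annotations for spancat training
    docs_out.add(doc)

docs_out.to_disk("./spancat_annotations.spacy")
```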


Good point; I forgot to mention that I'm using the GPU (transformer) config during training. Yes, the highlighted sentences or phrases are annotated using the ner.correct recipe.
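For completeness, the training step looks roughly like this (a sketch; the output path, dataset name, and config file name are placeholders, with the transformer settings living in the spaCy config passed via --config):

```
prodigy train ./models/stage1 --ner stage1_sections --config ./transformer.cfg --gpu-id 0
```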

I see what you're trying to do now. Prodigy's pattern functions don't have full parity with spaCy's. The merge you're attempting seems appropriate, but it may not be seamless. For each dataset, did you export and then rehash during import?

Yes, I was able to rehash the patterns using the trick suggested in the original thread.
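For anyone who lands here later, the rehash boils down to something like this (a sketch using Prodigy's Python API; the file and dataset names are placeholders):

```python
import srsly
from prodigy import set_hashes
from prodigy.components.db import connect

# Re-hash exported examples so Prodigy treats them as fresh tasks
# when they are imported into the merged dataset.
examples = [set_hashes(eg, overwrite=True)
            for eg in srsly.read_jsonl("./exported.jsonl")]

db = connect()
db.add_dataset("merged_dataset")
db.add_examples(examples, datasets=["merged_dataset"])
```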
