Hello,
I'm developing a Named Entity Recognition (NER) pipeline to analyze job postings. This pipeline identifies various categories like job title, salary, division, sponsorship, graduation date, etc. Our standard approach involves labeling entire job postings, each with multiple terms.
Recently, I experimented with Prodigy's ner.teach function. When using sentence splitting, only a fraction of each job posting is labeled per annotation, so I am not sure whether the model will be able to label everything when the input is an unsegmented job posting. However, with the --unsegmented option, the annotations suggested by the model are often incomplete. For instance, in an Amazon job posting, the model might correctly identify 'Amazon' as the company but miss other important labels such as division and job title.
Given this, I'm unsure of the best approach. Should I use --unsegmented, opt for sentence segmentation, or stick with ner.correct for a more comprehensive labeling? Currently, we have annotated about 200 job postings, each with around 10 labels.
Any advice on the most effective strategy would be greatly appreciated.
In principle, segmentation shouldn't have a big effect on NER predictions, since the model looks at narrow windows of tokens, at least for spaCy NER models. So it should be fine to train on segmented job ads and use full texts in production.
However, if for some reason you notice a significant difference in the way your model annotates depending on the segmentation, I'd make sure the same segmentation is applied in training and production.
If annotating segmented job ads is the most efficient way to annotate, ideally you'd keep annotating with sentence segmentation and apply the same segmentation in production.
If it's not possible to apply the same segmentation in production, I'd probably switch to ner.correct for the best possible labelling over entire job ads.
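To illustrate what "the same segmentation in training and production" could look like, here is a minimal sketch. It assumes one shared `segment` function is used both when creating annotation tasks and at prediction time, and that per-sentence predictions are shifted back to document-level character offsets. The regex splitter and `toy_predict` are hypothetical stand-ins, not Prodigy or spaCy APIs; in practice you would plug in the splitter and model you actually use.

```python
import re
from typing import Callable, Iterator, List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def segment(text: str) -> Iterator[Tuple[int, str]]:
    """Yield (offset, sentence) pairs.

    Hypothetical stand-in for whatever splitter was used during annotation;
    the point is that training and production share this one function.
    """
    for match in re.finditer(r"[^.!?\n]+[.!?]?", text):
        raw = match.group()
        sentence = raw.strip()
        if sentence:
            # Account for leading whitespace removed by strip().
            lead = len(raw) - len(raw.lstrip())
            yield match.start() + lead, sentence


def predict_document(
    text: str, predict_segment: Callable[[str], List[Span]]
) -> List[Span]:
    """Run a per-sentence predictor and map spans back to document offsets."""
    spans: List[Span] = []
    for offset, sentence in segment(text):
        for start, end, label in predict_segment(sentence):
            spans.append((start + offset, end + offset, label))
    return spans


def toy_predict(sentence: str) -> List[Span]:
    """Hypothetical model stand-in: tags the literal string 'Amazon'."""
    return [(m.start(), m.end(), "COMPANY") for m in re.finditer("Amazon", sentence)]


text = "Join us. Amazon is hiring engineers."
spans = predict_document(text, toy_predict)
print(spans)
```

Because the offsets are mapped back to the full posting, downstream code can treat the output as if the model had seen the whole document at once.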
I have one more question. For one of my secondary NER pipelines I have over 30 labels, and when I use ner.correct the user interface cannot load my text. How do I cope with that?
The truth is that the UI was not really designed to handle this many labels. There's a reason for that: it is likely not the best idea to annotate this many labels at the same time.
This would be really taxing for the annotators, as they need to keep a big data model in mind with every annotation task, and it is not the easiest task for the model either. This thread by @ines explains very well why you might consider splitting your annotation into steps.
In your case it looks like you could have several high-level classes such as EXPERIENCE and DIVISION, and once your model is able to distinguish between these high-level classes, you could set up follow-up annotations to reannotate EXPERIENCE and DIVISION into their fine-grained classes.
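As a rough sketch of that two-step scheme: the mapping below collapses fine-grained labels into coarse ones for the first pass, and then groups the fine-grained labels per coarse class so each follow-up session only shows a handful of options. All label names other than EXPERIENCE, DIVISION and EXPERIENCE_LEVEL are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)

# Hypothetical fine-grained -> coarse mapping; replace with your own scheme.
FINE_TO_COARSE: Dict[str, str] = {
    "EXPERIENCE_LEVEL": "EXPERIENCE",
    "EXPERIENCE_YEARS": "EXPERIENCE",
    "DIVISION_NAME": "DIVISION",
    "DIVISION_LOCATION": "DIVISION",
}


def coarsen(spans: List[Span]) -> List[Span]:
    """Step 1: collapse existing fine-grained spans into coarse classes."""
    return [(s, e, FINE_TO_COARSE.get(label, label)) for s, e, label in spans]


def followup_label_sets() -> Dict[str, List[str]]:
    """Step 2: for each coarse class, the fine labels to offer in a follow-up."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for fine, coarse in FINE_TO_COARSE.items():
        groups[coarse].append(fine)
    return dict(groups)


print(coarsen([(0, 6, "EXPERIENCE_YEARS"), (10, 20, "SALARY")]))
print(followup_label_sets())
```

Each follow-up session then only needs the short label list for one coarse class, which keeps the label bar manageable.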
If you're interested in some more NER annotation good practice tips, this thread has plenty of relevant references on the topic of dealing with a high number of labels.
Thank you for the response, I will look into the thread. However, this is already the second pipeline of our NER model where the first model detects more general labels and the results are quite decent. Our annotators are also quite specialized in the task. Is there any way that the UI can show both our labels and the text, like adjusting font size of labels and text?
In that case you probably want to modify the card's CSS via the global_css setting in .prodigy.json.
You probably want to make it wider and perhaps set a max height so that it becomes scrollable, but I'd like to reiterate that this is not our recommended way of dealing with a high number of labels!
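A small sketch of what such a config could look like, generated from Python so the values are easy to tweak. The global_css setting is the one mentioned above; the two CSS selectors and the pixel values are assumptions, so please check the actual class names in your browser's inspector for your Prodigy version before relying on them.

```python
import json

# ASSUMPTION: illustrative selectors; confirm the real class names with the
# browser's developer tools, as they may differ between Prodigy versions.
CARD_SELECTOR = ".prodigy-container"
LABELS_SELECTOR = ".prodigy-labels"


def build_global_css(
    card_max_width_px: int = 1400,
    labels_max_height_px: int = 160,
    label_font_px: int = 11,
) -> str:
    """Return a CSS string that widens the card and makes the label bar scrollable."""
    rules = [
        f"{CARD_SELECTOR} {{ max-width: {card_max_width_px}px; }}",
        f"{LABELS_SELECTOR} {{ max-height: {labels_max_height_px}px; "
        f"overflow-y: auto; font-size: {label_font_px}px; }}",
    ]
    return " ".join(rules)


# Merge this key into your existing .prodigy.json rather than overwriting the file.
config = {"global_css": build_global_css()}
print(json.dumps(config, indent=2))
```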
You might also consider shortening the label names, e.g. use EXP as a prefix instead of the full EXPERIENCE_LEVEL. Hopefully that is enough to avoid the need to scroll.
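If you go the shortening route, a helper along these lines could rename the labels consistently and catch accidental collisions. This is only a sketch: the prefix map and the label names besides EXPERIENCE_LEVEL are hypothetical, and you would still need to apply the same mapping to any annotations you have already collected.

```python
from typing import Dict, Iterable


def shorten_labels(labels: Iterable[str], prefix_map: Dict[str, str]) -> Dict[str, str]:
    """Map each label to a shortened form by replacing a known long prefix."""
    shortened: Dict[str, str] = {}
    for label in labels:
        new_label = label
        for long_prefix, short_prefix in prefix_map.items():
            if label.startswith(long_prefix):
                new_label = short_prefix + label[len(long_prefix):]
                break
        shortened[label] = new_label
    # Two different labels must never end up with the same short name.
    if len(set(shortened.values())) != len(shortened):
        raise ValueError("shortened labels collide; adjust the prefix map")
    return shortened


mapping = shorten_labels(
    ["EXPERIENCE_LEVEL", "EXPERIENCE_YEARS", "DIVISION"],
    {"EXPERIENCE": "EXP"},
)
print(mapping)
```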