Parsing/Identifying sections in job descriptions

Hi everyone,

I’m trying to solve quite a difficult problem: building a generic parser for job descriptions. The idea is that, given a job description, the parser should be able to identify and extract the different sections, such as job title, location, description, responsibilities, qualifications, etc.

A rule-based approach doesn’t work since the scenario is too generic. My next approach was to train a custom NER classifier; I’ve done this numerous times before. However, I’m running into several problems:

  1. The entities can be very small (location, job title, etc.) or very large (responsibilities, qualifications, etc.). I’m not sure how well NER works when an entity is several lines or even a paragraph long; most of the use cases I’ve seen involve entities that are no more than a few words. Does spaCy’s NER work well when the entity spans I want to identify are that long? (I can give examples if required to make it clearer.)
  2. Is there any other strategy besides NER that I could use to parse these job descriptions?
  3. I’ve been studying Prodigy for a week or so, and from what I can gather, it serves as an annotation and training tool: as I annotate data for NER, the model keeps learning alongside me and tries to make the annotation easier as I go. Is this how it works? It’s quite expensive, and I want to ensure that Prodigy does what I think it does before asking my client to make the purchase.

Any help here would be greatly appreciated. I’ve been banging my head against different walls for a few months, and while I have made some progress, I’m not sure if I’m on the right track or if a better approach exists.

Hi! I remember answering some questions about a job description project a while ago, so this thread might be useful, too:

By definition, a named entity is usually a "real world object" – like a person, an organisation, a product or other distinct names like that. This is also what the underlying statistical model is optimised for. If your goal is to label longer phrases or even whole paragraphs, that's not typically an end-to-end problem for named entity recognition.
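
To make that concrete, here's a minimal sketch (assuming the small English pipeline "en_core_web_sm" is installed, and the example sentence is made up) of the kind of short spans the pretrained entity recognizer is designed to predict:

```python
import spacy

# Minimal sketch: the pretrained NER predicts short, real-world entities
# like organisations and places, not multi-line sections.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corp is looking for a Senior Data Engineer in Berlin.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Typically something like: "Acme Corp" ORG, "Berlin" GPE
```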

Instead, your task might be a better fit for a combined approach: using the entity recognizer to label more generic entity types, the dependency parse to select the whole phrase you're interested in, and maybe the text classifier to assign top-level topics to the extracted paragraphs.
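
As a rough sketch of that combined idea (the example sentence and pipeline are just illustrative, and the exact entities you get depend on the model's predictions), you could use a generic entity as an anchor and expand it via the dependency parse; a text classifier trained on your own categories could then assign topics to the extracted sections:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("You will collaborate with cross-functional teams in Berlin and Munich.")

# Use generic entities (here: places) as anchors, then use the dependency
# parse to pull out the larger phrase they're embedded in. Walking further
# up .head gives you even bigger units, up to the whole clause or sentence.
for ent in doc.ents:
    head = ent.root.head                            # token the entity attaches to
    subtree = list(head.subtree)                    # tokens of the surrounding phrase
    phrase = doc[subtree[0].i : subtree[-1].i + 1]  # contiguous span over that subtree
    print(ent.text, ent.label_, "->", phrase.text)
```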

I explain this idea in more detail in this thread, which should be pretty relevant to your use case as well:

Yes, that's one workflow we've built in. You can also use Prodigy to label your data 100% from scratch and without a model in the loop. The data can be exported as a straightforward JSONL file, so you don't have to use spaCy and can use the annotations with any other tool or process.
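
For example, a single exported line might look roughly like this (the text and span offsets are made up; the field names follow Prodigy's task format), and since it's plain JSON you can load it with anything:

```python
import json

# Hypothetical exported line – JSONL means one JSON object per line.
line = ('{"text": "Senior Data Engineer, Berlin", '
        '"spans": [{"start": 0, "end": 20, "label": "JOB_TITLE"}], '
        '"answer": "accept"}')

task = json.loads(line)
for span in task["spans"]:
    print(task["text"][span["start"]:span["end"]], span["label"])
# -> Senior Data Engineer JOB_TITLE
```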

In general, Prodigy's philosophy is to make annotation faster and more efficient by breaking larger tasks down into smaller decisions and automating or scripting as much as possible. Another thing we advocate for is running smaller experiments and iterating on your data. NLP is pretty experimental, so you just need to try lots of things. For example, you might want to try out different labelling strategies to see which one is most promising. Can you teach the named entity recognizer your new entity definition, or does it make more sense as a text classification task? Do fine-grained or more generic categories work better? Is the data suitable, and does the model improve if you add more data from source X? If you can try these things quickly by labelling a few examples and running a few experiments, you'll ideally spend less time shooting in the dark and can focus on the most promising solution sooner.

Btw, if you haven't seen it yet, you might also want to check out our prodigy-recipes repo, which shows some examples of how you can script Prodigy to build different annotation workflows and do your own automation:
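
As a very stripped-down sketch of what a custom recipe can look like (the recipe name, labels and arguments below are placeholders, loosely modelled on the examples in that repo):

```python
import spacy
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

# Placeholder recipe: highlight job description sections manually.
@prodigy.recipe("job-sections.manual")
def job_sections_manual(dataset, source):
    nlp = spacy.blank("en")                   # blank pipeline, only for tokenization
    stream = add_tokens(nlp, JSONL(source))   # tokenize incoming {"text": ...} tasks
    return {
        "dataset": dataset,                   # dataset the annotations are saved to
        "stream": stream,                     # iterable of annotation tasks
        "view_id": "ner_manual",              # manual span highlighting interface
        "config": {"labels": ["JOB_TITLE", "LOCATION", "RESPONSIBILITIES"]},
    }
```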

That said, if you do find that Prodigy just isn't the right tool for you, we're happy to issue a refund. It's a pretty specific developer tool for a pretty specific use case, and we believe in keeping the scope focused. So Prodigy can't be the right tool for everything and everyone – and that's okay 🙂

How might I request a refund? I think the platform has incredible potential, but I would prefer to stick with a solution that allows me to train NER models at scale, rather than by clicking in an interface, word for word.

We're always happy to provide a full refund if you email us at contact@explosion.ai.

The goal of Prodigy is to let you annotate your own data in order to increase the data quality, which in turn should improve the quality of any model trained on it. This annotation does involve manual effort, but there are some recipes that let you label everything with keyboard shortcuts. If you're interested, you might want to check out the ner.teach recipe: it gives you a binary accept/reject interface that's powered by active learning.
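
For instance, something like the following starts ner.teach with a suggested label (the dataset name, model and source file are placeholders; in recent Prodigy versions prodigy.serve accepts the full command string, and the terminal equivalent is shown in the comment):

```python
import prodigy

# Equivalent terminal command:
#   prodigy ner.teach job_ner en_core_web_sm ./job_descriptions.jsonl --label JOB_TITLE
# Dataset name, model and file here are placeholders for your own setup.
prodigy.serve(
    "ner.teach job_ner en_core_web_sm ./job_descriptions.jsonl --label JOB_TITLE",
    port=8080,
)
```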