spancat with really large spans? (Identify sections in text)

I'm working with plain text job posts. These usually consist of several sections, like

  • COMPANY: description of the company and their mission
  • TASK: description of the role to be filled
  • SKILLS: description of the required skills (hard skills, soft skills, educational background, required certifications etc.)

Here's a pretty common example:

Very often, the categories correspond, like in the above example, to large sections of the text (meaning something around 20 % of the total document per section). Sometimes, though, not all of them are present. And sometimes, they are also a little mixed up -- say, two sentences of SKILLS here, then a paragraph of TASK there.

I figured a span categorizer would work best for this task, because categorization depends strongly on the surrounding context: one has to use a fairly large window of words around the beginning and end of a span to determine whether the boundary is correct, as well as which category the span belongs to.
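
For reference, the relevant part of a spaCy v3 config for such a pipeline might look roughly like this (a sketch; `sections` is just an illustrative spans key). The default n-gram suggester enumerates every candidate span of the listed token lengths, so covering section-sized spans means listing sizes in the hundreds, and the number of candidate spans the scorer has to process per document grows accordingly -- which would be consistent with the out-of-memory behaviour described below:

```ini
[components.spancat]
factory = "spancat"
spans_key = "sections"

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
# Every n-gram of these token lengths becomes a candidate span.
# Section-sized spans would need sizes in the hundreds, multiplying
# the number of candidates (and the memory footprint) per document.
sizes = [1, 2, 3]
```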

The problem I'm running into, however, is that training consistently crashes. If I train on the CPU, the process gets killed (out of memory). Even using a cloud machine with 200+ GB of RAM does not change this.

And if I train using the GPU, even with 80 GB of GPU-RAM, I get

CUDARuntimeError('cudaErrorIllegalAddress: an illegal memory access was encountered')

Based on what I found on this forum, I believe that perhaps my spans are just too large?

Is there a better approach I can take?

The essential problem I am trying to solve is for the model to reliably answer the question:

Give me everything that is being said in this text about TASK (i.e. what the person doing this job is going to be doing). And then give me everything that is being said in this text about SKILLS (i.e. what skills the company believes an applicant should have to perform in this role).

It's not an actual requirement that these be coherent sections of text, or that they don't overlap. I just tried it this way because I thought it was the easiest way to do the annotations, as well as the easiest way for the model to learn. At least with the latter, it seems, I was wrong.

Can you recommend a better approach?

Thank you.


Hey @leobg
I've also been facing this issue with spancat for a long time, but I think even @ines doesn't have a solution for this :sweat_smile:


I wonder ... if the spans you're trying to detect are sometimes full sentences ... might it be easier to turn the problem into a classification problem instead? Spancat is indeed designed to handle longer spans than NER, but spans the size of multiple sentences are pushing it.

Thanks @koaning.

Yes, I thought about turning the problem into a classification problem as well.

I see two downsides:

  1. It makes annotation harder for me as the human in the loop. Selecting three large sections per job post is easy. Annotating sentences one by one is hard.

  2. Whether something is a skill that the employer requires, or merely a description of the job, sometimes cannot be determined from the structure or content of the sentence in and of itself. That information sometimes really just comes from where the sentence is located in the overall job post.

That doesn't mean it wouldn't be a viable way.
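
That location signal wouldn't have to be lost in a classification setup, though: the paragraph's relative position in the post could be fed to the classifier alongside its words. A minimal sketch in plain Python (the helper name and feature layout are my own, not an existing API):

```python
def paragraph_features(text):
    """Split a job post into paragraphs (on blank lines) and attach a
    relative-position feature to each, so a downstream classifier can
    use both the words and where the paragraph sits in the document."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    total = len(paragraphs)
    return [
        {
            "text": p,
            # 0.0 = start of the post, 1.0 = end of the post
            "rel_position": i / (total - 1) if total > 1 else 0.0,
        }
        for i, p in enumerate(paragraphs)
    ]

post = "About us: we build rockets.\n\nYour tasks: launch rockets.\n\nRequired: a physics degree."
feats = paragraph_features(post)
```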

I've also thought about perhaps training the span cat just on the boundaries between my sections. Essentially asking,

What is the first sentence of TASK, if any?
What is the first sentence of SKILL, if any?

What do you think of that?

I'd be happy to try other approaches. But I wanted to first make sure that the training problems I ran into really are due to the length of my span annotations. And also that there isn't any "best practices" workaround for dealing with large spans.

Perhaps @ines has some thoughts on this?

Thank you all.

I found a dialogue on the forum that might be inspirational here.

It's a different problem, but it highlights another two-step approach to rethinking spans.

That said, reading your reply still makes me think that textcat might be the simplest way forward, albeit on paragraphs instead of sentences. While I like your idea of using NER to detect the start of a section, I wonder if you might be able to leverage the fact that a section always starts on a newline, which suggests a heuristic might work better than an ML model.
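
To make the newline heuristic concrete: since section headers in plain-text job posts typically sit on their own line (often in caps, or ending in a colon), a few lines of Python can flag candidate section starts without any model. The header patterns here are assumptions about what the posts look like; they would need tuning on real data:

```python
import re

# A line is treated as a likely section header if it is ALL CAPS
# or ends in a colon -- an assumption; adjust the patterns as needed.
HEADER = re.compile(r"^([A-Z][A-Z /&-]{2,}|.{1,60}:)\s*$")

def find_section_starts(text):
    """Return (line_number, line_text) for lines that look like section headers."""
    starts = []
    for i, line in enumerate(text.splitlines()):
        stripped = line.strip()
        if stripped and HEADER.match(stripped):
            starts.append((i, stripped))
    return starts

post = "ACME Corp is hiring.\n\nYOUR TASKS\n- build things\n\nRequired skills:\n- Python"
```

Sections would then run from one detected header to the next, and a classifier (or the header text itself) can label each block.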

textcat might be the simplest way forward, albeit on paragraphs instead of sentences

Interesting. So not categorize sentences, but paragraphs. I like that idea!

I actually did something like that on a similar project, where I had to segment court rulings into the factual and the legal part. I got pretty good results with it, even though I only used a "dumb" bag-of-words type of fastText classifier.

Thanks @koaning for looking around and finding this!

BTW, off-topic... But since you are the guys behind spaCy and prodigy, shouldn't this forum be supercharged with some kind of AI assistant that automatically posts suggestions like you just did?

I was just thinking that you guys probably prefer spending your time coding rather than thinking about other people's problems -- especially when those problems have already been answered in the past.

I know that Discourse already does some crude form of suggesting existing topics. But understanding a question semantically, and fetching not just a matching thread from the past, but also the most suitable post and paragraph from that thread as potential answer would be one step further.

Where else should this exist if not here! :rocket:

shouldn't this forum be supercharged with some kind of AI assistant

That's one angle to think about it, but we also really like to be involved. The forum doesn't just offer users a way to get solutions to their problems; it also gives us meaningful feedback and can even help us understand which features our products are missing. Keeping a human in the loop makes it a better experience for both sides. I also think there's a real risk of building a bot that makes the experience much worse.

Related: are you aware of the spaCy discussions board? This forum is mostly meant for Prodigy questions; ever since GitHub released the discussions feature, we've moved some of the spaCy conversations there.

"dumb" bag-of-words type of fastText classifier.

Bag-of-words models are always a good benchmark to have around, so I wouldn't call them "dumb" :wink: . You might also want to have a look at the scikit-learn ecosystem, since it offers tf/idf tricks as well. I should admit, though, that the bag-of-words approach might not work for every language out there, and English does seem to be one of the "easier" languages for this.
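
To illustrate the tf/idf idea, here is a bare-bones sketch in plain Python (in practice scikit-learn's `TfidfVectorizer` does this with many refinements; the idf variant below, log(N/df) + 1, is just one common choice):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute a simple tf-idf weight per (document, term).
    tf = raw count of the term in the document;
    idf = log(N / df) + 1, so terms appearing in every document
    get weight ~tf, while rare terms get boosted."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency: one count per doc
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({
            term: count * (math.log(n / df[term]) + 1)
            for term, count in tf.items()
        })
    return weights

docs = ["python required", "python preferred", "team player"]
w = tfidf(docs)
```

In this toy corpus, "required" appears in only one document while "python" appears in two, so "required" ends up with the higher weight in the first document.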



Any news on this issue? I want to do more or less the same. Instead of identifying the whole skills or tasks paragraph from a job ad, I’d like to identify every single task. However, my training process gets killed too.

We have already collected 1000 annotated job ads using the Prodigy spancat recipe. Do you know any way to transform these annotations into BIO NER annotations so that I can cast this as a token classification problem? We did this in the past and it worked - at least it didn’t crash.
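
For the conversion itself, Prodigy's spancat annotations store token-aligned spans (`token_start`, `token_end`, `label`), so mapping them to BIO tags is mostly mechanical. A sketch, assuming non-overlapping spans and the usual Prodigy JSONL fields (note that `token_end` is inclusive in Prodigy):

```python
def spans_to_bio(example):
    """Convert one Prodigy-style example with token-aligned,
    non-overlapping spans into a BIO tag sequence:
    B-LABEL on the first token of a span, I-LABEL on the rest, O elsewhere."""
    tags = ["O"] * len(example["tokens"])
    for span in example.get("spans", []):
        start, end, label = span["token_start"], span["token_end"], span["label"]
        tags[start] = f"B-{label}"
        for i in range(start + 1, end + 1):  # token_end is inclusive
            tags[i] = f"I-{label}"
    return tags

example = {
    "tokens": [{"text": t} for t in ["You", "will", "write", "Python", "daily"]],
    "spans": [{"token_start": 2, "token_end": 4, "label": "TASK"}],
}
```

Overlapping spans (which spancat allows but BIO cannot represent) would need to be resolved first, e.g. by keeping the longest span per region.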


Hi Oliver.

Could you share more details on your CUDA error? How large are your documents? What base model are you using? I certainly wouldn't mind understanding this issue better. If you have any details on your hardware, that would also help.

I'm not familiar with BIO NER, so I can't provide much help there.

Could you share the train command that you've tried? Related: what happens when you train using a non-transformer model?