spancat with really large spans? (Identify sections in text)

I'm working with plain text job posts. These usually consist of several sections, like

  • COMPANY: description of the company and their mission
  • TASK: description of the role to be filled
  • SKILLS: description of the required skills (hard skills, soft skills, educational background, required certifications etc.)

Here's a pretty common example:

[example job post omitted]

Very often, the categories correspond, as in the example above, to large sections of the text (roughly 20% of the total document per section). Sometimes, though, not all of them are present. And sometimes they are a little mixed up -- say, two sentences of SKILLS here, then a paragraph of TASK there.

I figured a span categorizer would work best for this task, because categorization depends strongly on the surrounding context: one has to look at a fairly large window of words around the beginning and end of a span to determine whether it is a proper boundary, as well as which category the span belongs to.

The problem I'm running into, however, is that training consistently crashes. If I train on the CPU, the process gets killed (out of memory). Even using a cloud machine with 200+ GB of RAM does not change this.

And if I train on the GPU, even with 80 GB of GPU RAM, I get

CUDARuntimeError('cudaErrorIllegalAddress: an illegal memory access was encountered')

Based on what I found on this forum, I believe that perhaps my spans are just too large?

Is there a better approach I can take?

The essential problem I am trying to solve is for the model to reliably answer the question:

Give me everything that is being said in this text about TASK (i.e. what the person doing this job is going to be doing). And then give me everything that is being said in this text about SKILLS (i.e. what skills the company believes an applicant should have to perform in this role).

It's not an actual requirement that these be coherent sections of text, or that they don't overlap. I just tried it this way because I thought it was the easiest way to do the annotations, as well as the easiest way for the model to learn. At least with the latter, it seems, I was wrong.

Can you recommend a better approach?

Thank you.


Hey @leobg,
I've also been facing this issue with spancat for a long time, but I think even @ines doesn't have a solution for it :sweat_smile:


I wonder ... if the spans you're trying to detect are sometimes full sentences ... might it be easier to turn the problem into a classification problem instead? Spancat is indeed designed to handle longer spans than NER, but spans the size of multiple sentences are pushing it.
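
For context: spancat's default ngram suggester enumerates every candidate span up to a fixed length, so covering sections that run to hundreds of tokens means every one of those candidates has to be embedded and scored. A rough back-of-the-envelope sketch (the document length and suggester sizes here are made up):

    # Candidates the default ngram suggester would generate for one document:
    # roughly one span per start position per size.
    n_tokens = 500
    sizes = range(1, 201)  # needed to cover spans of up to ~200 tokens
    n_candidates = sum(max(n_tokens - size + 1, 0) for size in sizes)
    print(n_candidates)    # 80100 candidate spans for a single document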

Thanks @koaning.

Yes, I thought about turning the problem into a classification problem as well.

I see two downsides:

  1. It makes annotation harder for me as the human in the loop. Selecting three large sections per job post is easy. Annotating sentences one by one is hard.

  2. Whether something is a skill the employer requires or merely a description of the job sometimes cannot be determined from the structure or content of the sentence itself. That information often comes from where the sentence is located in the overall job post.

That doesn't mean it wouldn't be a viable way.

I've also thought about perhaps training the spancat just on the boundaries between my sections. Essentially asking:

What is the first sentence of TASK, if any?
What is the first sentence of SKILLS, if any?
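
Concretely, I imagine deriving that training data from my existing annotations along these lines -- just a sketch, assuming the docs have sentence boundaries set and the section spans live under the "sc" key:

    from spacy.tokens import Span

    def first_sentence_spans(doc, spans_key="sc"):
        """Shrink each full-section span down to its first sentence,
        so the model only has to learn where a section starts."""
        shrunk = []
        for span in doc.spans[spans_key]:
            sent = next(span.sents)  # first sentence overlapping the span
            start = max(sent.start, span.start)
            end = min(sent.end, span.end)
            shrunk.append(Span(doc, start, end, label=span.label_))
        doc.spans[spans_key] = shrunk
        return doc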

What do you think of that?

I'd be happy to try other approaches. But I wanted to first make sure that the training problems I ran into really are due to the length of my span annotations. And also that there isn't any "best practices" workaround for dealing with large spans.

Perhaps @ines has some thoughts on this?

Thank you all.

I found a dialogue on the forum that might be inspirational here.

It's a different problem, but it highlights another two-step approach to rethinking spans.

That said, reading your reply still makes me think that textcat might be the simplest way forward, albeit on paragraphs instead of sentences. While I like your idea of using NER to detect the start of a section, I wonder if you might be able to leverage the fact that a section always starts on a newline, which suggests a heuristic might be better than an ML model.
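
As a sketch of what I mean (the heading patterns below are made up and would need tuning to your actual job posts):

    import re

    # Hypothetical heading patterns -- tune these to your data.
    SECTION_HEADINGS = {
        "COMPANY": re.compile(r"^(about us|who we are|the company)\b", re.I),
        "TASK": re.compile(r"^(your (tasks|responsibilities)|the role)\b", re.I),
        "SKILLS": re.compile(r"^(your profile|requirements|skills)\b", re.I),
    }

    def split_sections(text):
        """Assign each paragraph to the most recently seen section heading."""
        sections, current = {}, None
        for para in text.split("\n\n"):
            first_line = para.strip().splitlines()[0] if para.strip() else ""
            for label, pattern in SECTION_HEADINGS.items():
                if pattern.match(first_line):
                    current = label
                    break
            if current is not None:
                sections.setdefault(current, []).append(para)
        return sections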

textcat might be the simplest way forward, albeit on paragraphs instead of sentences

Interesting. So not categorize sentences, but paragraphs. I like that idea!

I actually did something like that on a similar project, where I had to segment court rulings into the factual and the legal part. I got pretty good results with it, even though I only used a "dumb" bag-of-words type of fastText classifier.
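
In fastText terms, that kind of setup is only a few lines. A sketch (the file name and label scheme are placeholders):

    import fasttext  # pip install fasttext

    # Training file: one paragraph per line, prefixed with its label, e.g.
    # "__label__FACTS The plaintiff filed suit on ..."
    model = fasttext.train_supervised(input="paragraphs.train.txt", wordNgrams=2)
    print(model.predict("The court holds that the contract is void."))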

Thanks @koaning for looking around and finding this!

BTW, off-topic... but since you are the guys behind spaCy and Prodigy, shouldn't this forum be supercharged with some kind of AI assistant that automatically posts suggestions like you just did?

I was just thinking that you guys probably prefer spending your time coding rather than thinking about other people's problems -- especially when those problems have already been answered in the past.

I know that Discourse already does some crude form of suggesting existing topics. But understanding a question semantically, and fetching not just a matching thread from the past but also the most suitable post and paragraph from that thread as a potential answer, would be one step further.

Where else should this exist, if not here? :rocket:

shouldn't this forum be supercharged with some kind of AI assistant

That's one angle to think about it, but we also really like to be involved. The forum doesn't just offer a way for users to get solutions to their problems; it also gives us meaningful feedback and might even help us understand missing features of our products. Keeping a human in the loop makes for a better experience on both sides. I also think there's a real risk of building a bot that makes the experience much worse.

Related: are you aware of the spaCy discussions board? This forum is mostly meant for Prodigy questions; ever since GitHub released the discussions feature, we've moved some of the spaCy conversations there.

"dumb" bag-of-words type of fastText classifier.

Bag-of-words models are always a good benchmark to have around, so I wouldn't call them "dumb" :wink: . You might also want to have a look at the scikit-learn ecosystem, since it offers tf-idf tricks too. I should admit, though, that the bag-of-words approach might not work for every language out there, and English does seem to be one of the "easier" languages for this.
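
A minimal sketch of such a baseline with scikit-learn (the paragraphs and labels below are toy data):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy examples -- real training data would come from your annotations.
    paragraphs = [
        "We are a fast-growing logistics company with a global mission.",
        "You will design and maintain our data pipelines.",
        "You have a degree in computer science and solid Python skills.",
    ]
    labels = ["COMPANY", "TASK", "SKILLS"]

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    model.fit(paragraphs, labels)
    print(model.predict(["Strong communication skills and 3+ years of SQL."]))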


Hi,

Any news on this issue? I want to do more or less the same. Instead of identifying the whole skills or tasks paragraph from a job ad, I’d like to identify every single task. However, my training process gets killed too.

We have already collected 1,000 annotated job ads using the Prodigy spancat recipe. Do you know of any way to transform these annotations into BIO NER annotations, so that I can cast this as a token classification problem? We did this in the past and it worked -- at least it didn't crash.
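
Something like this is what I have in mind -- a rough sketch assuming Prodigy's standard JSONL output with token-aligned spans (the file name is a placeholder, and overlapping spans would need extra handling):

    import srsly  # ships with spaCy/Prodigy

    def spans_to_bio(example):
        """Convert one Prodigy span annotation into per-token BIO tags."""
        tags = ["O"] * len(example["tokens"])
        for span in example.get("spans", []):
            start, end = span["token_start"], span["token_end"]  # inclusive
            tags[start] = f"B-{span['label']}"
            for i in range(start + 1, end + 1):
                tags[i] = f"I-{span['label']}"
        return [tok["text"] for tok in example["tokens"]], tags

    # Export with "prodigy db-out your-dataset > annotations.jsonl" first.
    for eg in srsly.read_jsonl("annotations.jsonl"):
        words, tags = spans_to_bio(eg)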

Best,
Oliver

Hi Oliver.

Could you share more details on your CUDA error? How large are your documents? What base model are you using? I certainly wouldn't mind understanding this issue better. If you have any details on your hardware, that would also help.

I'm not familiar with BIO NER, so I can't provide much help there.

Could you share the train command that you've tried? Related: what happens when you train using a non-transformer model?
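
For example, something along these lines, with and without the GPU flag (paths are placeholders):

    python -m spacy train config.cfg --output ./output --gpu-id 0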