Using the NER_manual interface to annotate text classification

Hi!

I'm trying to classify different paragraphs from a document into different labels. There are no overlapping labels.

I have a large dataset and there is no clear starting or ending for these paragraphs so it would be very difficult to separate the document into paragraphs and then do a textcat.manual annotation.

I was wondering if I could use something like a NER.manual annotation interface to do a text classification annotation, where I can just highlight longer paragraphs from document?

I tried changing the view-id on the textcat.manual recipe to the ner_manual annotation interface but it didn't work.

Or can I just highlight those paragraphs using the ner.manual recipe and then train it using the spacy textcat pipeline? Would the dataset be compatible?

hi @ChinmayR!

Great questions!

Can you provide an example or more details on the type of text you're analyzing? What are its attributes, such as how punctuation and grammar are used?

For example, two types that have issues like this could be manually written notes (e.g., call center) or audio-to-text transcripts (but those usually have speaker breaks).

Also, how was it created? Sometimes how the text was written can be a helpful demarcation for breaking things up (e.g., only text written by person X vs. Y).

You mention you have paragraphs. I think of paragraphs as indicated by at least a new line break, maybe a tab. So I'm a little confused why you can't use that. Because otherwise you would have just 1 long paragraph / stream of text, not paragraphs, right?

One option is to create your own custom sentence segmentation model. This could work if there are maybe other artifacts like punctuation or symbols that can be used instead of periods. Once you do this, you can train a custom segmenter. The segments don't need to be actual sentences -- for example, I've used this on regulatory text where the mark tokens are different characters like "(a), (iii), or (IV)".
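As a rough illustration of segmenting on markers instead of periods, here is a minimal rule-based sketch using only Python's `re` module. It is not a trained segmenter: the `MARKER` pattern and the `split_on_markers` helper are hypothetical names made up for this example, and the pattern only covers simple enumerators like the ones mentioned above.

```python
import re

# Hypothetical marker pattern: a parenthesised enumerator such as (a), (iii) or (IV).
MARKER = re.compile(r"\((?:[a-z]|[ivxlc]+|[IVXLC]+)\)")

def split_on_markers(text):
    """Split text into segments that each begin at a marker."""
    starts = [m.start() for m in MARKER.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first marker
    bounds = starts + [len(text)]
    segments = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in segments if s]

sample = "(a) Scope of the rule. (iii) Exceptions apply (IV) Penalties for breach"
segments = split_on_markers(sample)
for segment in segments:
    print(segment)
```

Rules like this can also be a quick way to pre-segment data before annotating it for a trained segmenter.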

I'm not sure I understand what you're trying to accomplish with this. Initially, I thought why not spancat? But I remember we had previously discussed spancat so I suspect you've ruled it out.

Phrased differently, can you describe what would be the perfect model you'd want? It sounds like you want something that will classify very long segments of text (e.g., equivalent of 2-5 sentences worth of words, right?).

One helpful quote from the textcat docs:

However, if you have an annotation task where the annotator really needs to see the whole document to make a decision, that’s often a sign that your text classification model might struggle. Current technologies struggle to put together information across sentences in complex ways.

Even if you get long label spans across many words (many sentences' worth of words), I think ner or spancat models would struggle. Therefore, you may be better off just using textcat (but you would still need to break the text up in some way rather than feeding in one huge stream).

My hunch is that you may be doing this as a way to accomplish two subtasks simultaneously: classification and segmentation.

If this is true, then I'd recommend breaking it up into two tasks/models:

  1. sentence segmentation model
  2. text classification model

This way, if you think carefully about a good annotation scheme for segmenting (step 1), then when you get to step 2 (text classification) it's much easier to make the categorization decision (and likely way faster!). I would also expect better performance, as you can optimize each of the two models separately, whereas if you try to combine both tasks into one model, you may not get the same performance.

No, not off-the-shelf. The ner.manual recipe will produce spans (see its output), but for training text classification (i.e., TextCategorizer) you need labels. This could perhaps work if you wrote a custom Python script to convert the data.
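A minimal sketch of what such a conversion script might look like, using plain Python dicts. The `spans_to_textcat` helper is hypothetical, and the field names (`text`, `spans`, `start`, `end`, `label`, `answer`) are assumptions based on the usual JSONL task layout, so check them against what your own dataset export actually contains before relying on this.

```python
def spans_to_textcat(example):
    """Turn one span-annotated example into one text/label example per span."""
    text = example["text"]
    converted = []
    for span in example.get("spans", []):
        converted.append({
            "text": text[span["start"]:span["end"]],  # highlighted characters only
            "label": span["label"],
            "answer": "accept",
        })
    return converted

# Made-up annotation for illustration; offsets are character positions.
annotated = {
    "text": "We are a fintech startup. You must know Python.",
    "spans": [
        {"start": 0, "end": 25, "label": "COMPANY"},
        {"start": 26, "end": 47, "label": "REQUIREMENTS"},
    ],
    "answer": "accept",
}

textcat_examples = spans_to_textcat(annotated)
for ex in textcat_examples:
    print(ex["label"], "->", ex["text"])
```

Each highlighted span becomes its own short text with a single label. Depending on how you train, you may need a different target layout (for example, one with a list of options and accepted choices), so treat the output shape here as a starting point.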

I hope this helps - but otherwise, the best I can recommend is you experiment! I bet you could try out 2-3 of these annotation schemes quickly and find through trial-and-error which best accomplishes your goal.

Hey Ryan, thanks again for helping me out! :smile:

What I'm trying to do is break down a job description into different sections, such as role, company description, job (the work the candidate will do), and requirements.

The problem is that, because the data is scraped and job descriptions are so varied and unstructured, the sections don't sit together in one continuous paragraph. There are random line breaks, missing punctuation marks, and often no clear indication that a new section has started. However, different classifications are never mixed within a paragraph or sentence, if that makes sense: the description will always talk about the job first, then the requirements, or vice versa, and almost never job, requirements, then job again.

This is how the data can look, more times than not.


I tried a spancat initially but it didn't work well, as it only extracted spans of at most 3 words. After seeking help here, I learned that I needed to increase the size of the suggested n-grams, but doing so quickly filled up the GPU memory during training. I could only go up to a maximum n-gram size of 15.

It actually doesn't need the whole document to make a decision, I'm just not sure how to break my document down.

This is something I would ideally like to do. Actually, I have a question: if I did use a sentence segmenter and split my job description into sentences, do you reckon a textcat would be able to classify those sentences? And here's my main question: after I have trained a textcat successfully, if I give it a complete job description, would it be able to identify the different sections of the description like a spancat? Or would I always need to segment into sentences to use the textcat?

Am I going in the right direction with what I want to accomplish?

hi @ChinmayR!

Thanks for the context. Sorry for the delay. I don't have much more to add. I think you're definitely moving in the right direction.

Could you partition (segment) the text by line breaks? These examples seem to have runs of several consecutive line breaks (either four or five) between sections. I suspect you've already tried this as it's the obvious approach.
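A minimal sketch of that kind of split, assuming sections are separated by runs of consecutive line breaks. The `split_on_blank_lines` helper and its `min_breaks` threshold are hypothetical; you would tune the threshold to whatever your scraped data actually shows.

```python
import re

def split_on_blank_lines(text, min_breaks=2):
    """Split wherever at least `min_breaks` line breaks occur in a row."""
    # Spaces or tabs on otherwise empty lines are tolerated between the breaks.
    pattern = r"(?:[ \t]*\r?\n){%d,}" % min_breaks
    return [part.strip() for part in re.split(pattern, text) if part.strip()]

# Made-up job description with four and five line breaks between sections.
raw = "Role: Engineer\n\n\n\nAbout us\nWe build things.\n\n\n\n\nRequirements: Python"
parts = split_on_blank_lines(raw, min_breaks=4)
for part in parts:
    print(repr(part))
```

Single line breaks inside a section are kept, so a section that wraps over several lines stays in one piece.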

I would expect that textcat would work much better on segmented data than the raw long text.
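On getting sections back out of a full job description: textcat assigns categories to one text at a time, so you would still segment first, classify each segment, and then stitch neighbouring segments with the same label back together. Here is a minimal sketch of that last stitching step only. The `merge_sections` helper and the labels are hypothetical, and the input stands in for whatever per-segment predictions your model produces.

```python
from itertools import groupby

def merge_sections(labelled_segments):
    """Join consecutive segments that share a label into one section."""
    sections = []
    for label, group in groupby(labelled_segments, key=lambda pair: pair[0]):
        sections.append((label, " ".join(text for _, text in group)))
    return sections

# Made-up (label, text) pairs, in document order.
labelled = [
    ("JOB", "Build data pipelines."),
    ("JOB", "Maintain APIs."),
    ("REQUIREMENTS", "3 years of Python."),
]

sections = merge_sections(labelled)
for label, text in sections:
    print(label, "->", text)
```

Since the sections in your data rarely interleave, merging adjacent same-label segments like this should usually give one block per section.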

Yes - I still would expect that spancat could also work on the shorter text (so long as your segments don't separate out context that would be needed for the spancat).

One last item: I would also recommend adding rules either on the front end (e.g., if you can exploit some logic you know) or at the end to extract values after identifying spans. For the extraction at the end, I've seen a few information extraction projects that used spancat plus the rule-based matcher. The spancat identifies spans that provide a general description (e.g., "salaries range from $100,000 to $150,000"), and matcher rules then extract the information that's needed ("$100,000" and "$150,000") to populate a range like [100000,150000]. Alternatively, you could use regex to extract sub-information from spans, but I find the matcher is more flexible.
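To make the extraction step concrete, here is a minimal sketch of the regex alternative (not the matcher), using only Python's `re` module. The `AMOUNT` pattern and the `extract_amounts` helper are hypothetical and only handle dollar amounts written with digits and optional thousands separators.

```python
import re

# Hypothetical pattern: a dollar sign followed by digits and optional commas.
AMOUNT = re.compile(r"\$(\d[\d,]*)")

def extract_amounts(span_text):
    """Return every dollar amount in the span text as an integer."""
    return [int(m.group(1).replace(",", "")) for m in AMOUNT.finditer(span_text)]

span_text = "salaries range from $100,000 to $150,000"
salary_range = extract_amounts(span_text)
print(salary_range)
```

This only runs on text that the span model has already picked out, which keeps the rule simple: it does not need to decide whether a dollar amount is a salary.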

Thanks for the help Ryan! I really appreciate it.