Using the NER_manual interface to annotate text classification

ryanwesslen · September 9, 2022, 8:48pm

Great questions!

Can you provide an example or more details on the type of text you're analyzing? What are its attributes, such as how it uses punctuation and grammar?

For example, two types that have issues like this could be manually written notes (e.g., call center) or audio-to-text transcripts (but those usually have speaker breaks).

Also, how was it created? Sometimes the way the text was written gives you a helpful way to break it up (e.g., text written by person X vs. person Y).

You mention you have paragraphs. I think of paragraphs as indicated by at least a new line break, maybe a tab. So I'm a little confused why you can't use that. Because otherwise you would have just 1 long paragraph / stream of text, not paragraphs, right?

One option is to create your own custom sentence segmentation model. This could work if there are other artifacts, like punctuation or symbols, that can serve as boundaries instead of periods. Once you've identified them, you can train a custom segmenter. The segments don't need to be actual sentences -- for example, I've used this on regulatory text where the boundary tokens are markers like "(a)", "(iii)", or "(IV)".
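To make the idea concrete, here is a minimal rule-based sketch that splits text on enumerator markers like the ones above. It's a stand-in for illustration, not a trained segmenter: the MARKER pattern, the split_on_markers name, and the sample text are all made up, and you'd swap in whatever boundary artifacts your own data has. Rules like this can also be a quick way to pre-segment text before you annotate boundaries for a trainable component (e.g., spaCy's senter).

```python
import re

# Hypothetical marker pattern: enumerators such as "(a)", "(iii)" or "(IV)".
# Adjust it to whatever artifacts your own text actually contains.
MARKER = re.compile(r"\(([a-z]+|[A-Z]+|\d+)\)")


def split_on_markers(text):
    """Split text into segments, each one starting at an enumerator marker."""
    starts = [m.start() for m in MARKER.finditer(text)]
    # Keep any text that precedes the first marker as its own segment.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    segments = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    # Drop empty segments (e.g., from empty input).
    return [s for s in segments if s]


# Made-up sample text with no periods between the enumerated items.
text = (
    "General rule applies. (a) Banks must report monthly "
    "(b) Exceptions need approval (iii) See appendix"
)

for segment in split_on_markers(text):
    print(segment)
```

Each printed line is one segment you could then send to annotation or to a classifier.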

I'm not sure I understand what you're trying to accomplish with this. Initially, I thought: why not spancat? But I remember we previously discussed spancat, so I suspect you've ruled it out.

Phrased differently, can you describe what would be the perfect model you'd want? It sounds like you want something that will classify very long segments of text (e.g., equivalent of 2-5 sentences worth of words, right?).

One helpful quote from the textcat docs:

However, if you have an annotation task where the annotator really needs to see the whole document to make a decision, that’s often a sign that your text classification model might struggle. Current technologies struggle to put together information across sentences in complex ways.

Even if you get long labeled spans across many words (many sentences' worth), I think ner or spancat models would struggle. Therefore, you may be better off just using textcat (but you'd still need to break the text up in some way instead of treating it as one huge stream).

My hunch is that you may be doing this as a way to accomplish two subtasks simultaneously: classification and segmentation.

If this is true, then I'd recommend breaking it up into two tasks/models:

1. sentence segmentation model
2. text classification model

This way, if you think carefully about a good annotation scheme for segmenting (step 1), then when you get to step 2 (text classification) the categorization decision is much quicker and easier to make. I would also expect better performance, since you can optimize each of the two models separately, whereas if you try to combine both tasks into one model, you may not get the same performance.

No, not off-the-shelf. The ner.manual recipe produces spans (see its output), but to train a text classifier (i.e., TextCategorizer) you need labels on whole texts. This could work if you wrote a custom Python script to convert the data.
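As a rough idea of what such a conversion script could look like, here's a minimal sketch. It assumes each annotated example has a "text" plus a list of "spans" with character "start"/"end" offsets and a "label", and it writes one text-plus-label record per span. The function name, the sample example, and the output record shape are my assumptions, so check them against your actual exported data before relying on it.

```python
import json


def spans_to_textcat(example):
    """Turn one span-annotated example into one textcat-style record per span."""
    text = example["text"]
    records = []
    for span in example.get("spans", []):
        records.append({
            # The span's characters become the text to classify.
            "text": text[span["start"]:span["end"]],
            # The span label becomes the category label.
            "label": span["label"],
            "answer": "accept",
        })
    return records


# Made-up annotation in the shape ner.manual output is assumed to have.
example = {
    "text": "Thanks for calling. My card was declined twice.",
    "spans": [
        {"start": 0, "end": 19, "label": "GREETING"},
        {"start": 20, "end": 47, "label": "COMPLAINT"},
    ],
}

# Print one JSON object per line (JSONL).
for record in spans_to_textcat(example):
    print(json.dumps(record))
```

Note that text outside any span is simply dropped here; you'd need to decide whether unlabeled stretches should become their own category or be ignored.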

I hope this helps - but otherwise, the best I can recommend is you experiment! I bet you could try out 2-3 of these annotation schemes quickly and find through trial-and-error which best accomplishes your goal.
