hi @ChinmayR!
Great questions!
Can you provide an example or more details on the type of text you're analyzing? What are its attributes, like how it uses punctuation and grammar?
For example, two types of text that tend to have issues like this are manually written notes (e.g., call center notes) or audio-to-text transcripts (though those usually have speaker breaks).
Also, how was it created? Sometimes that can provide a helpful demarcation for breaking things up (e.g., split by who wrote it: only text written by person X vs. person Y).
You mention you have paragraphs. I think of paragraphs as indicated by at least a newline, maybe a tab, so I'm a little confused about why you can't use that. Otherwise you would have just one long paragraph / stream of text, not paragraphs, right?
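If newlines really are there, something trivial may be all you need. A quick sketch, assuming paragraphs are separated by blank lines (adjust the delimiter if yours uses single newlines or tabs; `notes.txt` is a placeholder path):

```python
# Split a raw text file into paragraphs on blank lines.
# Assumes "notes.txt" exists; swap in your own file and delimiter.
with open("notes.txt") as f:
    text = f.read()

paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
print(f"{len(paragraphs)} paragraphs found")
```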
One option is to create your own custom sentence segmentation model. This could work if there are other artifacts -- punctuation or symbols -- that can be used instead of periods. Once you identify those, you can train a custom segmenter. The segments don't need to be actual sentences -- for example, I've used this on regulatory text where the marker tokens are enumeration characters like "(a)", "(iii)", or "(IV)".
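Here's a minimal rule-based sketch of that idea in spaCy, assuming enumeration markers like "(a)" open each segment. The component name and regex are my own placeholders, not a drop-in solution:

```python
import re
import spacy
from spacy.language import Language

# Markers like "a", "iii", "IV" (matched between parentheses below).
ENUM = re.compile(r"^(?:[a-z]{1,3}|[ivxlcdm]+|[IVXLCDM]+)$")

@Language.component("enum_segmenter")
def enum_segmenter(doc):
    # Default to "no boundary" everywhere (token 0 is always a sent start),
    # then open a new segment wherever an "(" + marker + ")" triple appears
    # (the default tokenizer splits "(a)" into three tokens).
    for token in doc[1:]:
        token.is_sent_start = False
    for i in range(1, len(doc) - 2):
        if doc[i].text == "(" and doc[i + 2].text == ")" and ENUM.match(doc[i + 1].text):
            doc[i].is_sent_start = True
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("enum_segmenter")

doc = nlp("(a) first requirement applies (b) second requirement applies")
print([sent.text for sent in doc.sents])
# ['(a) first requirement applies', '(b) second requirement applies']
```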
I'm not sure I understand what you're trying to accomplish with this. Initially, I thought: why not `spancat`? But I remember we had previously discussed `spancat`, so I suspect you've ruled it out.
Phrased differently, can you describe the perfect model you'd want? It sounds like you want something that will classify very long segments of text (e.g., the equivalent of 2-5 sentences' worth of words), right?
One helpful quote from the `textcat` docs:

> However, if you have an annotation task where the annotator really needs to see the whole document to make a decision, that’s often a sign that your text classification model might struggle. Current technologies struggle to put together information across sentences in complex ways.
Even if you get long labeled spans across many words (many sentences' worth), I think `ner` or `spancat` models would struggle anyway. Therefore, you may be better off just using `textcat` (but you still need to break the text up some way instead of feeding it one huge stream).
My hunch is that you may be doing this as a way to accomplish two subtasks simultaneously: classification and segmentation.
If this is true, then I'd recommend breaking it up into two tasks/models:
- sentence segmentation model
- text classification model
This way, if you think carefully about a good annotation scheme for segmenting (step 1), then when you get to step 2 (text classification) the categorization decision is much easier to make (and likely way faster!). I would also expect better performance, since you can optimize each of the two models separately, whereas if you try to combine both tasks into one model, you may not get the same results.
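At inference time you'd then chain the two. A hypothetical sketch (`my_segmenter` and `my_textcat` are placeholder names for your two trained pipelines, not real packages):

```python
import spacy

# Step 1 model: segments the raw stream. Step 2 model: classifies segments.
seg_nlp = spacy.load("my_segmenter")
cat_nlp = spacy.load("my_textcat")

def classify_stream(text):
    results = []
    for sent in seg_nlp(text).sents:   # step 1: segment
        doc = cat_nlp(sent.text)       # step 2: classify each segment
        label = max(doc.cats, key=doc.cats.get)  # highest-scoring category
        results.append((sent.text, label))
    return results
```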
No, not off-the-shelf. The `ner.manual` recipe produces `spans` (see its output), but to train a text classifier (i.e., `TextCategorizer`) you need `labels`. This could work if you wrote a custom Python script to convert the data, e.g., something like the sketch below.
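A rough conversion sketch, assuming your `ner.manual` annotations are exported as JSONL with `text` and `spans` (character offsets); the file names here are placeholders:

```python
import json

# Convert span annotations into textcat-style examples: one example per
# accepted span, using the span's label as the category. Assumes the
# typical JSONL export shape; verify against your own data first.
with open("spans_annotations.jsonl") as fin, open("textcat_data.jsonl", "w") as fout:
    for line in fin:
        example = json.loads(line)
        if example.get("answer") != "accept":
            continue  # skip rejected/ignored annotations
        for span in example.get("spans", []):
            segment = example["text"][span["start"]:span["end"]]
            fout.write(json.dumps({"text": segment, "label": span["label"]}) + "\n")
```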
I hope this helps - but otherwise, the best I can recommend is that you experiment! I bet you could quickly try out 2-3 of these annotation schemes and find through trial and error which one best accomplishes your goal.